Big Data Analytics for Wireless and Wired Network Design: A Survey

Currently, the world is witnessing a mounting avalanche of data due to the increasing number of mobile network subscribers, Internet websites, and online services. This trend continues to develop quickly and in diverse forms as big data. Big data analytics can process large amounts of raw data and extract useful, smaller-sized information, which different parties can use to make reliable decisions. In this paper, we conduct a survey on the role that big data analytics can play in the design of data communication networks. Integrating the latest advances that employ big data analytics with the network control/traffic layers might be the best way to build robust data communication networks with refined performance and intelligent features. First, the survey introduces the basic concepts, framework, and characteristics of big data. Second, we illustrate the main network design cycle employing big data analytics. This cycle represents the umbrella concept that unifies the surveyed topics. Third, there is a detailed review of the current academic and industrial efforts toward network design using big data analytics. Fourth, we identify the challenges confronting the utilization of big data analytics in network design. Finally, we highlight several future research directions. To the best of our knowledge, this is the first survey that addresses the use of big data analytics techniques for the design of a broad range of networks.


Introduction
Networks generate traffic in rapid, large, and diverse ways, which leads to an estimate of 2.5 exabytes created per day [1]. There are many contributors to the increasing size of the data. For instance, scientific experiments can generate large volumes of data, such as CERN's Large Hadron Collider (LHC), which generates over 40 petabytes each year [2]. Social media also has its share, with over 1 billion users spending an average of 2.5 hours daily liking, tweeting, posting, and sharing their interests on Facebook and Twitter [3]. Without a doubt, using this activity-generated data can affect many fields, such as intelligence, e-commerce, biomedicine, and data communication network design. However, harnessing the power of this data is not an easy task. To accommodate the data explosion, data centers are being built with massive storage and processing capabilities. An example is the National Security Agency (NSA) Utah data center, which can store up to 1 yottabyte of data [4] and has a processing power that exceeds 100 petaflops [5]. As the need grew to scale databases up to data volumes that exceeded the available processing and/or storage capabilities, systems that ran on computer clusters started to emerge. Perhaps the first milestone took place in June 1986, when Teradata [6] used the first parallel database system (hardware and software), with one terabyte of storage capacity, in the Kmart data warehouse to have all their business data saved and available for relational queries and business analysis [7,8]. Other examples include the Gamma system of the University of Wisconsin [9] and the GRACE system of the University of Tokyo [10].
In light of the above, the term "Big Data" emerged, and it can be defined as high-volume, high-velocity, and high-variety data that provides substantial opportunities for cost-effective decision-making and enhanced insight through advanced processing that extracts information and knowledge from data [11]. Another way to define big data is as the amount of data that is beyond the capabilities of traditional technology to store, manage, and process in an efficient and easy way [12]. Big data is already being employed by digital-born companies like Google and Amazon to support data-driven decisions [13]. It also helps in the development of smart cities and campuses [14], as well as in other fields like agriculture, healthcare, finance [15], and transportation [16]. Big data has the following characteristics:
1-Volume: This is a representation of the data size [17].
2-Variety: Generating data from a variety of sources results in a range of data types. These data types can be structured (e.g. e-mails), semi-structured (e.g. log file data from a webpage), unstructured (e.g. customer feedback), or hybrid data [18].
3-Velocity: An indication of the speed of the data when being generated, streamed, and aggregated [19]. It can also refer to the speed at which the data has to be analyzed to maintain relevance [17].
Depending on the research area and the problem space, other terms or Vs can be added. For example, is this data of any value? How long can we consider this data accurate and valid? Since we are conducting a survey, we find it compelling to briefly introduce other Vs as well. Typically, the number of Vs analyzed in a single paper is 3 to 7 (e.g. 6V+C [20], where C represents Complexity). However, different papers analyze different sets of Vs, and the union of all the Vs analyzed among all surveyed papers is 8V and a C, as shown in Table 1.

4-Value: A measure of data usefulness when it comes to decision making [19], or how much added value the collected data brings to the intended process, activity, or predictive analysis/hypothesis [21].
5-Veracity: Refers to the authenticity and trustworthiness of the collected data against unauthorized access and manipulation [21,22].
6-Volatility: An indication of the period in which the data can still be regarded as valid, and of how long that data should be kept and stored [23].
7-Validity: This might appear similar to veracity; however, the difference is that validity deals with data accuracy and correctness regarding the intended usage. Thus, certain data might be valid for one application but invalid for another.
8-Variability: This refers to the inconsistency of the data, which is due to the high number of distributed autonomous data sources [24]. Other researchers refer to variability as the consistency of the data over time [22].
9-Complexity: A measure of the degree of interdependence and interconnectedness in big data [20], such that a very small change that ripples across the system may have a substantial, low, or no effect on it [19]. Complexity can also be considered in terms of the relationship, correlation, and connectivity of data. It can further manifest in terms of multiple data linkages and hierarchies. Complexity and its mentioned attributes can, however, help to better organize big data. It should be noted that complexity was included among the big data attributes (Vs) in [20], where big data was characterized as having 6V + complexity. This is how we will arrange it in Table 1.
The process of extracting hidden, valuable patterns and useful information from big data is called big data analytics [44]. This is done by applying advanced analytics techniques to large data sets [28]. Before commencing the analytics process, data sets may comprise certain consistency and redundancy problems affecting their
quality. These problems arise due to the diverse sources from which the data originated. Data pre-processing techniques are used to address these problems. The techniques include integration, cleansing (or cleaning), and redundancy elimination, and they were discussed by the authors in [39]. Big data analytics can be carried out using a number of frameworks (shown below) that usually require an upgradeable cluster dedicated solely for that purpose [17]. Even though the cluster can be formed using a number of commodity servers [45], this still forms an impediment for limited-budget users who want to analyze their data. The solution is presented through the democratization of computing, which made it possible for companies and business owners of any size to analyze their data using cloud computing platforms for big data analytics. Consequently, the use of big data analytics is not limited to enterprise-level companies. Furthermore, business owners do not have to invest heavily in expensive hardware dedicated to analyzing their data [1]. Amazon is one of the companies that provide 'cloud-computed' big data analytics for its customers. The service is called Amazon EMR (Elastic MapReduce), and it enables users to process their data in the cloud at a considerably lower cost in a pay-as-you-use fashion. The user is able to shrink or expand the size of the computing clusters to control the data volume handled and the response time [1,46].
Dealing with large amounts of data is not an easy task, especially if there is a certain goal in mind. Since data arrives quickly, it is vital to provide fast collection, sorting, and processing. Apache Hadoop was created by Doug Cutting [47] for this purpose. It was later adopted, developed, and released by Yahoo [48]. Apache Hadoop can be defined as a top-level, open source framework written in Java. It utilizes clusters of commodity hardware [49]. Hadoop V1.x (shown in Fig.
1) consists of two parts: a storage part, the Hadoop Distributed File System (HDFS), and a data processing and management part (MapReduce). The master node has two processes, a Job Tracker that manages the processing tasks and a Name Node that manages the storage tasks [50]. When a Job Tracker takes job requests, it splits the accepted job into tasks and pushes them to the Task Trackers located in the slave nodes [51]. In HDFS, the Name Node represents the master part, while the Data Nodes represent the slave part [12]. There is more explanation in the HDFS part below.
Many projects were developed in a quest to either complement or replace the above parts, and not all projects are hosted by the Apache Software Foundation, which is the reason for the emergence of the term Hadoop ecosystem [47]. Hadoop V2.x is viewed as a three-layered model. These layers are classified as storage, processing, and management, as shown in Fig. 2. The current Hadoop project has four components (modules), which are MapReduce, the HDFS, Yet Another Resource Negotiator (YARN), and Common utilities [17].
1-MapReduce: As a programming model, MapReduce was used both as a data processing engine and for cluster resource management. With the emergence of Hadoop v2.0, the resource management task became YARN's responsibility [17]. WordCount is an example illustrating how MapReduce works. As the name implies, it calculates the number of times a specific word is repeated within a document. Tuples ⟨w, 1⟩ are produced by the map function, where w and 1 represent the word and the number of times it appeared, respectively. The reduce function groups the tuples that share the same word and sums their occurrences to reach the concluding result [61].
2-HDFS: HDFS represents the storage file-system component in the Hadoop ecosystem. Its main feature is to store huge amounts of data over multiple nodes and stream those data sets to user applications at high bandwidth. Large files are split into smaller 128 MB blocks, with three copies of each block of data to achieve fault tolerance in the case of disk failure [17,52,53].
3-YARN: YARN was introduced in Hadoop version 2.0. It took over the task of cluster resource management from MapReduce and separated it from the programming model, thus making a more generalized Hadoop capable of selecting programming models like Spark [54], Storm [55], and Dryad [56,57].

4-Common utilities:
To operate Hadoop's sub-projects or modules, a set of common utilities or components is needed. Shared libraries support operations like error detection, Java implementations of compression codecs, and I/O utilities [17,58].
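The WordCount example described under MapReduce can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, group, and reduce steps, not Hadoop code; the function names and the sample documents are assumptions made for the illustration.

```python
from collections import defaultdict

def map_words(document):
    """Map step: emit a (word, 1) tuple for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def group_by_word(pairs):
    """Group emitted values by word, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_counts(groups):
    """Reduce step: sum the occurrences recorded for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(documents):
    pairs = []
    for document in documents:
        pairs.extend(map_words(document))
    return reduce_counts(group_by_word(pairs))

# Hypothetical input documents.
counts = word_count(["big data big networks", "data networks data"])
print(counts)
```

In a real cluster the map calls run in parallel on different blocks of the input, and the grouping step is performed by the framework rather than by user code.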
Over the last few years, researchers in telecommunication networks started to consider big data analytics in their design toolbox. Characterized by hundreds of tunable parameters, wireless network design informed by big data analytics received most of the attention; however, other types of networks received increasing attention as well. The vast amount of data that can be collected from the networks, along with distributed modern high-performance computing platforms, can lead to a new cost-effective design space (e.g. reducing the total cost of ownership by employing dynamic Virtual Network Topology adaptation) when compared to classical approaches (i.e. static Virtual Network Topologies) [59]. This new paradigm promises to convert networks from sightless tubes for data into insightful, context-aware networks. Our contributions in this paper are as follows:
1-We show the role big data analytics can play in wireless and wired network design.
2-The above role is corroborated through the illustration of case studies in Section 2.
3-The significance of this paper lies in helping academic researchers save much effort by understanding the state of the art and identifying the opportunities, as well as the challenges, facing the use of big data analytics in network design.
4-In addition to academic approaches, we surveyed network equipment manufacturing companies, highlighting network solutions based on big data analytics. We also identified the common areas of interest among these solutions, and thus this survey can benefit both academic and industry-oriented readers.
5-This paper provides insights on potential research directions, as illustrated in Section 8.
This paper is organized as follows: Section 2 presents several case studies that use big data analytics in wireless and wired networks. Sections 3-6 illustrate the research conducted in the direction of employing big data analytics in the fields of cellular, SDN & intra-data center, optical networks, and network security, respectively. Section 7 summarizes some of the main big data-based network solutions offered by industry. Section 8 discusses the network design cycle based on big data analytics and highlights the challenges encountered in big data-powered network design. In Section 9, we propose open directions for future research. Finally, the paper ends with conclusions in Section 10.

2. Case studies of the use of big data analytics for wireless and wired networks

Detection of sleeping cells in 5G SON
A wireless cell may cease to provide service with no alarm triggered at the Operation and Maintenance Center (OMC) side. Such cells are referred to as sleeping cells in self-organizing networks (SON). The authors in [60] tackled this problem and presented a case study on the identification of Sleeping Cells (SC). The simulation scenario comprised 27 macro sites, each with three sectors. The user equipment (UE) is configured to send radio measurement and cell identification data of the serving and neighboring cells to the base station, in addition to event-based measurements. The above-mentioned measurements are sent periodically (i.e. every 240 ms). The simulation considered two scenarios: reference (a normally operating network) and SC. The latter was simulated by dropping the antenna gain from 15 dBi (reference scenario) to -50 dBi (SC scenario). Measurements reported from UEs are then collected from each scenario and stored in a database. The reference scenario provided measurements used by an anomaly detection model, based on the k-nearest-neighbor algorithm, to build a model of normal network behavior. Multidimensional Scaling (MDS) is used to produce a minimalistic Key Performance Index (KPI) representation. Thus, the interrelationship between Performance Indexes (PIs) is reflected and an embedded space is constructed. Consequently, similar measurements (i.e. normal network behavior) lie within close distances, while dissimilar measurements (i.e. anomalous network behavior) are scattered far away and hence easily identified. The model attained 94 percent detection accuracy with 7 minutes of training time.
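The distance-based idea behind this case study can be illustrated with a minimal k-nearest-neighbor anomaly score. The sketch below is an assumption-laden simplification and not the model of [60]: it skips the MDS embedding, uses two hypothetical features per measurement (a signal level in dBm and a quality value in dB), and takes the largest leave-one-out score of the reference data as the decision threshold.

```python
import math

def knn_score(point, reference, k):
    """Mean Euclidean distance from point to its k nearest reference measurements."""
    distances = sorted(math.dist(point, r) for r in reference)
    return sum(distances[:k]) / k

def fit_threshold(reference, k):
    """Largest leave-one-out score seen in the reference (normal) data."""
    scores = []
    for i, point in enumerate(reference):
        others = reference[:i] + reference[i + 1:]
        scores.append(knn_score(point, others, k))
    return max(scores)

def is_anomalous(point, reference, k, threshold):
    return knn_score(point, reference, k) > threshold

# Hypothetical (signal level, quality) measurements from the reference scenario.
reference = [(-80.0, 12.0), (-82.0, 11.0), (-79.0, 13.0), (-81.0, 12.5), (-80.5, 11.5)]
k = 2
threshold = fit_threshold(reference, k)

normal_report = (-80.2, 12.1)      # close to the reference behavior
sleeping_report = (-120.0, 1.0)    # far from anything seen in the reference data
print(is_anomalous(normal_report, reference, k, threshold))
print(is_anomalous(sleeping_report, reference, k, threshold))
```

Measurements that resemble the reference scenario receive small scores, while measurements from a degraded cell fall far from all reference points and exceed the threshold.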

A proposed architecture for fully automated MNO reporting system
Mobile Network Operators (MNOs) collect vast amounts of data from a number of sources, as it can offer actionable plans in terms of service optimization. Visibility and availability of information are vital for MNOs due to their role in decision making. Employing a reporting system is pivotal in the cycle of transforming data into information, knowledge, and lastly actionable plans. The authors in [61] presented a case study aimed at illustrating the potential role of big data analytics in the development of a fully automated reporting system. A Moroccan MNO is to benefit from the alternative architecture. The authors highlighted the shortcomings of the existing automatic reporting system, which uses traditional technologies. Moreover, they inferred that using big data analytics can provide the opportunity to overcome those shortcomings.
The authors chose Apache Flink [61] to serve as the big data analytics framework in their proposed architecture. Several reasons contributed to this choice, including Apache Flink's ability to process data in both stream and batch modes, its ease of deployment, and its fast execution when compared to other frameworks such as Spark. Furthermore, Apache Flink can be integrated with other projects like HDFS for data storage purposes. Moreover, Apache Flink is scalable, which makes it an optimal choice for this system.

Network anomaly detection using NetFlow data
Big data analytics can support the efforts in network anomaly and intrusion detection. Towards that end, the authors in [62] proposed an unsupervised network anomaly detection method powered by an Apache Spark cluster in Azure HDInsight. The proposed solution uses a network protocol called NetFlow, which collects traffic information that can be utilized for the detection of network anomalies. The procedure starts by dividing the NetFlow data embedded in the raw data stream into 1-minute intervals. NetFlows are then aggregated according to the source IP, and data standardization is carried out. Afterwards, a k-means algorithm is employed to cluster the aggregated NetFlows (according to normal or abnormal traffic behavior). The following step is to calculate the Euclidean distance between each cluster center and its elements. The procedure concludes by evaluating the success criteria. The authors considered a dataset containing 4.75 hours of records captured at CTU University to analyze botnet traffic. The proposed approach attained 96% accuracy, and the results were visualized in 3D after employing Principal Component Analysis (PCA) for dimension reduction.
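The steps of this procedure (windowing, aggregation by source IP, standardization, k-means clustering, and distance to the cluster center) can be sketched in plain Python. This is a single-machine illustration under stated assumptions, not the Spark implementation of [62]: the flow tuple layout, the three aggregated features, the farthest-point seeding, and the sample traffic are all hypothetical.

```python
import math
from collections import defaultdict

def aggregate(flows, window=60):
    """Sum flow count, bytes, and packets per (time window, source IP)."""
    totals = defaultdict(lambda: [0, 0, 0])
    for timestamp, src_ip, n_bytes, n_packets in flows:
        key = (int(timestamp // window), src_ip)
        totals[key][0] += 1
        totals[key][1] += n_bytes
        totals[key][2] += n_packets
    keys = sorted(totals)
    return keys, [totals[key] for key in keys]

def standardize(rows):
    """Z-score every feature column; constant columns map to zero."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-point seeding (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical flows: (timestamp in seconds, source IP, bytes, packets).
flows = [(5, "10.0.0.1", 500, 5), (20, "10.0.0.1", 700, 7),
         (8, "10.0.0.2", 400, 4), (30, "10.0.0.2", 600, 6), (50, "10.0.0.2", 500, 5),
         (12, "10.0.0.3", 600, 6), (40, "10.0.0.3", 700, 7)]
flows += [(t, "10.0.0.9", 60, 1) for t in range(10, 50)]  # 40 tiny flows from one source

keys, rows = aggregate(flows)
points = standardize(rows)
labels, centers = kmeans(points, k=2)
distances = [math.dist(p, centers[label]) for p, label in zip(points, labels)]
for key, label, distance in zip(keys, labels, distances):
    print(key, label, round(distance, 3))
```

With this toy input the source that emits many tiny flows ends up alone in its own cluster, while the three ordinary sources share the other one. How clusters and distances are turned into a final normal/abnormal decision is left open here, since the text does not specify the success criteria.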

Role of big data analytics in cellular network design
In this section, we review the research done on the use of big data analytics for the design of cellular networks. Compared to other network design topics, we observed that the wireless field has received the most attention, as measured by its share of research papers. These papers can be classified according to the application or area under investigation. Consequently, we have classified those papers into the following:
The results of the above method were then compared with the results of a drive test (which is an expensive and time-consuming approach), where coherence between the two results was demonstrated. Another comparison was conducted with the Key Performance Index (KPI)-based approach, and the results were in favor of the proposed approach.

Signaling data-based intelligent LTE network optimization
By utilizing the combination of all-around signaling, user, and wireless environment data, combined with Self-Organized Network (SON) technologies, full-scale automatic network optimization could be realized.
The authors of [27] developed an intelligent cellular network optimization platform based on signaling data. This system involves three main stages: problems can provide an optimization experience. As a consequence, a variety of network problems can be verified. For example, when a cell has a low handover success rate, according to the definition of the associated indicators, the reason is suggested to be the low success rate of the handover preparation. The solution would be to adjust the overlapping coverage areas formed between the source and the target cells and the parameters (e.g., the decision threshold offset and the handover initiation).
A recommended solution can be provided when a deteriorating indicator surfaces, and this is simply done by clicking the index query that caused the deterioration.

Anomaly detection in cellular networks
When a certain problem occurs in the cellular network, the user is usually the first to feel the service disruption and suffer the impact. An abnormal and disrupted service may be identified by examining the Call Detail Records (CDRs) of the users in a specific area. CDR files are generated upon making a call and include, among other information, the caller and called numbers, the call duration, the caller location, and the cell ID where the call was initiated or received.
A CDR-based Anomaly Detection Method (CADM) was proposed by the authors in [64]. CADM was used to detect the anomalous behavior of user movements in a cellular network. First, the CDR data is collected from the network nodes and stored in a mediation department. The second phase then distributes the collected CDRs to the relevant departments (e.g., data warehouse, billing, and charging departments). After that, the Hadoop platform is used to detect the anomalies. The discovered anomalies are then fed back to the mediation department for adequate actions.
The use of big data analytics was essential in this case. Large datasets that require distributed processing across computer clusters were processed by the Hadoop platform. The result was an improved system that is able to detect location-based anomalies and improve the cellular system's performance.

Self-healing in cellular networks
The idea of developing a system that is capable of monitoring itself, detecting faults, performing diagnoses, issuing a compensation procedure, and conducting a recovery is very appealing. However, the self-healing process has another factor to keep in mind, which is time. The process should be carried out within a reasonable amount of time so that it does not degrade the quality of the delivered services.
Three use cases were presented by the authors in [65] for a self-healing process in cellular networks:
1-Data Reduction: The Operation and Maintenance (O&M) database can be used for troubleshooting purposes. However, the database is relatively large, as it contains the data related to both normal and degraded intervals, which makes it difficult to process. Keeping only the degraded intervals helps in reducing that size. The authors proposed parallelizing this process by analyzing each BS separately and independently. They chose the degraded interval detection algorithm of [66] (a degraded interval is the time during which the BS behavior is degraded), in which these intervals are detected by comparing the BS's KPIs to a certain threshold. This algorithm was parallelized by implementing it as a map function; a field is added to identify each BS, and all the fields are added by a reduce function.
2-Detecting Sleeping Cells: Cell outage, or sleeping cells, is a common problem in mobile networks. Users are directed to neighboring cells instead of the nearest and optimal cell. According to the algorithm described in [67], sleeping cells can be detected by utilizing neighboring BS measurements and thereby calculating the impact of the sleeping cell outage. The detection process relies on the Resource Output Period (ROP), where each BS produces Configuration Management (CM), Fault Management (FM), and Performance Management (PM) data every 15 minutes. For each BS, incoming handovers from neighboring BSs are aggregated for the current and previous ROP. If the number of handovers suddenly drops to zero, and a malfunction is indicated by the cell's Performance Indicators (PIs), the cell is regarded as a sleeping cell. The authors in [65] proposed the use of the above-mentioned algorithm under the big data principle. They proposed to divide the terrain into partitions whose size is the maximum distance between neighbors, where each BS within the partitioned area is sequentially tested by an instance of the algorithm, and this is done by examining the data of its neighbors.
This approach was compared to other methods (e.g., lack of KPIs and availability of KPIs), and most of the simulated outages were detected (5.9% false negatives and 0% false positives), whereas the lack-of-KPIs and availability-of-KPIs methodologies showed a high percentage of false negatives.

3-KPI Correlation-Based Diagnosis:
The authors in [65] used a method that utilizes the most correlated KPIs to identify the cause of the problem. To simplify the analysis task, the algorithm considers the PIs of both the affected BS and the neighboring sectors.
MapReduce was used to implement this algorithm in a parallelized manner: the correlation process and the creation of a list of PIs arranged by correlation were implemented as map and reduce functions, respectively.
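The map/reduce split described for the KPI correlation-based diagnosis can be sketched as follows. This is a minimal single-process illustration, assuming Pearson correlation as the correlation measure and ranking by its absolute value; the indicator names and sample series are hypothetical, and the actual measure used in [65] may differ.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def map_correlation(item, degraded_kpi):
    """Map step: correlate one performance indicator with the degraded KPI."""
    name, series = item
    return name, abs(pearson(series, degraded_kpi))

def reduce_ranking(pairs):
    """Reduce step: build the list of indicators arranged by correlation."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Hypothetical degraded KPI and performance indicators over five intervals.
degraded_kpi = [1, 2, 3, 4, 5]
indicators = {
    "interference": [2, 4, 6, 8, 10],
    "handover_attempts": [5, 3, 4, 1, 2],
    "load_offset": [1, -1, 0, -1, 1],
}

ranking = reduce_ranking(map_correlation(item, degraded_kpi)
                         for item in indicators.items())
for name, strength in ranking:
    print(name, round(strength, 2))
```

Each map call is independent of the others, which is what allows the correlation step to be distributed across the indicators of the affected BS and its neighboring sectors.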

Cell site equipment failure prediction
A sudden outage of services might have serious consequences, which is why keeping communication equipment, like cell sites, in a good working state is of high importance. The challenge identified by the authors in [68] is to analyze the user's bandwidth at the cell level. Equipment failure and infrastructure faults can be predicted by analyzing the bandwidth trends in a particular cell.
Due to the size and diversity of the collected data, it is essential to use big data analytics to process it. Thus, the customers' received bandwidth can be acquired over a particular time period (e.g., a month or a year). Next, the data from diverse data sources is integrated and then analyzed to determine the bandwidth trends.
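One simple way to express a "bandwidth trend" is the slope of a least-squares line fitted to a cell's bandwidth samples. The sketch below is only an illustration of that idea; the text does not state which trend analysis [68] uses, and the sample values, the daily sampling, and the alert threshold are assumptions.

```python
def trend_slope(samples):
    """Least-squares slope of equally spaced samples (units per sample step)."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

def at_risk(samples, slope_threshold=-1.0):
    """Flag a cell whose received bandwidth falls faster than the threshold."""
    return trend_slope(samples) < slope_threshold

# Hypothetical daily average bandwidth (Mbps) received by users in two cells.
declining_cell = [50, 48, 47, 44, 41, 39]
stable_cell = [50, 51, 49, 50, 50, 51]

print(round(trend_slope(declining_cell), 3), at_risk(declining_cell))
print(round(trend_slope(stable_cell), 3), at_risk(stable_cell))
```

A steadily negative slope marks the first cell for inspection, while ordinary fluctuation in the second cell stays well above the threshold.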

Large-scale cellular network traffic monitoring and analysis
Large cellular networks have relatively high data rate links and high requirements to meet. Usually, these networks use a high-performance, large-capacity server to perform traffic monitoring and analysis.
However, with the continuous expansion in data rates, data volumes, and the requirements for detailed analysis, this approach seems to have limited scalability. Hence, the authors of [69] proposed a system to undertake that task, utilizing Hadoop MapReduce, HDFS, and HBase (a distributed storage system that manages the storage of structured data and stores it as key/value pairs) as an advanced distributed computing platform. They exploited its capability of dealing with large data volumes while operating on commodity hardware. The proposed system was deployed in the core side of a commercial cellular network, and it was capable of handling 4.2 TB of data per day supplied through 123 Gbps links with low cost and high performance.

Mobile internet big data operator
China Unicom, China's largest WCDMA 3G mobile operator with 250 million subscribers in 2012, introduced an industry ecosystem. The researchers in [70] highlighted this as a telecom operator-centric ecosystem that is based on a big data platform.
The above-mentioned big data platform was developed for retrieving and analyzing data generated by mobile Internet users. Aiming to optimize the storage, enhance the performance, and accelerate the database transactions, the authors proposed a platform that uses HDFS for distributed storage. The cluster had 188 nodes, used as data storage nodes, statistical data analysis nodes, and management nodes. The approximate storage space was 1.9 PB. HBase has the role of the distributed database, with a writing rate that can reach 145k records per second; HBase stores the structured data located on the HDFS.
Compared with the Oracle database, it is noted that the system achieved a four times lower insertion rate. The query rate was compared to an Oracle database as well, and HBase showed a better performance when taking into consideration the impact imposed by the records' size.

Cache and content delivery

3.3.1 Optimized bandwidth allocation for content delivery
Mobile networks usually have a large number of users, and with the increase in Internet-based applications, it has become essential to allocate the bandwidth required to meet user expectations, as well as to ensure a competitive level of service quality. Cellular networks can provide Internet connectivity to their users at any time; however, video content (especially high quality) is still slow and relatively expensive. From the base station's point of view, the impact of forwarding the same video content to several users on the same base station is massive. The LTE system addressed this through multicast techniques. However, multicast is still regarded as a big challenge in cellular networks. To overcome the above problem, the authors of [26] proposed a solution that can dynamically allocate bandwidth. The idea is based on sharing the base station's wireless channel among a cluster of users who wish to download the content. This saves base station resources, provides a better data rate for the clustered users, and gives the users who did not join the cluster an opportunity to benefit from the saved resources (bandwidth). It should be noted that the clustered users can receive the content from the cluster head by using short-range communication techniques like Wi-Fi Direct [71] and Device to Device (D2D).
Two conditions have to be satisfied before forming a user cluster. First, the users who request the same content are the ones who form the cluster. Second, the users should be, or will be, within a short range of each other. For that reason, the authors suggested using big data analytics to identify the users' closeness and to group the users into clusters. A cluster head is then selected among the nearby users, and the process is repeated among the base station users until every user is either in a cluster or left as a free (un-clustered) user. The simulation was carried out for a single base station network, and the results showed faster content delivery and improved throughput at the user level.
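The two clustering conditions (same requested content and short mutual range) can be illustrated with a simple greedy grouping. This is a sketch under stated assumptions rather than the method of [26]: the first waiting user is taken as the cluster head, closeness is measured to the head only, and the 50 m range, coordinates, and content names are hypothetical.

```python
import math

def form_clusters(users, max_range):
    """users maps user id -> (content id, x, y). Returns (clusters, free users)."""
    by_content = {}
    for user_id in sorted(users):
        by_content.setdefault(users[user_id][0], []).append(user_id)

    clusters, free = [], []
    for content in sorted(by_content):
        waiting = by_content[content]
        while waiting:
            head, rest = waiting[0], waiting[1:]
            # Condition 2: members must be within short range of the cluster head.
            near = [u for u in rest
                    if math.dist(users[head][1:], users[u][1:]) <= max_range]
            if near:
                clusters.append({"content": content, "head": head,
                                 "members": [head] + near})
            else:
                free.append(head)  # nobody nearby wants the same content
            waiting = [u for u in rest if u not in near]
    return clusters, free

# Hypothetical users of one base station: (requested content, x in m, y in m).
users = {
    "u1": ("video_a", 0.0, 0.0),
    "u2": ("video_a", 10.0, 0.0),
    "u3": ("video_a", 200.0, 0.0),
    "u4": ("video_b", 5.0, 5.0),
    "u5": ("video_a", 20.0, 10.0),
}

clusters, free = form_clusters(users, max_range=50.0)
print(clusters)
print(free)
```

In this toy case the base station would transmit video_a once to the head u1, which relays it to u2 and u5 over a short-range link, while u3 and u4 remain ordinary free users.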

Improve cache node determination, allocation, and distribution accuracy in cognitive radio networks
In cognitive radio networks, Secondary Users (SUs) have to leave the licensed spectrum when their activity starts to affect the QoS level of the licensed users. This move requires the existence of a cache node to compensate for the data transactions interrupted during the SU's switch to the unlicensed spectrum.
The author of [72] proposed the use of big data analytics to process the data accumulated over time within the nodes. The goal was to utilize this data to reach a decision on the cache node distribution in a cluster network.
The author selected two out of three categories of cognitive radio networks (open and selectively open systems). Due to the nature of open systems, every SU willingly shares its information to be processed, which results in a large amount of data, so the prediction accuracy is high.
In selectively open systems, the SU selectively shares its information with some cache nodes, with the cluster head for a particular time interval, or with specific SUs in a cluster. This results in a variable amount of shared data and thus in variable accuracy.

Tracking and caching popular data
The number of social network (e.g., Facebook and Twitter) users is massive. The multimedia content of these networks is normally shared between common interest groups. However, big and important events attract a lot of attention, and consequently a lot of content is shared across these networks. When a certain video or event goes viral, this sharing will eventually burden the network, as the requested content has to travel along the network on its way from the servers. The authors of [68] suggested a solution to this problem: monitoring popular and social media websites, analyzing the data, identifying whether there is a growing interest in certain content and in which age category, and caching the popular data for a specific base station. Big data analytics can be of major use in this situation by performing the required analysis. The result would be cached content that is available to the users faster (reduced provisioning delay) and without burdening the network.

Proactive caching in 5G networks
Cache-enabled base stations can serve cellular subscribers by predicting the most strategic contents and storing them in their cache. This minimizes both the delivery time and the consumed network bandwidth, which can pay off in other ways (e.g., less congestion and lower resource utilization).
An approach, proposed by the authors in [34], used big data analytics and machine learning to develop a proactive caching mechanism by predicting the popularity distribution of the content in 5G cellular networks. They demonstrated that this approach can achieve efficient utilization of network resources (backhaul offloading) and an enhanced user experience.
After collecting the raw data, i.e., the user traffic, the big data platform (Hadoop) has the task of predicting the user demands by extracting the useful information, like Location Area Code (LAC), Hyper Text Transfer Protocol (HTTP) request-Uniform Resource Identifier (URI), Tunnel Endpoint Identifier (TEID)-DATA, and TEID for control and data planes. This information is then used to evaluate the content popularity from the previously collected raw data. Experimental testing of this work on 16 base stations, as part of an operational cellular network, resulted in 100% request satisfaction and 98% backhaul offloading.
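The idea of caching by predicted popularity can be illustrated with a minimal sketch. It is not the mechanism of [34]: plain request counts stand in for the machine-learning popularity estimator, and the function names, content IDs, and cache size are all hypothetical.

```python
from collections import Counter

def top_k_cache(training_requests, cache_size):
    """Return the set of the cache_size most requested content IDs."""
    counts = Counter(training_requests)
    return {content for content, _ in counts.most_common(cache_size)}

def cache_hit_ratio(cache, test_requests):
    """Fraction of requests served from the cache (each hit avoids a backhaul fetch)."""
    if not test_requests:
        return 0.0
    hits = sum(1 for content in test_requests if content in cache)
    return hits / len(test_requests)

# Hypothetical traces: content IDs as they might be derived from HTTP request-URIs.
history = ["a", "b", "a", "c", "a", "b", "d"]
later = ["a", "b", "e", "a"]

cache = top_k_cache(history, cache_size=2)
ratio = cache_hit_ratio(cache, later)
print(sorted(cache), ratio)
```

In this toy trace, the hit ratio plays the role of the backhaul offloading figure: every request answered from the base station cache is one that does not cross the backhaul.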

Big data-driven mobile network optimization framework
When thinking about optimizing a cellular network, it is important to collect as much information as possible. Large networks, as well as their users, generate a plethora of data, and big data analytics is vital for analyzing such a colossal amount.
The authors in [73] proposed a mobile network optimization framework that is Big Data Driven (BDD). This framework includes several stages, starting with the collection of big data, followed by storage management and data analytics, and ending with network optimization.
Three case studies were used to show that the proposed framework could be used for mobile network optimization.

1-Managing resources in HetNets:
The Mobile Network Operators (MNOs) may use big data to provide real time and history analysis across users, mobile networks, and service providers. MNOs can benefit from BDD approaches in the operation and deployment of their network, and this can be done in several stages:
A) Network Planning: Due to a lack of sufficient statistical data, evolved Node B (eNB) sites are not optimally planned. This can be dealt with if an adequate amount of information (user and network) is provided for analysis. Big data analytics can help MNOs reach better decisions concerning the deployment of eNBs in the mobile network. The authors in [73] suggested the use of the network and anonymous users' data (e.g., dynamic position information and other service features). Providing a relation between the data and their events can offer a better understanding of the traffic trends. Big data sets provide actionable knowledge to reach an optimal decision concerning how and where to deploy eNBs in the network. Another important feature is the ability to prepare for future investments depending on the predicted traffic trends.
B) Predictive Resource Allocation: Resource requirements change depending on the density and usage patterns of mobile network subscribers. Predicting where and when mobile users are using the network can help in preparing for sudden significant traffic fluctuations. The authors in [73] suggested the use of big data analytics to examine behavioral and sentiment data from social networks and other sources. They also showed an interest in utilizing current and historical data to predict the traffic in highly populated areas within the network. Using the cloud RAN architecture [74], predictive resource allocation can serve the right place at the right time, so that minimal service disruption can be achieved.
C) Interference Coordination: HetNets with small cells require interference coordination among macro and small cells. This coordination has to be carried out in the time domain instead of the frequency domain.
Schemes like the enhanced Inter-Cell Interference Coordination (eICIC) in LTE-Advanced [75] efficiently enable resource allocation among interfering cells, as well as improving the inter-cell load balancing in the HetNets. eICIC allows a Macro cell evolved Node B (MeNB) and its neighboring Small cell eNBs (SeNBs) to have data transmitted in isolated subframes, thus interference from MeNB to SeNB can be avoided. To implement eICIC, a special type of subframe named an Almost Blank Subframe (ABS), which carries minimum (and most essential) control information, was defined. It is worth noting that the ABSs are transmitted with reduced power [75], and that the network operator can control the configuration of that subframe. Many factors contribute to the determination of the ABS ratio of the macro cell to the small cell, such as the traffic load in a specific area and the service type. The optimal ABS ratio varies dynamically because inter-cell interference changes over time with the factors mentioned above. In a BDD system, optimizing the radio resource allocation can be accomplished through the use of network analytics. The deployment of BDD optimization functions at the MeNB would enable it to collect and analyze eNB-originated raw big data (e.g., service characteristics and traffic features) in real-time, thus enabling a quick response. As a result, the performance optimization of each cell and the users can be fulfilled.
Optimizing ICIC parameters (e.g., ABS ratio) can be achieved by processing raw data in a periodic manner to acquire statistics and to detect traffic variations automatically.
Furthermore, the locations and user traffic demands of multiple eNBs can be taken into account jointly, which makes it possible to deactivate a SeNB in order to avoid the interference between nearby SeNBs and thus raise the Signal-to-Interference-plus-Noise Ratio (SINR); this would also reduce the energy consumption.

2-Deployment of cache server in mobile CDN
Popular content (e.g., movies) can be delivered through a Content Delivery Network (CDN), which is a method that is considered efficient by many MNOs. Distributed cache servers should be located near the users to achieve a fast response as well as to reduce the delivery cost. In hierarchical CDN, it is vital to place cache servers in an optimal location. Due to the unique features that RAN has, it was the primary interest of the authors in [73].
It is expected that there will be an enhanced backhaul capability in 5G networks, and this would result in minimizing the concerns related to the latency and traffic load of backhaul transmissions. Therefore, not all MeNBs would require a dedicated distributed cache server. In addition, a SeNB can have a distributed cache server.
Optimal cache server placement depends on several factors, such as the features and load of traffic in a given area, as well as the cost of storage and streaming equipment. To help the MNOs decide where to deploy their cache servers, data analytics methods can be regarded as a feasible solution. However, this would require the collection of all the above-mentioned factors over a long period in the related coverage area.

3-QoE modelling for the support of network optimization:
The authors of [73] believed that the management of services and applications needed more than just relying on the QoS parameters. Instead, they suggested taking the quality as perceived by the end users (i.e., QoE) as the optimization objective. Accurate and automatic real time QoE estimation is important to realize the optimization objective. In addition to the technical factors, non-technical factors (e.g., user emotions, habits, and expectations) can affect the QoE. A profile for each particular user comprising the above non-technical factors would help in the QoE evaluation.
Since answering the questions that would lead to a clear profile is not a task that would be fancied by a typical user, the authors suggested installing a profile collection engine on the users' mobile devices. User activities are compared and tracked to recognize differences and similarities, and then they are stored in a database for additional processing. After profiling, the following step constitutes the use of machine learning to identify the relationship between QoE and the influencing factors.
Data analytics can be used to discover what impacts the QoE in users' devices, as well as the services and network resources. The next step is for the network optimization functions to determine what caused the problem and select the optimal action accordingly.

Improve QoS in cellular networks through self-configured cells and self-optimized handover
Cellular networks have a crucial element on which the concept of mobility depends. This element is the handover success rate, which ensures call continuity while the user moves from one cell to another. Failing in that particular element would impact the quality of the service, thus putting the operator in a questionable situation.
Operators try to make sure that each cell has a list of manually configured and optimized neighbor cells. However, it is vital to note the high probability of these cells failing to adapt when a rapid response is required due to a sudden network change.
The authors in [76] presented two methods that used big data analytics to introduce a self-configured and self-optimized handover process; the first was associated with newly introduced cells, while the second was concerned with the already existing cells. The analysis started by collecting and archiving predefined handover KPIs. A dispatcher process is run after the collection period, and its aim is to check the files to see if they were marked as new cells (where Self-Configuration Analytics is started) or not (where Self-Optimization Analytics is started):

1-NCL self-configuration for new cells:
Newly installed base stations require a Neighbor Cell List (NCL) to be configured on the new cells. The selection process takes into consideration the antenna type, the azimuth angle (for directional cells), and the geographic location of the candidate cells; it concludes by selecting the cells with a minimum distance and maximum traffic load as the top candidate cells. The NCL is configured via Network Management System (NMS) Configuration Management (CM) tools.

2-NCL self-optimization for existing cells:
The process starts by collecting KPI measurement statistics for the failed and successful handovers, and this task is done by the Performance Management (PM) or the NMS.
Cells with a handover success rate below a predefined threshold are excluded from the NCL, while unlisted neighboring cells with a success rate above a predefined threshold are considered as new neighboring cells.
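The threshold rule for existing cells can be sketched as follows. This is an illustration rather than the procedure of [76]: the exclusion criterion is read here as a low handover success rate (an interpretation), and the threshold values, cell names, and the `optimize_ncl` helper are hypothetical.

```python
def optimize_ncl(ncl, handover_stats, remove_below=0.5, add_above=0.9):
    """Return an updated neighbor cell list.

    handover_stats maps a neighbor cell ID to (successful, failed) handover counts.
    Both thresholds are assumed values, not taken from [76].
    """
    updated = set(ncl)
    for cell, (ok, failed) in handover_stats.items():
        attempts = ok + failed
        if attempts == 0:
            continue  # no evidence for this cell, leave it as it is
        rate = ok / attempts
        if cell in updated and rate < remove_below:
            updated.discard(cell)   # listed neighbor that performs poorly
        elif cell not in updated and rate > add_above:
            updated.add(cell)       # unlisted neighbor that performs well
    return updated

# Hypothetical KPI statistics collected over one period.
stats = {"A": (95, 5), "B": (20, 80), "C": (48, 2), "D": (3, 7)}
new_ncl = optimize_ncl({"A", "B"}, stats)
print(sorted(new_ncl))
```

Here the listed cell B is dropped, the unlisted cell C is promoted, and D stays unlisted because its success rate does not reach the addition threshold.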

Optimizing the resource allocation in LTE-A/5G networks
The overall system performance evaluation in advanced wireless systems, like LTE, depends on KPIs. In a quest to enhance the user experience, the authors of [77] proposed an approach that utilizes user and network data, such as configuration and log files, alarms, and database entries/updates. This approach relies on the use of big data analytics to process the above-mentioned data. The ultimate goal is to provide an optimal solution to the problem of allocating radio resources to RAN users, and guarantee a minimal latency between requesting the resource and allocating it. This is done through user and network behavior identification, which is a task well-matched for big data analytics.
The proposed framework involves three stages:
First stage: This process is carried out in the eNB system, processing the data from the cellular and core network side. Binary values are acquired by comparing cellular level KPIs to their respective threshold values, thus keeping the binary matrix updated. This procedure is repeated at fixed intervals.
Second stage: Repeat the same steps as in the first stage. However, this process is carried out on subscriber level data to acquire subscriber KPIs and maintain a binary matrix.
Third stage: This is activated when a user initiates a resource allocation request. A binary pattern is generated based on the user requirements. This pattern is later handed over to stage 2 to update the binary matrix (if required) and incorporate the new values in the row that represents the requested bandwidth.
After generating the updated row, it is transferred to the first stage for comparisons with the current Physical Resource Block (PRB) groups. The fuzzy binary pattern-matching algorithm [78] was used to identify which PRBs suit the user. Using this algorithm, the execution time increased linearly for an exponential increase in the number of comparison patterns.
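The thresholding and matching steps can be pictured with a short sketch. It is only an approximation of the idea: a Hamming-distance tolerance stands in for the fuzzy matching rule of [78], whose details are not given here, and the KPI values, thresholds, and function names are hypothetical.

```python
def to_binary_row(kpis, thresholds):
    """Map each KPI to 1 if it meets or exceeds its threshold, else 0."""
    return [1 if value >= limit else 0 for value, limit in zip(kpis, thresholds)]

def hamming(a, b):
    """Number of positions in which two equal-length binary patterns differ."""
    return sum(x != y for x, y in zip(a, b))

def best_prb_group(request_pattern, prb_matrix, max_mismatch=1):
    """Return the index of the closest PRB-group row, or None if none is close enough."""
    best_index, best_distance = None, max_mismatch + 1
    for index, row in enumerate(prb_matrix):
        distance = hamming(request_pattern, row)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index

# Hypothetical cell-level KPIs for three PRB groups and their thresholds.
thresholds = [10, 0.9, 5]
prb_matrix = [to_binary_row(k, thresholds)
              for k in ([12, 0.95, 3], [8, 0.8, 6], [15, 0.99, 7])]

exact = best_prb_group([1, 1, 1], prb_matrix)
fuzzy = best_prb_group([0, 1, 1], prb_matrix)
strict = best_prb_group([0, 1, 1], prb_matrix, max_mismatch=0)
print(prb_matrix, exact, fuzzy, strict)
```

Setting `max_mismatch=0` reduces the search to exact matching, which shows why some tolerance is useful when no PRB group satisfies every requirement.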

Framework development for big data empowered SON for 5G
The authors of [79] proposed a framework called Big data empowered SON (BSON) for 5G cellular networks. Developing an end-to-end network visibility is the core idea of BSON. This is realized by employing appropriate machine learning tools to obtain intelligence from big data.
According to the authors, what makes BSON distinct from SON are three main features:
• Having complete intelligence on the status of the current network.
• Having the ability to predict user behavior.
• Having the ability to link between network response and network parameters.
The proposed framework contains operational and functional blocks, and it involves the following steps:
1-Data Gathering: An aggregate data set is formed from all the information sources in the network (e.g., subscriber, cell, and core network levels).
2-Data Transformation: This involves transforming the big data to the right data. This process has several steps:
a. Classify the data according to key Operational and Business Objectives (OBO), such as accessibility, retainability, integrity, mobility, and business intelligence.
b. Unify/diffuse: more significant KPIs are obtained by unifying multiple Performance Indicators (PIs).
c. Rank the KPIs according to their impact on each OBO.
d. Filter out the KPIs whose impact on the OBO is below a pre-defined threshold.
e. Relate: for each KPI, find the Network Parameters (NPs) that affect it.
f. Order the associated NPs for each KPI according to their association strength.
g. Cross-correlate each NP by finding a vector that quantifies its association with each KPI.
3-Model: Learn from the right data acquired in step 2 to develop a network behavior model.
4-Run SON engine: New NPs are determined and new KPIs are identified using the SON engine on the model.
5-Validate: If a new NP can be evaluated by expert knowledge or previous operator experience, proceed with the changes. Otherwise, the simulated network behavior for the new NPs is determined. If the simulated behavior tallies with the KPIs, proceed with the new NPs.
6-Relearn/improve: If the validation in step 5 was unsuccessful, feed back to the concept drift [80] block, which will update the behavior model. To maintain model accuracy, concept drift can be triggered periodically even if there was a positive outcome in the validation step.

User-centric 5G access network design
Enhancing the user experience, giving a higher data rate, and reducing the latency are considered the key goals of a 5G system. The authors of [27] expect the following to be the key elements in the design of a user-centric 5G access network.

Personalized local content provisioning
It is important for the access network to evolve from being user and service agnostic, acting merely as a blind pipe that connects the user to the core network, to becoming user-centric. The latter term would direct the network access to the right path of being user and service aware, thus facilitating a key technology in 5G, which is local content provisioning. This in turn would pay off in the form of end-to-end latency reduction and an enhanced user experience.
The role of big data analytics becomes very clear, as it provides a necessary ability of predicting user requirements. These requirements can be met in the case of local availability.
The authors of [27] proposed several steps to achieve content provisioning, and they are as follows:
• Traffic and User Information Acquisition: Traffic attributes (application type, server address, port number, etc.) are collected through packet analysis, and analyzed using a clustering algorithm (e.g., k-means [81]) to perform traffic labelling (news, sports, romance, etc.).
• Analyzing and Predicting User Requirements: This step is achieved by a big data analysis algorithm (an algorithm like collaborative filtering was recommended [82]), which utilizes the traffic attributes mentioned above, and thus recommends content based on the user's interests.
• Local Caching and Content Management: Popular content is provided in the form of local copies. Content that might be of interest to the users is locally cached after being downloaded from the application server.
• Content Provisioning: When an application request is initiated by the user, the system will check if the content is already locally cached, so it can be sent directly. Furthermore, big data analysis can give content recommendations, thus the system will check if these contents are cached locally, and push them directly to the user.
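The recommendation and provisioning steps above can be sketched in a few lines. This is a minimal user-based collaborative filtering example, assuming Jaccard similarity over sets of requested content; the similarity measure, the user and content names, and both helper functions are illustrative assumptions and are not taken from [27] or [82].

```python
def jaccard(a, b):
    """Similarity of two content sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, history, top_n=2):
    """Score content the user has not requested by the similarity of the users who did."""
    seen = history[user]
    scores = {}
    for other, items in history.items():
        if other == user:
            continue
        weight = jaccard(seen, items)
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + weight
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return [item for item in ranked[:top_n] if scores[item] > 0]

def contents_to_push(user, history, local_cache):
    """Recommended contents that are already cached locally and can be pushed directly."""
    return [item for item in recommend(user, history) if item in local_cache]

# Hypothetical labelled request history per user.
history = {
    "u1": {"news1", "sport1"},
    "u2": {"news1", "sport1", "sport2"},
    "u3": {"movie1"},
}
print(recommend("u1", history), contents_to_push("u1", history, {"sport2", "news1"}))
```

The split between `recommend` and `contents_to_push` mirrors the last two bullets: prediction decides what is of interest, and the local cache decides what can be delivered without reaching the application server.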

Providing flexible network and functionality deployment
The authors of [27] noted that processing the characteristics of both the regional user and service using big data analytics can be of great help in flexible network and functionality deployment in the following ways:
• Flexible Network Deployment: Since 5G will support diverse low-cost Access Points (APs), such as coverage APs and hot spot APs, big data analytics can be used to forecast the traffic characteristics and hence establish the basis for a dynamic network deployment of the APs.
• Flexible Functionality Deployment: Analyzing and predicting regional user and service requirements can be realized using big data analytics. This will form the foundation of dynamic functionality deployment, which will help decide where to deploy certain functionality modules (e.g., safety modules where there are security requirements).

Using user behavior awareness to optimize wireless resources in 5G networks
Optimization of wireless resources should be carried out according to the user and 5G service requirements. This is done to enhance the user experience and improve the efficiency.
According to what the authors of [27] proposed, big data can be used to analyze user mobility patterns, predict the motion trajectory, and hence pre-configure the network accordingly. Each user's historical Access Point (AP) list is recorded by the APs. This data can be uploaded to a central module for processing, or to a target AP when the serving AP changes. Using a big data analytics algorithm, the collected data is analyzed to forecast the motion trajectories.
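One simple way to forecast the next AP from a historical AP list is a first-order Markov model over observed transitions. The sketch below is an assumption made for illustration; [27] does not prescribe a particular prediction algorithm, and the AP names and helper functions are hypothetical.

```python
from collections import Counter, defaultdict

def build_transitions(ap_history):
    """Count how often each AP is followed by each other AP in the history."""
    transitions = defaultdict(Counter)
    for current_ap, next_ap in zip(ap_history, ap_history[1:]):
        transitions[current_ap][next_ap] += 1
    return transitions

def predict_next_ap(transitions, current_ap):
    """Return the most frequent successor of current_ap, or None if it was never seen."""
    followers = transitions.get(current_ap)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical historical AP list of a single user.
history = ["AP1", "AP2", "AP3", "AP1", "AP2", "AP4", "AP1", "AP2", "AP3"]
transitions = build_transitions(history)
print(predict_next_ap(transitions, "AP2"))
```

A predicted target AP could then be pre-configured before the user arrives, which is the purpose the text attributes to trajectory forecasting.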

Big data-based network operation system
The authors of [27] proposed a system that can maximize the efficiency of big data based network operation. This can be fulfilled by optimally allocating the network resources to each AP and user. The proposed system consists of two parts:
1-Decision making domain: This is responsible for collecting and managing the user, network upgrades, configuration, service, and terminal state information. This domain exploits big data analytics to provide the basic configuration essential for network initialization. For this domain to function properly, it would need to realize the whole picture of the user and service requirements, as well as the functionality distribution in the network.
2-Implementation domain: This domain is mainly responsible for status reporting of the user, terminal, and network, as well as for dynamic deployment and network configuration. Depending on the requirements after acquiring the personalized configuration, this domain can use the dynamic APs functionality and configuration to build multi-connectivity bearers with terminals.

Multi-RAT or HetNet energy saving
Small cells are used in multi-Radio Access Technology (RAT) or HetNet mobile networks. For energy saving purposes, when traffic falls below a certain threshold, small cells may enter into a dormant state, and this forces the neighboring small cells to serve the dormant cell's coverage area. Several energy saving schemes were discussed in [83]. However, these schemes failed to adapt to the dynamic temporal and spatial traffic variations, as they operate under a relatively long time scale. When a number of newly arrived User Equipment (UE) seek access to the network and cell activation is essential, the corresponding small cells must first be activated before operating normally. Only at that point can the UEs access the network using the standard procedure, so the UEs would suffer from an unacceptable delay when trying to access the cells.
The authors of [27] proposed a user-centric approach that is based on big data analytics. The proposed solution claims to achieve optimal implementation for the activation of small cells as well as UE access to the network. Thus, joint optimization for both UEs' access and energy saving can be achieved.

Network flexibility using consumption prediction
Consumption analysis is concerned with two factors: customer locations and type of service. Consumption trends can be classified by time scale into long-term, seasonal, and short-term.
To reach an accurate prediction, the authors in [68] suggested that user data (e.g., GPS location and service usage) can be correlated with other data (e.g., news, social network, events, and weather conditions). Using big data analytics to analyze these correlations, operators would be able to decide when and where to place their nodes without affecting the subscribers' satisfaction.

The role of big data analytics in SDN & intra-data center networks
A Software Defined Network (SDN) offers the ability to program the network with a centralized controller. This controller is capable of programming several data planes using one standardized open interface, thus providing flexible architectural support [1]. The following research topics utilized the properties of both SDN and big data analytics by employing the analysis results to program the network. Those topics can be classified according to the area under discussion as follows:
1-Traffic prediction: The papers surveyed in this section employ traffic prediction to optimize network resource allocation.
2-Traffic reduction: Pushing the aggregation from the edge towards the network.

Cognitive routing resources in SDN networks
The next generation of networks has to be smart and flexible, with the ability to modify its strategy according to the network status in an automatic manner. To simplify the network management, SDN has made the above requirement possible by decoupling the control and forwarding planes through the OpenFlow protocol [84].
OpenFlow is considered to be the first standardized protocol in SDN; it is also identified as the enabler of SDN. SDN/OpenFlow has influenced Google to switch to OpenFlow in its inter-data center network, which resulted in an approximate 99% increase in the average Google WAN link utilization [85].
The architecture proposed by the authors of [85] included the following parts:
1-User preference analysis server: The authors adopted the Hadoop platform to realize the prediction functionality. They utilized the analyses of both network traffic and user application information to find each application flow distribution. For each data flow, they found a specific distribution law. They analyzed that law, and for different applications and areas they developed a preliminary general prediction model to fit the case of the same application but in different areas.

2-Interface design between SDN controller and database:
A cloud platform is responsible for calculating and predicting the flow distribution values of each OpenFlow switch. In addition, this platform will read the link information and perform traffic prediction. A database will hold the recorded values and the last predicted values will be updated. To ensure that the allocation of resources accommodates the traffic variation, Floodlight (a Java-based SDN controller that can accommodate different applications by loading different modules) will read the newest predicted values from the database regularly.

3-SDN controller-based routing module:
The predicted values are used as preferences to select the best route. The researchers used an improvement on the Dijkstra algorithm. Application awareness and the prediction of user preferences were integrated into SDN through the newly proposed architecture, which would facilitate and enhance network resources allocation and provide better application classification.
The role of big data is exemplified by its ability to use network flow analyses and users' behavior to forecast the type and rate of the incoming traffic flows.
The procedure proposed by the authors of [85] is as follows:
1-Current network load (i.e., traffic volume and type, etc.) is read by a cloud platform.
2-The overall traffic is predicted in advance and gathered in a database. This is done using a big data-powered prediction algorithm.
3-The SDN controller accesses the database and reads the stored data mentioned in step two.
4-A resource allocation scheme is created by the SDN controller using the above-mentioned routing algorithm, and this scheme is sent later to the related switches.
Big data analytics can use the users' requirements to provide the network with a dynamic resource allocation and application classification, hence providing the network with better load balancing techniques.
The results from the implemented testbed showed the ability of the proposed solution to self-adapt towards flow variation by dynamically issuing traffic tables to the related switches, which can increase the resource utilization and attain an improved overall load balance.
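The way predicted traffic can steer route selection may be sketched with a standard Dijkstra search whose link costs are inflated by the predicted utilization. The survey does not detail the improved Dijkstra algorithm of [85], so the cost function, the 0.05 headroom floor, the topology, and the function names below are all assumptions made for illustration.

```python
import heapq

def predicted_cost(base_cost, predicted_utilization):
    """Inflate a link cost as its predicted utilization approaches 1 (assumed cost model)."""
    headroom = max(1.0 - predicted_utilization, 0.05)
    return base_cost / headroom

def shortest_path(links, source, target):
    """Dijkstra over links of the form {node: {neighbor: (base_cost, predicted_utilization)}}."""
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, (base, util) in links.get(node, {}).items():
            if neighbor not in visited:
                step = predicted_cost(base, util)
                heapq.heappush(queue, (cost + step, neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical four-switch topology; the s1-s2 link is predicted to be heavily loaded.
links = {
    "s1": {"s2": (1, 0.9), "s3": (1, 0.2)},
    "s2": {"s4": (1, 0.1)},
    "s3": {"s4": (1, 0.2)},
    "s4": {},
}
cost, path = shortest_path(links, "s1", "s4")
print(cost, path)
```

In a controller, the resulting path would be translated into flow entries for the switches along it; with the predicted values refreshed periodically from the database, the same search yields different routes as the forecast changes.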

Predicting data communication volume at runtime in data center networks using SDN
Networks that deal with big data applications may suffer from the size and speed of data. For example, the networks' overall response time is affected by MapReduce's heavy communication phase. This problem can be intensified if the communication patterns experience a heavy skew impact. The authors of [86] have proposed Pythia, which is a system that can optimize the data center network at runtime by utilizing the real time communication prediction of Hadoop. It also maps the end-to-end flows to the underlying network.
Pythia utilizes the SDN-offered programmability to achieve efficient and timely network resource allocation for shuffle transfers.
Depending on the network workload and blocking ratio, the Hadoop workload saw a consistent acceleration when using Pythia, and job completion time was reduced between 3% and 46% in comparison with MapReduce benchmarks.

Increasing network performance through traffic reduction
Large networks, such as those of Google and Facebook, or even small and medium sized enterprise networks suffer from a plethora of traffic. This happens due to the colossal amount of data being processed either in batch or real time applications.
A common solution is to increase the available bandwidth in the enterprise clusters. However, the authors of [87] proposed another approach that improves the network performance by pushing the data aggregation from the edge towards the network, thus decreasing the traffic.
A platform called CamCube [88] was used; it substitutes the use of dedicated switches by distributing the functionality of the switch across the servers. It is worth noting that CamCube offers the ability to intercept and modify packets at each hop. In addition, it uses a direct-connect topology, in which 1 Gbps Ethernet cross-over cables are used to connect servers directly to each other.
Exploiting CamCube's properties to realize high performance, Camdoop, which is a CamCube service that runs MapReduce-like jobs, was used. It offers full on-path data stream aggregation. Camdoop builds aggregation trees where the children are represented by the intermediate data sources, while the roots are located at the servers performing the final reduction.
A small prototype of Camdoop running on CamCube was tested, and a simulation was used to show that the same properties still hold at scale. The results showed a significant traffic reduction with the proposed system when compared to Camdoop running on a switch and when compared to systems like Hadoop and Dryad/DryadLINQ [56], [89].
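The effect of on-path aggregation along a tree can be illustrated with a toy word-count example. This sketch is not Camdoop code: the tree shape, the per-source counts, and the `aggregate_tree` function are hypothetical, and the traffic is simply measured as the number of key-value records sent over each child-to-parent link.

```python
from collections import Counter

def aggregate_tree(node, children, local_counts, links_used):
    """Merge counts bottom-up and record how many records each child sends to its parent."""
    merged = Counter(local_counts.get(node, {}))
    for child in children.get(node, []):
        child_result = aggregate_tree(child, children, local_counts, links_used)
        links_used.append(len(child_result))  # records crossing the child-to-parent link
        merged.update(child_result)           # partial aggregation at the intermediate hop
    return merged

# Hypothetical tree: sources a and b send to m, which forwards to the root.
children = {"root": ["m"], "m": ["a", "b"]}
local_counts = {"a": {"x": 1, "y": 2}, "b": {"x": 3, "z": 1}}

links_used = []
result = aggregate_tree("root", children, local_counts, links_used)
records_without_aggregation = sum(len(c) for c in local_counts.values())
print(dict(result), links_used, records_without_aggregation)
```

Because the intermediate node merges the records for the shared key before forwarding, the last hop carries three records instead of the four that would be forwarded unmodified.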

The Role of Big Data Analytics in Optical Networks:
This section discusses research papers that employ big data analytics for optical network design. The topics are classified as follows:
1-Network optimization: Here the parallel processing merits of Hadoop are utilized to reduce the execution time of different (bin-packing) optimization algorithms.
2-Traffic prediction: Using big data analytics to dynamically reconfigure the network according to predicted traffic.

Solving the RWA problem
The Routing and Wavelength Assignment (RWA) algorithm [90] plays an important role in optical networks. The authors in [91] considered the RWA algorithm to be an illustration of the bin-packing problem that is listed as a classical NP-hard problem [92].
The authors in [91] used a Hadoop cloud computing system that consisted of 10 low-end desktop computers to independently run an instance of the RWA algorithm on each of them for a certain number of demand sequences. The goal is to evaluate a sufficient number of demand sequences within a short period. The procedure is as follows:
1-An input file, which incorporates the information of the lightpath demand requests, is fed to the HDFS.
2-The file is read by the map function, where the demand list is regarded as a value and combined with different keys ranging from 0 to 19 that later serve as random seeds in the reduce functions. It is worth noting that the authors set two reduce functions per computer (i.e., a total of 20 reduce functions).
3-The key-value pairs are then forwarded to the 20 reduce functions where parallel computing is conducted. Within each reduce function, the lightpath demand list is shuffled 250 times in a random fashion (i.e., 5000 shuffled demand sequences in total). To acquire the number of required wavelengths, the RWA heuristic is run for each of the shuffled demand sequences. The optimal result of each reduce function is then compared against the remaining 19 to find the global optimum.
Different test networks (ranging from 20 to 500 nodes) were used to evaluate the performance of the Hadoop system. The optimality of the results was evaluated by comparing them against the results of an Integer Linear Programming (ILP) optimization model, and they showed a close proximity to the optima (except for two cases). It is worth noting that the ILP approach assumes full wavelength conversion, which plays the role of the performance's lower bound in the present evaluation.
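The shuffle-and-compare procedure can be imitated on a single machine with a short sketch. It is not the implementation of [91]: routes are assumed to be fixed in advance, a first-fit wavelength assignment stands in for the RWA heuristic, the reduce functions are run sequentially rather than on Hadoop, and the demand set and function names are hypothetical.

```python
import random

def first_fit_wavelengths(demands):
    """Assign each demand the lowest wavelength free on all its links; return the count used."""
    link_usage = {}  # link name -> set of wavelengths already in use
    highest = -1
    for path in demands:
        wavelength = 0
        while any(wavelength in link_usage.get(link, set()) for link in path):
            wavelength += 1
        for link in path:
            link_usage.setdefault(link, set()).add(wavelength)
        highest = max(highest, wavelength)
    return highest + 1

def reduce_task(seed, demands, shuffles):
    """One reduce function: evaluate `shuffles` random orderings seeded by its key."""
    rng = random.Random(seed)
    best = None
    for _ in range(shuffles):
        order = list(demands)
        rng.shuffle(order)
        needed = first_fit_wavelengths(order)
        best = needed if best is None else min(best, needed)
    return best

def global_best(demands, seeds, shuffles):
    """Compare the best result of every reduce function to find the overall minimum."""
    return min(reduce_task(seed, demands, shuffles) for seed in seeds)

# Hypothetical demands, each given as the tuple of links on its fixed route.
d1, d2, d3, d4 = ("a",), ("a", "b"), ("b", "c"), ("c",)
bad_order = first_fit_wavelengths([d1, d4, d2, d3])
good_order = first_fit_wavelengths([d1, d2, d3, d4])
best = global_best([d1, d4, d2, d3], seeds=range(20), shuffles=250)
print(bad_order, good_order, best)
```

The two hand-picked orderings show why shuffling matters: the same four demands need three wavelengths in one order and two in another, and evaluating many random orderings recovers the smaller value.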

Solving multiple optimization problems using Hadoop
The authors in [93] proposed to solve several optimization problems in the optical network paradigm. The problems are:
1-Energy minimization problem [94], where the goal is to minimize the overall network power consumption from non-renewable energy sources.
2-Shared Backup Path Protection (SBPP)-based elastic optical network planning problem [95], where a heuristic used the concept of Spectrum Windows Planes (SWPs). The objective was to minimize the maximum number of Frequency Slots (FSs) in the network.
3-Adaptive Forward Error Correction (FEC) assignment problem [96], where the goal is to maximize the total number of FSs utilized for user data transmission. A heuristic based on SWPs was developed to solve the Routing and Spectrum Allocation (RSA) problem.
The above problems are of a bin-packing type and are classified as NP-hard. Several aspects (i.e., demand size and route) should be considered when serving network traffic demands. Because of the high computational complexity and the dependence on the order in which demands are served, the performance of heuristic algorithms that try to solve these problems cannot be guaranteed, particularly when the simple largest-to-smallest ordering strategy is used. Good performance can be attained by randomly shuffling the demand sequence, running a heuristic algorithm for each shuffled sequence, and choosing the sequence with the best performance. To shorten the computation time and to overcome the computational complexity, a Hadoop cloud computing system consisting of seven computers was proposed by the authors in [93]. This way, a heuristic algorithm can be executed for multiple shuffled demand sequences in parallel.
The Hadoop MapReduce platform makes it possible to evaluate multiple shuffled demand sequences in parallel. A heuristic algorithm serves each of the shuffled demand sequences and a result is produced each time. The results are then compared and the one with the best performance is chosen. The same procedure is repeated on each reduce function. The final global optimum is found by comparing the results across all reduce functions.
Performance evaluation is done by employing two test networks: the 24-node, 43-link USNET network (adopted for problems 1 and 3) and the 11-node, 26-link COST239 network (used for problem 2). For the first optimization problem, the total consumption of non-renewable energy decreased by 8% when the number of shuffled demand sequences increased from 1000 to 10,000. As for the second optimization problem, the number of required FSs was significantly reduced. In the third problem, the total number of FSs for user data transmission was increased, and a 3% performance improvement was noted when compared with the case of running Hadoop on a single machine. The computation time for all three problems was significantly shorter than when running Hadoop on a single machine.

VNT adaptability using traffic prediction
The emergence of new services is placing new demands on networks in terms of large and dynamic bit rate requirements. This has caused network operators to look for a Virtual Network Topology (VNT) architecture that can cope with the anticipated traffic in a dynamic fashion. One solution is to overprovision the network; however, the downside is the increase in Total Cost of Ownership (TCO). Another solution saves power by using threshold-based capacity reconfiguration [97]. The drawback is that there is no saving in the number of transponders that need to be installed in each IP router compared to overprovisioning.
An alternative solution is proposed in [59], where VNT reconfiguration can be performed regularly using big data analytics. This is done by periodically analyzing Origin-Destination (OD) traffic so that VNT reconfiguration can be performed accordingly. Traffic monitoring data is collected regularly at the edge IP routers. Every edge router collects a set of traffic samples for every other destination router. These sets are stored in a collected data repository. At the end of each predefined time period, the collected data is summarized for each OD pair by retrieving it from the collected data repository and applying data stream mining sketches. The result of this stage is a modeled data repository which includes, among others, the maximum, average, and minimum bit rate for every OD pair. Using machine learning techniques (i.e., Artificial Neural Networks), a prediction module generates the predicted OD traffic matrix for the upcoming period. A decision-maker module relies on this matrix to decide whether or not to perform VNT reconfiguration. If a reconfiguration is required, a VNT optimizer is provided with both the current and the predicted OD traffic matrices. The solution is fed back to the network controller, which implements the required changes in the data plane.
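The chain of summarizing, predicting, and deciding can be sketched as below. This is a minimal illustration under stated assumptions and not the system of [59]: exponential smoothing is swapped in for the Artificial Neural Network predictor, and the `high` and `low` load thresholds, the capacities, and the monitoring data are hypothetical.

```python
from statistics import mean


def summarise(samples):
    """Condense raw bit-rate samples per OD pair into min/avg/max (modeled data)."""
    return {
        od: {"min": min(rates), "avg": mean(rates), "max": max(rates)}
        for od, rates in samples.items()
    }


def predict_next(history, alpha=0.5):
    """Exponential smoothing over past per-period values (stand-in for the ANN)."""
    estimate = history[0]
    for value in history[1:]:
        estimate = alpha * value + (1 - alpha) * estimate
    return estimate


def needs_reconfiguration(capacity, predicted, high=0.8, low=0.3):
    """Trigger when any OD pair is predicted to leave the [low, high] load band."""
    for od, rate in predicted.items():
        load = rate / capacity[od]
        if load > high or load < low:
            return True
    return False


# Hypothetical monitoring data: bit rates in Gb/s for three past periods.
periods = [
    {("A", "B"): [40.0, 50.0, 60.0], ("A", "C"): [20.0, 30.0]},
    {("A", "B"): [60.0, 70.0, 80.0], ("A", "C"): [25.0, 35.0]},
    {("A", "B"): [80.0, 90.0, 100.0], ("A", "C"): [30.0, 40.0]},
]
modeled = [summarise(p) for p in periods]
predicted = {
    od: predict_next([m[od]["max"] for m in modeled]) for od in modeled[0]
}
capacity = {("A", "B"): 100.0, ("A", "C"): 100.0}
print(predicted, needs_reconfiguration(capacity, predicted))
```

In this sketch the predicted matrix would be handed to a VNT optimizer only when `needs_reconfiguration` returns true; the optimizer itself is outside the scope of the example.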
The performance of the proposed solution was compared against the static and the threshold-based methods. An overall saving between 8% and 42% in the number of installed transponders was achieved. The proposed solution is able to react during low-traffic hours by deactivating transponders, thus saving energy. Cost is also reduced by releasing lightpaths from the underlying optical layer.

Peer-to-Peer Botnet detection
Many security problems on the Internet are caused by Botnets, which can be defined as networks of malware-infected machines controlled by an adversary [98]. Botnet attacks form a real security concern: with the ability to utilize 90,000 IPs in an attack [99], they are a challenge on an international scale, especially when taking into consideration the financial damage they can inflict.
To detect and neutralize such attacks, security researchers and network analysts consider packet capturing and network tracing to be amongst the most appreciated resources. However, analyzing these massive datasets is not an easy task for today's computers. To overcome that challenge, the authors of [100] proposed a scalable threat detection framework that uses the following components:
1-Traffic sniffer: Dumpcap [101] is used for packet sniffing while Tshark [101] is used for field extraction, and the fields are then submitted to the HDFS for storage.
2-Feature Extraction Module: The HDFS-submitted files are then processed by Apache Hive [102] for feature extraction.
3-Machine Learning Module: Scalability is a requirement when it comes to the machine learning module. This requirement is met using a machine learning library called Mahout [103], thus harnessing the cluster's high computational power to achieve optimized results. It is worth noting that Mahout is built on top of Hadoop, and its core classification and clustering algorithms are run as MapReduce jobs.
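To make the feature extraction stage more concrete, the sketch below groups extracted packet fields by flow and computes a few per-flow statistics, which is the kind of aggregation a Hive GROUP BY query would perform at scale. It is plain Python rather than Hive, and the field names (`time`, `src`, `dst`, `length`) and the chosen features are assumptions for illustration; the source does not list the features used in [100].

```python
from collections import defaultdict


def flow_features(packets):
    """Group packet records by (src, dst) and compute simple per-flow features."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[(pkt["src"], pkt["dst"])].append(pkt)

    features = {}
    for flow, pkts in flows.items():
        times = [p["time"] for p in pkts]
        sizes = [p["length"] for p in pkts]
        features[flow] = {
            "packets": len(pkts),
            "bytes": sum(sizes),
            "mean_size": sum(sizes) / len(sizes),
            "duration": max(times) - min(times),
        }
    return features


# Hypothetical extracted fields (timestamp in seconds, addresses, length in bytes).
packets = [
    {"time": 0.0, "src": "10.0.0.1", "dst": "10.0.0.9", "length": 100},
    {"time": 2.0, "src": "10.0.0.1", "dst": "10.0.0.9", "length": 300},
    {"time": 1.0, "src": "10.0.0.2", "dst": "10.0.0.9", "length": 60},
]
features = flow_features(packets)
print(features)
```

The resulting per-flow feature vectors are what a classification or clustering stage would consume.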
The proposed approach achieved a detection time within tens of seconds, and the authors claimed that this time can be reduced to less than 10 seconds after some Hadoop tweaking and additions to the cluster.

Improving network security by discovering multi-pronged attacks
Networks are considered a target for intruders who would try to infiltrate them. Multi-pronged attacks may spread over network subnets; the spreading might target several scattered network points or take place in different events over time.
To discover and predict such attacks, the authors of [104] proposed a system named Big-distributed Intrusion Detection System (B-dIDS) that relies on two components:
1-HAMR: An in-memory MapReduce engine used for big data processing. It is worth noting that HAMR supports both batch and streaming analytics in a seamless manner.
2-An analytics engine: Residing on top of HAMR, the analytics engine includes a novel ensemble algorithm. Its basic principle relies on using clusters with multiple IDS alarms to extract the training data.
The proposed system scans the IDS log data, checking for alarms that might be treated as unthreatening at first glance (when examined separately) but that may result in an opposite judgment after combining them with other alerts.
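The idea of combining individually harmless alarms can be illustrated with a simple rule: flag a source when its alarms touch several distinct subnets within a short time window. This rule is a deliberately simplified stand-in and is not the ensemble algorithm of B-dIDS; the alarm fields, the `window` length, and the `min_subnets` threshold are hypothetical.

```python
from collections import defaultdict


def correlated_sources(alarms, window=300, min_subnets=3):
    """Flag sources whose alarms hit at least min_subnets subnets within window seconds."""
    by_source = defaultdict(list)
    for alarm in alarms:
        by_source[alarm["source"]].append(alarm)

    flagged = set()
    for source, items in by_source.items():
        items.sort(key=lambda a: a["time"])
        start = 0
        for end in range(len(items)):
            # Shrink the window from the left until it spans at most `window` seconds.
            while items[end]["time"] - items[start]["time"] > window:
                start += 1
            subnets = {a["subnet"] for a in items[start:end + 1]}
            if len(subnets) >= min_subnets:
                flagged.add(source)
                break
    return flagged


# Hypothetical low-severity alarms taken from an IDS log.
alarms = [
    {"time": 0, "source": "X", "subnet": "a"},
    {"time": 100, "source": "X", "subnet": "b"},
    {"time": 200, "source": "X", "subnet": "c"},
    {"time": 0, "source": "Y", "subnet": "a"},
    {"time": 1000, "source": "Y", "subnet": "b"},
    {"time": 2000, "source": "Y", "subnet": "c"},
    {"time": 0, "source": "Z", "subnet": "a"},
    {"time": 10, "source": "Z", "subnet": "a"},
    {"time": 20, "source": "Z", "subnet": "a"},
]
print(correlated_sources(alarms))
```

Here only source X is flagged: Y reaches three subnets but too slowly for this window, and Z stays within a single subnet.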

Device fingerprinting in wireless networks
Big data analytics and machine learning have several algorithms in common [105]. More security concerns can be addressed through machine learning-based techniques such as device fingerprinting. The authors in [106] have conducted a survey on device fingerprinting methods in wireless networks. They illustrated the main features and techniques used towards this end. Device fingerprinting can be defined as the process of generating device-specific signatures by gathering device information. This is done by analyzing the information across the protocol stack, and it can be used to counter the vulnerability of wireless networks to insider attacks and node forgery. Two types of fingerprinting algorithms were discussed: white-list based (i.e., supervised learning) and unsupervised learning based approaches.
The device fingerprinting process is broken into three main steps. Step one is concerned with identifying relevant features found in all layers across the protocol stack.
Step two is where features are extracted and modeled. The features tend to be stochastic in nature due to the dynamic nature of wireless channels; consequently, the models will be stochastic as well.
Step three is where device identification takes place by employing machine learning algorithms (supervised and unsupervised).
The authors reviewed the existing algorithms and concluded that, despite their high computational complexity, unsupervised learning methods are limited to detecting the presence of an attack and the likely culprit involved, and fail to pinpoint the malicious devices exactly. In spite of this limitation, unsupervised learning approaches proved more practical than white-list based approaches, as they require neither a pre-registration process nor human intervention.
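As a rough illustration of the white-list based approach and of the stochastic modeling in step two, the sketch below enrolls each known device as a Gaussian model of a single feature and identifies an observed device by the highest average log-likelihood. It is not taken from [106]: the single scalar feature, the Gaussian assumption, the rejection `threshold`, and the device names are all hypothetical.

```python
import math
from statistics import mean, stdev


def enroll(training):
    """Build the white list: one (mean, standard deviation) model per device."""
    return {device: (mean(x), stdev(x)) for device, x in training.items()}


def avg_log_likelihood(samples, mu, sigma):
    """Average Gaussian log-likelihood of the observed samples under one model."""
    terms = [
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in samples
    ]
    return mean(terms)


def identify(samples, models, threshold=-5.0):
    """Return the best-matching enrolled device, or None if nothing fits well."""
    scores = {
        device: avg_log_likelihood(samples, mu, sigma)
        for device, (mu, sigma) in models.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# Hypothetical feature samples (e.g., a timing-related feature) per enrolled device.
training = {
    "dev-A": [1.0, 1.1, 0.9, 1.05, 0.95],
    "dev-B": [2.0, 2.1, 1.9, 2.05, 1.95],
}
models = enroll(training)
print(identify([1.02, 0.98], models), identify([10.0, 10.1], models))
```

The pre-registration step (`enroll`) is exactly what unsupervised approaches avoid, at the price of not being able to name the offending device as precisely.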

Machine learning methods for cybersecurity intrusion detection
The authors in [107] surveyed the topic of intrusion detection methods based on data mining and machine learning algorithms. They compared different methods taking into account complexity, accuracy, understandability of the final solution, and classification time of an unknown instance using a trained model. They referred to the availability of labeled data as the biggest gap that, if bridged, can lead to significant advances in machine learning and data mining methods in the field of cyber security. They also highlighted an open area for research, namely the investigation of fast incremental learning methods to update misuse and anomaly detection models on a daily basis.
Finally, we summarise the research outcomes related to big data analytics-based network design in Fig. 3. It is clear that the wireless field is getting most of the researchers' attention compared to the other fields. This may be attributed to the more significant challenges faced by wireless networks compared to wired networks and hence the more significant level of opportunities. The larger number of papers addressing the use of big data analytics and methods in wireless may also be a reflection of the larger number of researchers that focus on wireless networks compared to wired networks. Furthermore, we present a summary in Table 2 for all the research topics illustrated between sections 2 and 5.

Table 2: Summary of the surveyed research topics
Analyzed inter-technology (2G-3G) failed handovers.
[27] Used XDR data to discover network failures and advise on a solution.
[64] Developed CADM, which uses CDRs to identify anomalous sites.
[65] Presented three case studies of self-healing using big data analytics.
[68] Suggested the analysis of the bandwidth trends to predict equipment failure.
Network Monitoring:
[69] Developed a Hadoop-based system to monitor and analyze network traffic.
[70] Developed a solution powered by big data platforms with distributed storage and distributed database to solve the issues of data analysis and acquisition.
Cache and Content Delivery:
[26] Utilized big data to form a cluster made up of nearby users that share the base station's wireless channel.
[72] Analyzed the data that resides within the cache nodes to enhance the determination, allocation, and distribution of cache nodes.
[68] Suggested monitoring and analyzing social media and popular sites, to predict and cache certain contents, according to age category, at the predicted locations where these contents are highly demanded.
[34] Proposed the use of big data analytics and machine learning techniques to proactively cache popular content in 5G networks.
Network Optimization:
[73] Presented three case studies in which a proposed network optimization framework is efficiently utilized. In particular, the work suggested: 1) The use of big data analytics to manage resources in HetNets. This is done in three stages (network planning, resource allocation, and interference coordination). 2) The deployment of cache servers in mobile CDN. 3) The optimization of networks with QoE in mind.
[76] Proposed NCL self-configuration/optimization algorithms to achieve an automatic, self-optimized handover. The work relied on the processing of CM and PM KPIs using a big data analytics platform.
[77] Developed a three-stage framework that utilizes the network and user KPIs to reach an optimal allocation of radio resources (PRBs).
[60] Presented a framework that uses big data collected from the cellular network to empower SON. They also presented a case study on how to detect sleeping cells using this framework.
[27] Investigated the impact of big data on 5G networks in terms of: 1) Efficient content provisioning. 2) Flexibility in functionality and network deployment. 3) Utilizing user behavior in wireless resource optimization. 4) Achieving highly efficient network operation. 5) Saving energy in HetNet or Multi-RAT networks.
[68] Correlated location data, service usage, and other contextual data to predict the consumption trends and select the optimal node location.
SDN and Inter-Data Center Traffic Prediction:
[85] Dynamic allocation of network resources by relying on traffic predicted by employing the Hadoop platform.
[86] Developed Pythia, a system that uses Hadoop's properties to predict the volume of data communication at runtime in data center networks.
Traffic Reduction:
[87] Proposed Camdoop and ran it over CamCube; its performance surpassed that of Camdoop running on a conventional switch.
Optical Network Optimization / Traffic Prediction:
[91] Used Hadoop to find a solution for the RWA problem.
[59] Predicted future traffic by using big data analytics to reconfigure the VNT regularly.
[93] Employed Hadoop MapReduce to solve multiple optimization problems of bin-packing nature.
Security:
[100] Proposed a threat detection framework to detect peer-to-peer Botnet attacks.
[104] Developed B-dIDS, a system that scans IDS log files to detect multi-pronged attacks in distributed networks.
[106] Surveyed the topic of wireless device fingerprinting methods and how machine learning algorithms can be used for device identification.
[107] Provided a survey discussing the role of data mining and machine learning algorithms in intrusion detection methods.

Big data analytics in the industry
Throughout our survey, we came across several companies that offer network solutions based on big data analytics. These companies and solutions are highlighted in Table 3. It should be noted that these solutions are enabled by research conducted in their corresponding areas. We have added academic research papers related to each solution in Table 3. Due to the proprietary nature of industrial products, the exact algorithms or methods behind these products are not available in the open literature. Therefore, academic papers with related concept(s) are cited. NetReflex IP and NetReflex MPLS utilize big data analytics [27,73,108] to provide services like anomaly analysis and traffic analysis. Nokia provided several solutions targeting the wireless field. For example, Traffica introduces itself as a real-time traffic monitoring tool that analyzes user behavior to gain network insights; similar approaches were presented in academia by the authors of [69,109]. The Wireless Network Guardian detects user anomalies in mobile networks; a comparable topic was discussed in [110]. Preventive Complaint Analysis makes use of big data analytics to detect behavioral anomalies in mobile network elements; the authors in [111] provided a similar approach. Predictive Care utilizes big data analytics to identify anomalies in network elements before they affect the user; a comparable academic approach is presented in [110,112]. HP presented Vertica, a solution that exploits CDRs for network planning, optimization, and fault prediction purposes. The authors in [64,113] researched akin approaches. Amdocs' Deep Network Analytics provides predictive maintenance and proactive network deployment for cellular networks. The authors in [114] presented a similar approach. Log analytics can be used for a variety of purposes. Aprevi's ARLAS solution provided real-time collection and storage of network logs. Related academic research was presented by the authors in [115,116,117].
Examining the above solutions, one can note that the majority of the solutions are in the wireless field. This, in fact, coincides with the orientation of the academically-researched topics. Sampling through the offered solutions, we noticed an increased interest in anomaly prediction and network node deployment, which aim to offer the customer a service that is as close to optimal as possible while minimizing network expansion expenditures.

Big data analytics-powered design cycle and challenges
In this section, we highlight a common theme among most of the surveyed papers, which is depicted in Fig. 4. We also illustrate the challenges facing the implementation of big data analytics in network design and operation.

Table 3: Industrial solutions based on big data analytics
Wireless Network Guardian [110]: Improves end-to-end network analytics and reporting with real-time subscriber-level information. Detects anomalies and reports airtime, signaling, and bandwidth resource consumption. Proactive detection of issues, including automatic detection of user anomalies and low QoE score alerts.
Preventive Complaint Analysis [111]: Detects network elements' behavior anomalies. Predicts where customer complaints might arise and prioritizes network optimization accordingly.
Predictive Care [110,112]: Used for network elements, and proved its effectiveness by helping Shanghai Mobile become more agile and responsive. Accuracy of the simplified alerts is around 98 percent, reducing operational workload.
Examines dropped call records and other maintenance data to determine where to invest in infrastructure. Failure prediction and proactive maintenance.
Amdocs Deep Network Analytics [114]: Combines RAN information with BSS and customer data to deploy the network proactively. Predictive maintenance.

Big data analytics design cycle
The quest for a well-designed communication network is never-ending. Researchers in the big data era rely on the capabilities offered by big data analytics to transform the way networks are being designed. This includes employing big data analytics to predict and minimize the bandwidth utilization, anticipate and prepare for upcoming failures, and predict the precise energy requirements, thereby creating a network with fewer outages, higher user satisfaction, and enhanced performance.
The network design process using big data can be outlined as shown in Fig. 4. Big data is collected from the network, stored, and processed in a big data cluster to extract useful information, such as trends, patterns, and correlations (step 1). The resulting information is then transferred to the decision-making platforms, where algorithms evaluate a new design decision for the network based on the incoming inferred knowledge (step 2). Finally, the new design decision is sent as feedback configuration parameters to the network, where re-configuration is implemented (step 3).
It should be noted that the duration of the above-mentioned cycle might vary depending on the application type of the network, e.g., enterprise, healthcare, agriculture, or transportation. For instance, enterprise networks can generate large amounts of data over a short period, and configuration faults can usually be undone at any time. On the other hand, healthcare networks usually generate less monitoring data over time, and they should not be re-configured until sufficient data is available, as frequent reconfigurations may result in failures with severe impacts on people's health.
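The three steps of the cycle, together with the requirement to wait for sufficient data, can be sketched as a small feedback loop. This is a toy sketch under assumptions that do not come from the surveyed papers: the only configuration parameter is a hypothetical link `capacity`, the extracted information is the mean and peak load, and the decision rule keeps the peak at or below a `target` utilisation.

```python
import math


def analyse(samples):
    """Step 1: reduce raw load measurements to a few useful figures."""
    return {"mean": sum(samples) / len(samples), "peak": max(samples)}


def decide(info, config, target=0.7):
    """Step 2: propose a capacity that keeps the peak at or below the target utilisation."""
    if info["peak"] / config["capacity"] > target:
        return {"capacity": math.ceil(info["peak"] / target)}
    return None


def design_cycle(samples, config, min_samples):
    """Steps 1-3, skipped entirely until enough data has been collected."""
    if len(samples) < min_samples:
        return config
    decision = decide(analyse(samples), config)
    if decision is None:
        return config
    return {**config, **decision}  # step 3: feed the new parameters back


# A cautious network (large min_samples) keeps its configuration for now,
# while a network that tolerates frequent changes reconfigures immediately.
print(design_cycle([50, 60, 90], {"capacity": 100}, min_samples=5))
print(design_cycle([50, 60, 90], {"capacity": 100}, min_samples=3))
```

The `min_samples` argument plays the role of the application-dependent cycle duration: a larger value postpones reconfiguration until more monitoring data has accumulated.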

Challenges facing the use of Big Data Analytics in Network Design

Network size vs Big Data Analytics gains
The ease of redesigning a network through the feedback cycle shown in Fig. 4 is highly affected by the network size, i.e., the number of nodes.
For instance, large data streams can be generated from the mass deployment of small Wireless Sensor Network (WSN) nodes and IoT devices [118]. The collected data may not carry a meaningful value until it is effectively analyzed. However, analyzing or mining that immense amount of data demands finely tuned big data analytical capabilities, which turns out to be a challenging task [119]. Furthermore, these massive amounts of data require hierarchical communication and data processing solutions. The planning of such deployments in conjunction with the data processing framework is a challenging task [118].
Comparing optical to IoT networks, the former has a small number of nodes and is hence easier to redesign, while the latter has a larger number of connected objects, which can pose a problem.

Security and privacy
Users' common patterns can be of great help. Network users can share certain patterns, like downloading some popular videos, retweeting about an upcoming game taking place downtown, or even checking the same online channels. This information can be of great value when used for network planning or optimization. However, to use this information, access to user data has to be obtained, which is a thought that may cause unrest for many.
When dealing with user data, two issues are always raised: the security and the privacy of the data. This is why big data has to be protected from unauthorized access and release [35].
Big data security is a vital topic. If we want to label a system as "secured", it must meet the data security requirements, which are [120]:
1-Confidentiality: This implies the means to protect the data from unapproved disclosure.
2-Integrity: This implies the measures taken to protect the data from being modified improperly or without permission.
3-Availability: This is the system's ability to prevent and recover from hardware as well as software failures that might result in the database system being unavailable.
Privacy of data is an increasing concern. As a matter of fact, having accessible data does not mean it is ethical to access it [121]. Electronic health records have strict laws that precisely identify what can and cannot be accessed.
As an example, a user's location information can be tracked through cell towers. After a while, the user leaves "a trail of crumbs" that could be used to link the user to their residence or office location and eventually to determine the user's identity. Private health information (e.g., attending a cancer treatment center) or religious preferences (e.g., attending a church) may also be discovered by tracking the user's movement over time [122], especially given the close correlation between an individual's identity and their movement patterns [123]. Some user data can be very valuable. For example, the estimated value of all global personal location data could reach $100 billion in revenue during the next 10 years for service providers, and up to $700 billion for consumers and business end users [39].
With no obvious and secure way to handle the collected user data, big data analytics cannot be considered a reliable system. The security issues related to big data analytics can be divided into four concerns: input, data analysis, output, and communication with other systems [124]. For the input (e.g., a handheld device, a sensor, or even an IoT device), protecting the sensors from being compromised by attacks is regarded as an important security issue. It should be noted that these concerns are present in all steps throughout the design cycle shown in Fig. 4. A solution that has been designed to address the big data security and privacy challenge is the integrated Rule-Oriented Data System (iRODS) [125]. This technology was designed to ensure security and privacy in big data, and its features include a federated data grid or "intelligent clouds", a distributed rule engine, the "iCAT" metadata catalogue, a storage access layer that facilitates common access, two interfaces (graphical and command line), and APIs to interact with the iRODS data grid [35,125].
In a position paper, the authors of [126] noted a number of privacy-preserving challenges in the realm of big data analytics, and these challenges are classified as follows:
1-Individuals' Interaction:
a. Transparency: Big data analytics is mostly associated with the collection and processing of specific individuals' data. This means that each individual is entitled to know about the data processing operations conducted on his/her data, and the challenging part is in locating the specific pieces of information linked to that person's identity.
b. Individual's Consent: According to many privacy laws, an individual is entitled to be asked for his/her informed consent, and such consent is a way of ensuring the individual is aware of the type of processing that is conducted. Obtaining this type of consent, along with the explanation it requires, is in fact considered challenging.
c. Consent Cancellation and Discarding Personal Data: The right to grant consent should also include the right to revoke it. However, if an individual wishes to cancel his/her consent, then all of his/her personal data has to be erased as well. This is a challenging requirement when considering the fact that the data might have been spread to various data collectors and data analysts.

Data center scalability
In the big data paradigm, data centers are not only a platform to concentrate data storage, but can also carry out further responsibilities, such as acquiring, managing, organizing, processing, and leveraging data values and functions. That would encourage the growth of the infrastructure and related software [36].
The continuous expansion in data volume, coupled with the ever greater demand for faster processing speeds and the increasing complexity of Relational Database Management Systems (RDBMSs), are considered the main elements motivating the hunt for expandable (scalable) data centers that can handle the data volume and parallel processing requirements. Hence, a number of technical challenges have to be taken into consideration when designing a scalable data center that can efficiently store, process, and analyze big data. These challenges can be mapped to the middle octagon (big data cluster) shown in Fig. 4, and they are:
• Taking into consideration the variety and sheer volume of the disparate data sources, just collecting and integrating data with scalability from scattered locations is a difficult task to accomplish.
• Massive datasets must be mined by big data analytics at different levels and in either a real-time or near real-time fashion.
• Massive and heterogeneous datasets are to be stored and managed by big data systems while providing the function and performance guarantees needed in terms of fast retrieval, scalability, and privacy protection.
Facebook is a clear example in that particular matter, as it needs to store, access, and analyze over 30 petabytes of user-generated data [39].
Although some might claim that the current problem is not about storage (large volume) but about the online processing ability [11], a scalable data center should also incorporate a scalable storage system. Nonvolatile memory (NVM) technologies are expected to have a promising role in future memory/storage designs [127].
An ideal storage platform has three vital constraints to meet: it should support efficient data access in case of failure (network partitions and node failures), offer its clients a consistent view of the data, and provide high availability. However, according to Brewer's CAP theorem [128], this ideal system cannot exist, because consistency cannot be guaranteed and high availability offered at the same time in the presence of network partitions. As a result, distributed storage systems have to relax one of the above constraints [127].
When it comes to securing the required processing speed, Chip Multiprocessors (CMPs) are expected to be the computational platform for big data analytics [127]. Targeting this emerging trend, Datacenter-on-Chip (DoC) architectures were proposed by the authors of [39], with four usage models that depend on the state of the consolidated applications, i.e., whether or not they are cooperating. Key scalability challenges were identified and addressed, namely cache hierarchies and the lack of performance isolation [127,129].

Open research directions
1-Processing minimization: The first step in the processing of big data is the collection of data and performing preprocessing.Data cleaning is one form of data preprocessing.One particular example where pre-processing might be implemented is using Computational Radio Frequency Identification (CRFID) sensors.In this approach, wireless sensors can be wirelessly powered using technologies like magnetically powered resonance [130], upon proximity to a moving collector object (e.g., a vehicle).This would enable the movement of some of the pre-processing tasks towards the CRFID sensors' side, thus collecting an already cleaned and reduced amount of data that is transferred to the relay before moving it to the data center for final processing.This would allow more efficiency, reduce the analysis time, allow for better storage utilization, and facilitate real-time analytics.As a result, it would lead to faster decision making and an optimally-optimized network.2-Facilitating satellite based Internet connectivity in highly populated and poor areas: Projects like SpaceX are already emerging with more than 4000 satellites and more than 1$ billion combined funding, the project announced by Elon Musk [131], intends to provide high-speed Internet satellites worldwide.By utilizing big data analytics in the field of satellite communication networks, this will focus more power in a selective fashion.The result is less signal-reception requirements, e.g., smaller antenna size and lower Block Up Converter (BUC) power in the above-mentioned areas.Big data analytics can be used to correlate ground data, e.g., geographical info and weather conditions, along with economy-related data to help identify these areas.3-Efficient use of idle time: Big data analytics can be used by operators to help them run their own data and discover patterns that would facilitate service and network optimization.However, analytics may not be a 24/7 job, especially if it is a batch process.Hence, this would 
leave the equipment and the software in an idle state.An operator may offer the use of his/her equipment to his/her clients from medium and small sized businesses.They could run their data during the idle time, which would offer better energy utilization, provide big data analytics for everyone, and create another source of money where everyone is benefiting.Game theory approaches can be harnessed here to coordinate resource provisioning among several providers.4-Analytics reuse: Cellular networks have high similarity in terms of equipment capabilities, specifications, subscriber requirements, and subscriber geographic distribution.
Those operators can benefit from other operators' big data analytics: the results of the analysis can be applied directly or after small modifications, for example by omitting the parts associated with features that differ between the two networks. This would reduce the purchasing cost, minimize the energy consumption, and reduce the optimization time by adopting a proven solution. Another challenge here is to provide standard APIs between the different operators' equipment so they can access each other's data in an agreed-upon manner.
5-Big data and IoT node placement: The main cause for the increase in IoT sensors is the desire to collect more data, which, in turn, will result in enhanced control or comprehension. According to HP, by 2030, IoT sensors will reach one trillion, and this will make IoT data the most significant part of big data [36]. However, gathering data efficiently requires placing the IoT sensors where they can harvest as much data as possible. Many sensors are simply wasted because they are placed in the wrong location (one that does not provide a valuable amount or type of data). Big data can be used to identify these IoT sensors and propose better locations, especially when coupled with other information, like weather conditions, social activities, and movement patterns. To reach the optimal IoT network design, big data analytics can correlate several parameters (e.g., traffic patterns, social events, network parameters, and weather conditions) to determine the best locations to place the sensors.
6-Providing test environments for critical applications: Collecting large amounts of processed data may not be enough to proceed with network reconfiguration. This has to be considered for some critical applications (e.g., health care, military, and aerospace) where human lives could be jeopardized. The design cycle has to comprise an additional test environment in which the proposed design modification has to undergo a certain test cycle before being put to work, although this might postpone the ratification of the newly proposed design. There will always be a trade-off between accuracy and speed. It is true that waiting for sufficient data to be accumulated would pay off as a better decision-making step, but that rule alone is not suitable for critical applications (e.g., medical networks), where the proposed design has to undergo a thorough test first. Identifying these applications and providing suitable test environments is a very important task.
7-Selecting the most efficient energy source for network nodes: Another aspect that can be added for a greener network is the ability to selectively utilize energy sources based on the correlation of energy source attributes and their ability to serve a particular task. For example, a solar energy source can be ideal for outdoor usage during the day when it is sunny, with a backup plan to switch to other sources (i.e., electrical) during special events or bad weather conditions. This can be the case for IoT devices scattered in a business district, where they are mostly utilized during the day, while running idle after the usual office hours.
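As an illustration of the energy source selection described in item 7, the following minimal Python sketch encodes the solar/backup rule from the example above. It is only a sketch under stated assumptions: the `Conditions` fields, the `select_energy_source` function, the fixed daylight window, and the label "grid" for the electrical backup source are all hypothetical and are not taken from any surveyed system. In practice, the inputs and thresholds would be derived from the analytics rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Conditions:
    """Hypothetical inputs that an analytics pipeline might supply for one node."""
    hour: int            # local hour of day, 0-23
    is_sunny: bool       # taken from a weather feed (assumed input)
    special_event: bool  # e.g., an event expected to raise the node's load


def select_energy_source(c: Conditions) -> str:
    """Return "solar" or "grid" for an outdoor IoT node.

    Rule (assumed, following the example in the text): use solar during
    sunny daytime hours, and fall back to the electrical backup source
    during special events, bad weather, or outside daylight hours.
    """
    daytime = 7 <= c.hour < 19  # assumed daylight window, not from the source
    if daytime and c.is_sunny and not c.special_event:
        return "solar"
    return "grid"


if __name__ == "__main__":
    print(select_energy_source(Conditions(hour=12, is_sunny=True, special_event=False)))
    print(select_energy_source(Conditions(hour=12, is_sunny=False, special_event=False)))
```

A fixed rule is used here only for readability; a data-driven design could replace it with a model learned from historical weather, load, and event data.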

Conclusions
There are many areas in which big data analytics can be utilized in the network design process. Gathering network data and correlating them with user trends and service requirements can indeed create an adaptive and user-centric network design.
Throughout our survey, we noticed a strong focus on wireless communication network design using big data. Delving deeper reveals that the field of 5G is getting the majority of the researchers' attention due to the new opportunities it has to offer. The optical networking, inter-DC, and SDN fields, on the other hand, still have further research challenges to tackle. We also note that the integration of SDN and big data analytics would facilitate the refinement of the design cycle. The field of network security also has its share, where big data analytics is utilized to detect security threats.
Industrial efforts toward optimizing networks based on big data analytics reflect the increasing trend toward employing AI-like approaches, such as pattern recognition and machine learning, for network design. Some of the considered solutions handle big data in a batch manner, while others are capable of performing real-time processing. Handling big data in batch mode can offer more accurate information at the expense of delayed results due to the size of the processed data, while real-time processing offers fast results at the expense of accuracy. Hence, the choice between the two is application-dependent.
We predict that the field of network design based on big data analytics will continue to flourish in the near future as more data are collected from the networks and processed to extract useful information regarding network behavior. In the far future, or maybe quite soon, as some claim, employing quantum computing for machine learning purposes could help in dethroning Moore's law and providing more processing capacity per unit time. This extra capacity can be harnessed for big data analytics employed in network design.

Fig. 3. Percentage of surveyed research topics according to subject area.

Table 1: Various big data dimensions.

Table 3: Big data analytics-powered industrial solutions.