A Novel Framework of Data-Driven Networking

Many communities have researched the application of novel network architectures, such as content-centric networking (CCN) and software-defined networking (SDN), to build the future Internet. Another emerging technology, big data analysis, has also attracted considerable attention from both academia and industry. Much excellent research has been done on CCN, SDN, and big data, but these topics have been addressed separately in the traditional literature. In this paper, we propose a novel network paradigm that jointly considers CCN, SDN, and big data, and we present its architecture, internal data flow, big data processing, and use cases that indicate its benefits and applicability. Simulation results show the potential benefits of the proposed network paradigm. We refer to this novel paradigm as data-driven networking.


I. INTRODUCTION
The current Internet architecture, built on TCP/IP, has been hugely successful and has become one of the indispensable infrastructures for daily life, the economy and society. However, burgeoning megatrends in the information and communication technology (ICT) domain are pushing the Internet toward pervasive accessibility, broadband connection and flexible management, which call for potential new Internet architectures. The original design principle of the Internet, ''leaving the complexity to hosts while maintaining the simplicity of the network'', leads to the almost insurmountable challenge known as ''Internet ossification'': software in the application layer has developed rapidly, and the capabilities of the application layer have been drastically enriched. By contrast, protocols in the network layer lack scalability and the core architecture is hard to modify, which means that new functions have to be implemented through myopic and clumsy ad hoc patches to the existing architecture. For example, the transition from IPv4 to IPv6 is difficult to deploy in practice [1].
To improve the performance of the current Internet, research communities have proposed novel network architectures for building the future Internet, such as Content-Centric Networking (CCN) and Software-Defined Networking (SDN). CCN is a rising networking paradigm centered on content distribution rather than host-centric connectivity [2]. SDN is another paradigm, which separates the control plane from the forwarding plane, breaks vertical integration, and makes the network programmable [3].
In addition, big data has attracted great attention in both academia and industry. Big data refers to data sets that are so large and complex that traditional data management tools or processing methods are inadequate to manage and analyze them. Big data is popularly characterized by ''5Vs'' (initially it was described by ''3Vs'', and two have been added recently): Volume (the size of data sets), Variety (the range of data types and sources), Velocity (the speed of data in and out), Value (how useful the data is), and Veracity (the quality of data) [4]. It could bring many advantages to networking, such as management (recognize-requirement) and intelligence (recognize-variation), and has the potential to improve how we operate and manage data networks. In this regard, some excellent works have explored the role of big data in traditional networks. The work in [5] introduces and exploits the features and categories of mobile data, which is highly beneficial to wireless networks. On this basis, some big-data-enabled architectures for wireless networks are proposed in [6] and [7]. Big data also plays a key role in other networks, such as those studied in [8]-[10].
Although much excellent research has been done on CCN, SDN, and big data for networks, these topics have all been addressed separately in the traditional literature. Nevertheless, for the reasons listed below, it is necessary to consider them together to offer better services in the future network.
• Firstly, CCN is considered one of the promising architectures for efficient content distribution across the Internet. The paradigm shift from host-centric to content-centric networking has many attractive advantages, for example reduced network load and low latency. There is a growing body of research in this field, such as NDN [11], PURSUIT [12], and SAIL [13]. The major feature of CCN is in-network caching [14]. This feature has a significant impact on providing content to users in SDN. In this paper, we propose to add caching capacity to SDN switches, which allows them to cache content so as to reduce the content response time and improve the user experience. When a certain content object is sent in reply to a user's request, it is cached in the SDN switches along the way back to the request originator. With this in-network caching capacity, the performance of SDN can be improved in terms of content distribution latency.
• Secondly, SDN can contribute to the advancement of CCN. One of the biggest challenges for globally optimal network and content cache management in CCN is that it is inherently distributed: every node has only a partial view. SDN, in contrast, has a centralized control plane that is decoupled from the data plane, which helps CCN allocate cache resources, distribute content, and configure networks globally. For example, on learning that a switch needs more cache resources, the control layer can send flow tables to the infrastructure layer to allocate more cache to this switch.
• Finally, big data has a profound impact on the design and operation of SDN. In particular, with its global view of the network, the logically centralized controller in SDN can obtain big data from all the different layers (i.e., from the infrastructure layer to the application layer) with arbitrary granularity. Using big data analysis in the control layer of SDN, we can extract knowledge from large volumes of data to help the controller make decisions. For example, with big data analysis, we can learn which content in a certain switch is highly popular. Based on the analysis results, the control layer can redistribute content so that the required content is placed close to the specific users.
For the above reasons, we adopt the centralized control of SDN, combined with the distributed in-network caching provided by CCN. In this context, big data analysis extracts knowledge about the network and uses that knowledge to control the whole network through the centralized control capabilities offered by SDN. We refer to this new paradigm, which brings together SDN, CCN and big data analysis, as Data-Driven Networking (DDN).
The rest of this paper is organized as follows. We describe the Data-Driven Networking (DDN) paradigm and how it operates in Section II. In Section III, the data flow in Data-Driven Networking is discussed. Section IV presents the data processing technology. We then describe relevant use cases to show the validity and applicability of the DDN paradigm in Section V. Finally, we conclude the study in Section VI.

II. DATA-DRIVEN NETWORKING ARCHITECTURE
The reference model for Data-Driven Networking is shown in Fig. 1. This model consists of three layers, namely the infrastructure layer, the control layer and the application layer, with two interfaces (i.e., the south interface and the north interface), in a bottom-up manner. The infrastructure layer is responsible for storing content and for monitoring and forwarding data packets. We embed content caches in traditional SDN switches to provide content to users, which significantly reduces content response time and improves the user experience. When a user's request arrives at a switch, the switch first looks for the needed content in its own cache. If the content is in the cache, the switch responds to the request immediately. Otherwise, the switch retrieves the content from the source by packet forwarding. In addition, there are monitor agents in the content caches. They are responsible for collecting content data, cache data, and network data, and for sending them to the big data analysis module in the control layer through the south interface. Finally, as in the traditional SDN data plane, switches consist of forwarding hardware, operate according to flow tables without awareness of the overall network, and update their configuration. They can also adjust the size and content of their caches based on instructions from the control layer, which ensures reasonable cache reallocation and content redistribution.
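The behavior of a smart cache switch described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this paper: the class name `SmartCacheSwitch`, the `fetch_from_source` callback, the least-recently-used eviction rule and the fields of `report()` are all assumptions introduced for the example.

```python
from collections import Counter, OrderedDict


class SmartCacheSwitch:
    """Hypothetical SDN switch with an embedded content cache and a monitor agent."""

    def __init__(self, cache_size, fetch_from_source):
        self.cache_size = cache_size                # maximum number of cached objects
        self.cache = OrderedDict()                  # name -> content, oldest entry first
        self.fetch_from_source = fetch_from_source  # stands in for packet forwarding
        self.requests = Counter()                   # monitor: requests per content name
        self.hits = Counter()                       # monitor: cache hits per content name

    def handle_request(self, name):
        self.requests[name] += 1
        if name in self.cache:                      # content is cached: reply locally
            self.hits[name] += 1
            self.cache.move_to_end(name)
            return self.cache[name]
        content = self.fetch_from_source(name)      # otherwise retrieve it from the source
        self._store(name, content)                  # cache it on the way back
        return content

    def _store(self, name, content):
        if self.cache_size <= 0:
            return
        while len(self.cache) >= self.cache_size:
            self.cache.popitem(last=False)          # assumed policy: evict least recently used
        self.cache[name] = content

    def resize(self, new_size):
        """Apply a cache-size instruction received from the control layer."""
        self.cache_size = new_size
        while len(self.cache) > max(new_size, 0):
            self.cache.popitem(last=False)

    def report(self):
        """Data the monitor agent would send through the south interface."""
        return {
            "cache_size": self.cache_size,
            "remaining": self.cache_size - len(self.cache),
            "requests": dict(self.requests),
            "hits": dict(self.hits),
        }
```

For instance, `SmartCacheSwitch(2, lambda name: "data:" + name)` creates a switch with room for two objects whose misses are served by a stub source; calling `handle_request` twice for the same name yields one miss followed by one hit in `report()`.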
The control layer connects the infrastructure layer and the application layer via the two interfaces. The control layer is the core of this architecture and is where the complexity resides. It consists of two parts, namely the big data analysis module and the ordinary SDN controller module.
The application layer includes diverse application services to satisfy users' requirements. By taking advantage of the north interface and the control layer, applications can access the global network view and programmatically implement strategies that leverage the physical networks at the infrastructure layer using a high-level language.
The cooperation among these components is described in the following section.

A. FROM SDN SMART CACHE SWITCH TO BIG DATA ANALYSIS MODULE
The big data analysis module aims to collect enough data, with arbitrary granularity, to build a complete view of the network. Therefore, while switches forward packets, the monitor agents in the switches gather content data, cache data, and network data, and send them in real time to the big data analysis module in the control plane through the south interface. The most relevant data gathered by switches is as follows.
• User data: it consists of cache data, content data and history request data from SDN smart cache switches. In detail, cache data includes the total cache size and the remaining cache size of each cache. Content data covers content types (e.g., text, picture or video), specific content information (e.g., a learning text about mathematics, a landscape painting of Big Ben, or the movie Interstellar), the bandwidth used for transferring the content, and the content's requirements for time delay and packet loss rate. History request data mainly records the number of requests from each user for each content object. User data is described in Table 1. The volume of the above data is huge. According to the latest data released by the International Telecommunication Union (ITU), the number of global Internet users was 3.2 billion by the end of 2015, accounting for almost 40 percent of the world's population. In the SDN paradigm, the data related to the control information of users is enormous. Meanwhile, the content of the Internet is becoming ever more abundant. In 2014, the number of global websites was more than 1.06 billion, and this number is still growing. In China, as of June 2015, according to the 36th China Internet Network Development State Statistic Report, the number of websites was 3.57 million. The average number of webpages per website was 46,900, an increase of 2.3 percent, and the number of bytes per webpage was 50 KB, an increase of 19 percent, which shows how rich the content of the Internet has become. In brief, user data and network data are tremendous in volume, and the network needs big data analysis.
The big data analysis module builds a collective-intelligence data architecture with built-in analytics capabilities; its detailed processing procedure is presented in Section IV. After gathering data from switches, the big data analysis module extracts knowledge from the large volumes of data. This knowledge is used to build a traffic model, a user behavior model and a content popularity model so as to understand the network status, users' demand and the popularity of content. Based on the cache data, we use big data analysis to learn which switches receive content requests more frequently in order to adjust cache allocation. Based on the content data, we use big data analysis to learn which content in a certain switch is in higher demand in order to readjust content distribution. Based on the network status, we use big data analysis to learn which links are more prone to congestion and which switches are more likely to lose packets in order to make reasonable forwarding decisions.
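As a toy illustration of the kind of knowledge described above, the following Python sketch derives per-switch content popularity and each switch's share of requests from a request log. The log format, a sequence of `(switch_id, content_name)` pairs, and the function names are assumptions made for this example; a real deployment would run such aggregation on a big data platform rather than in memory.

```python
from collections import Counter, defaultdict


def content_popularity(request_log):
    """Count requests per content name at each switch.

    request_log: iterable of (switch_id, content_name) records (assumed format).
    """
    per_switch = defaultdict(Counter)
    for switch_id, content_name in request_log:
        per_switch[switch_id][content_name] += 1
    return per_switch


def top_content(per_switch, k):
    """Return the k most requested content names at each switch."""
    return {
        switch_id: [name for name, _ in counts.most_common(k)]
        for switch_id, counts in per_switch.items()
    }


def request_share(per_switch):
    """Return each switch's fraction of all requests, a possible input to cache allocation."""
    total = sum(sum(counts.values()) for counts in per_switch.values())
    if total == 0:
        return {}
    return {
        switch_id: sum(counts.values()) / total
        for switch_id, counts in per_switch.items()
    }
```

The output of `top_content` corresponds to the question of which content should be placed at a switch, while `request_share` corresponds to the question of which switch deserves more cache.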

B. FROM BIG DATA ANALYSIS MODULE TO SDN CONTROLLER
After making decisions such as cache reallocation, content redistribution and network management policies, the big data analysis module sends them to the SDN controller. The SDN controller then translates these policies into specific control actions. Because the SDN controller has the global network view and manages all devices in the data plane uniformly, it can form accurate control actions that carry out the analytical results. Finally, the SDN controller packages the actions into flow tables.
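The translation step can be pictured with the following Python sketch. It is purely illustrative: the `ControlAction` record, the action names `SET_CACHE_SIZE` and `CACHE_CONTENT`, and the layout of the `decisions` dictionary are hypothetical and do not correspond to any existing flow-table format or controller API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlAction:
    """Hypothetical control action addressed to one switch."""
    switch_id: str
    action: str        # "SET_CACHE_SIZE" or "CACHE_CONTENT" (assumed names)
    argument: object


def translate_decisions(decisions, known_switches):
    """Turn analysis decisions into per-switch actions.

    decisions (assumed layout):
        {"cache_sizes": {switch_id: size},
         "placements": {switch_id: [content_name, ...]}}
    known_switches: the switches in the controller's global view;
    decisions about unknown switches are ignored.
    """
    actions = []
    for switch_id, size in sorted(decisions.get("cache_sizes", {}).items()):
        if switch_id in known_switches:
            actions.append(ControlAction(switch_id, "SET_CACHE_SIZE", size))
    for switch_id, names in sorted(decisions.get("placements", {}).items()):
        if switch_id in known_switches:
            for name in names:
                actions.append(ControlAction(switch_id, "CACHE_CONTENT", name))
    return actions
```

Filtering against `known_switches` is one simple way in which the controller's global view can be used to reject decisions that cannot be executed.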

C. FROM SDN CONTROLLER TO SDN SMART CACHE SWITCH
The SDN controller sends flow tables to the switches through the south interface to configure the infrastructure layer on the basis of the decisions made by big data analysis. In detail, the behavior of switches and caches is changed, including packet forwarding rules, cache allocation and content distribution, in order to maximize link and cache resource utilization, optimize content distribution and minimize link congestion.

IV. BIG DATA PROCESSING PLATFORM
The big data processing platform in this architecture consists of the following three parts: data cleaning, data storage, and an algorithm set that includes a big data algorithm set and an optimization algorithm set. The platform is shown in Fig. 3. Firstly, the multi-source data comes from different traffic, different hardware platforms, different operating systems and so on. Inevitably, data quality issues exist in the system, for example similar or repetitive data, abnormal data and incomplete data, all of which can be called ''dirty data''. Data cleaning is the last mechanism for finding and correcting such dirty data in data files. The procedure of data cleaning is shown in Fig. 4. Based on an analysis of the causes and existing forms of dirty data, we detect dirty data in the system by technical methods. The detected dirty data is then transformed into normal data that meets the data quality requirements. The idea behind data cleaning is backtracking: the data set is analyzed from its source, and every place the data set passes through is examined, so as to derive data cleaning algorithms, rules and policies. These algorithms, rules and policies are then applied to the data set to find and clean dirty data. After data cleaning, the accuracy of the data is improved.
Secondly, after data cleaning, the data set is stored for further processing. Since the data size in this architecture is huge, traditional SQL databases are not suitable. We consider NoSQL a better choice for this architecture. NoSQL, short for Not Only SQL, adopts flexible, distributed and extensible data storage management to meet the needs of big data storage. Classified by storage model, NoSQL includes extended column storage, such as Google BigTable, HBase and Cassandra; key-value storage, such as Redis and Berkeley DB; graph storage, such as Neo4j and InfiniteGraph; document storage, such as MongoDB and CouchDB; and so on.
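The three kinds of dirty data named in the data cleaning step (repetitive, incomplete and abnormal data) can be illustrated with a minimal Python sketch. The record fields `switch`, `content` and `requests`, as well as the threshold `max_requests`, are assumptions made for the example. Note also that this sketch simply discards dirty records, whereas the procedure described above transforms them into normal data where possible.

```python
def clean_records(records, required=("switch", "content", "requests"),
                  max_requests=10**6):
    """Drop incomplete, abnormal and repeated monitoring records.

    records: list of dicts with the (assumed) fields named in `required`.
    max_requests: assumed plausibility bound for the `requests` counter.
    """
    seen = set()
    cleaned = []
    for record in records:
        # Incomplete data: a required field is missing or empty.
        if any(record.get(field) is None for field in required):
            continue
        # Abnormal data: the request counter is outside the plausible range.
        if not 0 <= record["requests"] <= max_requests:
            continue
        # Repetitive data: an identical record has already been kept.
        key = tuple(record[field] for field in required)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned
```

In practice, the rules applied at each step would be derived by the backtracking analysis of the data sources rather than fixed in advance as they are here.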
Thirdly, the big data algorithm set contains many big data algorithms, such as C4.5, K-means, support vector machines (SVM), Apriori, EM, PageRank, AdaBoost, KNN, Naive Bayes, CART and so on. The optimization algorithm set likewise contains many optimization algorithms, such as the conjugate gradient algorithm, the steepest descent algorithm, the penalty function algorithm, the simplex algorithm, the Newton algorithm, the Levenberg-Marquardt algorithm, the variable elimination algorithm, and the gradient projection algorithm. The algorithm set also has an external interface through which new algorithms can be added. Different types of data enter the algorithm set. Depending on its characteristics, each type of data is processed by an appropriate big data algorithm from the set. Depending on the optimization objective, an appropriate optimization algorithm is likewise selected from the set. Through the big data algorithms and optimization algorithms, we can obtain optimal solutions for the given optimization objectives.

V. USE CASES
In this section, we present a specific use case that explains the workflow, together with a specific example of DDN. We also give the simulation performance of the DDN paradigm.

A. WORKFLOW
A user submits a content request to its nearest SDN smart cache switch. If the content is cached in this switch, the switch replies to the user's request immediately; if the switch knows from its flow tables how to obtain the content from other switches, it fetches the required content accordingly and then replies. Otherwise, the switch uploads the request to the controller and asks for the flow tables needed to reply to the user's request. This process is the same as in traditional SDN, and there is no need for the big data analysis module to make decisions for this flow. Meanwhile, the monitor in the switch records the request frequency of this content and whether the request was a hit. At regular intervals, the big data analysis module gathers data from the infrastructure layer, learns the network status, performance and user behavior, and sends the results to the controller, which benefits overall network management, cache resource management and content distribution management.
For example, a certain user sends a content request for the film ''Avatar'' to its nearest SDN smart cache switch. Unfortunately, Avatar is not cached in this switch, and the switch does not know from its flow tables how to get Avatar. So the switch uploads the request to the controller to ask for the flow tables, and retrieves Avatar based on them. After a period of time, the big data analysis module learns that the users connected to this switch often ask for Avatar, and reports this result to the controller. The controller then sends flow tables that cause Avatar to be cached in this switch so as to reduce users' waiting time for Avatar.

B. SPECIFIC EXAMPLE
For example, suppose our objective is to minimize network flow by deciding how each content router caches content in the network and how the limited cache resource is allocated among content routers. The overall problem can be formulated as follows. To begin, we define the following quantities, which can be pre-computed or defined.
3) $C_{sum}$: the sum of the cache sizes of all caches in the system.
4) $S_k$: the size of content object $O_k$.
5) $d_{ijk}$: the hop distance for node $i$ to request content object $O_k$ from node $j$.
6) $q_i^k$: the request rate for content object $O_k$ at node $i$.
We introduce the following variables:
1) $y_{ijk}$: takes the value 1 if node $i$ downloads a copy of content object $O_k$ from node $j$, and 0 otherwise.
2) $x_{ik}$: takes the value 1 if node $i$ caches a copy of content object $O_k$, and 0 otherwise.
3) $C_i^t$: the cache size of node $i$ at time $t$.
The formulation then minimizes the network flow subject to four constraints. The first constraint specifies that each content router $i$ downloads object $k$ from only one content router. The second constraint specifies that content router $i$ downloads object $k$ from content router $j$ only when the content object is located there. The third constraint, $\sum_{k=1}^{M} x_{ik} S_k \le C_i^t, \forall i$, specifies that the total size of the content objects located in content router $i$ cannot exceed its current maximum cache size. Finally, the constraint $\sum_{i=1}^{N} C_i^t = C_{sum}, \forall t$ specifies that the total cache size over all content routers remains the same.
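One way to write the problem described above as an integer program is sketched below. The objective function is an assumption: it weights each download by request rate, object size and hop distance, which is one plausible reading of ''network flow'' given the quantities defined above. Likewise, $N$ (the number of content routers) and $M$ (the number of content objects) are inferred from the summation limits of the third and fourth constraints, and the first two constraints are our rendering of their verbal descriptions.

```latex
\begin{aligned}
\min_{x,\,y,\,C^t}\quad
  & \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=1}^{M} q_i^k \, S_k \, d_{ijk} \, y_{ijk} \\
\text{s.t.}\quad
  & \sum_{j=1}^{N} y_{ijk} = 1, && \forall i, k \\
  & y_{ijk} \le x_{jk}, && \forall i, j, k \\
  & \sum_{k=1}^{M} x_{ik} S_k \le C_i^t, && \forall i \\
  & \sum_{i=1}^{N} C_i^t = C_{sum}, && \forall t \\
  & x_{ik},\; y_{ijk} \in \{0, 1\}, && \forall i, j, k
\end{aligned}
```

Under this reading, a request served from a router's own cache contributes nothing to the objective whenever $d_{iik} = 0$, so placing popular objects close to the routers that request them reduces the total.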
By leveraging the big data algorithms and optimization algorithms in the algorithm set, we can find the optimal solution to this problem. The big data processing platform then exports the optimal solutions to the controller. Using these optimal solutions, the controller manages traffic, content allocation and cache distribution.
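To make the optimization step concrete, the following Python sketch solves a very small instance by exhaustive search. It is a simplification under stated assumptions, not the method used in this paper: cache sizes are fixed rather than optimized, the hop distance `d[i][j]` does not depend on the object, a hypothetical `origin` node stores every object so that each request can always be served, and the cost weights each download by request rate, object size and hop distance. Exhaustive search is only feasible for toy instances; realistic sizes require the optimization algorithms of the algorithm set.

```python
from itertools import product


def total_flow(x, q, S, d):
    """Cost of placement x when each node fetches every object from its nearest copy."""
    n, m = len(q), len(S)
    cost = 0
    for i in range(n):
        for k in range(m):
            holders = [j for j in range(n) if x[j][k]]
            if not holders:
                return float("inf")          # no node can serve object k
            cost += q[i][k] * S[k] * min(d[i][j] for j in holders)
    return cost


def best_placement(q, S, d, capacity, origin=0):
    """Enumerate all feasible placements and return (lowest cost, placement)."""
    n, m = len(q), len(S)
    free = [(i, k) for i in range(n) if i != origin for k in range(m)]
    best_cost, best_x = float("inf"), None
    for bits in product((0, 1), repeat=len(free)):
        # The origin (an assumption of this sketch) always holds every object.
        x = [[1 if i == origin else 0 for _ in range(m)] for i in range(n)]
        for (i, k), bit in zip(free, bits):
            x[i][k] = bit
        # Cache-size constraint for every node except the origin.
        if any(sum(x[i][k] * S[k] for k in range(m)) > capacity[i]
               for i in range(n) if i != origin):
            continue
        cost = total_flow(x, q, S, d)
        if cost < best_cost:
            best_cost, best_x = cost, x
    return best_cost, best_x
```

For three nodes on a line with the origin at one end, unit-size objects and one cache slot per router, the search places at each router the object it requests most often.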
With these considerations, we implement an experimental simulation. Based on the OSPF-based routing protocol for named data networking (OSPFN), the performance of the proposed architecture is evaluated using the ndnSIM 1.0 simulator and Hadoop.
In ndnSIM, we simulate content requests, replies and distribution, as well as data encapsulation and content forwarding. We then obtain the necessary user data from ndnSIM and send it to Hadoop. With the help of big data analysis in Hadoop, we extract knowledge from the large volumes of data produced by ndnSIM and derive decisions on how to allocate cache resources and distribute content. Finally, we send those decisions back to ndnSIM to reallocate cache resources and redistribute content.

1) SIMULATION SETTINGS
• Network Topologies: The simulation is carried out on a power-law topology with 64 content routers.
• Input Data: There are 200 different content objects in the simulation. We assume that each object has the same size and that the content popularity follows the Zipf distribution with skewness factor α = 0.8 [15].
• Cache Size: We express the cache size of each content router as a proportion, that is, relative to the total amount of different content in the network. We evaluate the network performance of each caching scheme as the cache size varies from 1% to 10%.
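The input-data and cache-size settings above can be reproduced numerically with a short Python sketch. The function names are hypothetical, and `ideal_hit_ratio` rests on additional assumptions not made in the simulation settings: requests are independent and a cache holds exactly the most popular objects.

```python
def zipf_popularity(num_objects, alpha):
    """Request probability of each object by popularity rank (rank 1 first)."""
    weights = [rank ** -alpha for rank in range(1, num_objects + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


def cache_slots(num_objects, fraction):
    """Cache size in objects when expressed as a fraction of the catalog."""
    return round(num_objects * fraction)


def ideal_hit_ratio(popularity, slots):
    """Hit ratio of a single cache that holds the `slots` most popular objects."""
    return sum(popularity[:slots])


popularity = zipf_popularity(200, 0.8)
smallest = cache_slots(200, 0.01)    # 1% of 200 equal-size objects
largest = cache_slots(200, 0.10)     # 10% of 200 equal-size objects
```

With 200 equal-size objects, the 1% and 10% settings correspond to caches of 2 and 20 objects respectively, and `ideal_hit_ratio(popularity, slots)` gives a rough upper reference for how much a single cache of that size could absorb.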

2) PERFORMANCE EVALUATION RESULTS
Fig. 5 shows the average response hops of different cache policies when the content popularity varies. From Fig. 5, we can observe that the content popularity affects the delay of each cache scheme. When the content popularity increases, the number of popular content objects changes in the same way, which improves the average response hops of each solution. It also narrows the gap among the schemes, because the influence of different cache policies is weakened at higher content popularity. However, the proposed solution always performs better, because its cache decisions are made based on global knowledge, which allows the network to adapt better to changes.
Fig. 6 shows the average response hops of different cache policies when the cache size varies. From Fig. 6, we can observe that the cache size affects the delay of each cache scheme. When the cache size increases, more popular content is cached, which reduces the average response hops of each solution. It also widens the gap among the schemes, because the influence of different cache policies becomes more obvious with a larger memory. However, the proposed solution always performs better, because its cache decisions are made based on global knowledge, which allows the network to cache the optimal content in the extra buffer.

VI. CONCLUSIONS AND FUTURE WORK
In this paper, we have proposed a novel data-driven network architecture, in which in-network caching is added to the infrastructure layer of SDN and a big data analysis module is added to the control layer of SDN. In particular, each SDN switch has caching capability to facilitate efficient content distribution. With the help of the centralized SDN controller, the system with big data analysis can be aware of information about users, content and the network so as to realize optimal resource allocation, efficient content distribution and flexible network configuration. Simulation results were presented to show that this novel network architecture can efficiently improve network performance compared with traditional schemes. Future work is in progress to consider cloud/fog computing in the proposed network architecture.

FIGURE 1. Architecture reference model: a three layer model, ranging from the infrastructure layer to the control layer to the application layer.
III. DATA FLOW IN DATA-DRIVEN NETWORKING
The Data-Driven Networking paradigm runs on its data flows. These flows are the most notable difference from existing works that combine CCN or ICN with SDN, namely the flows from the SDN smart cache switch to the big data analysis module, from the big data analysis module to the SDN controller, and from the SDN controller to the SDN smart cache switch. Fig. 2 indicates only the new data flows in the DDN system. Some original data flows in traditional SDN, such as those from the data plane to the control plane, are not shown. In what follows we describe these flows in detail.

FIGURE 3. The big data processing platform.

FIGURE 4. The procedure of data cleaning.
• Network data: it includes the physical and topological states of the network, especially flow tables. A flow table consists of header fields, counters and actions. The counters support statistics at four granularities: flow table, flow, port and queue. They are used to count traffic information, such as active entries, packet lookups and received bytes. The statistics in the flow table are shown in Table 2.

TABLE 2. The statistics in flow table.