An Approach Based on Knowledge-Defined Networking for Identifying Heavy-Hitter Flows in Data Center Networks

Abstract: Heavy-Hitters (HHs) are large-volume flows that consume considerably more network resources than other flows combined. In SDN-based DCNs (SDDCNs), HHs cause non-trivial delays for small-volume, delay-sensitive flows known as non-HHs. Uncontrolled forwarding of HHs leads to network congestion and overall network performance degradation. A pivotal task for controlling HHs is their identification. The existing methods to identify HHs are threshold-based. However, such methods lack a smart system that efficiently identifies HHs according to the network behaviour. In this paper, we introduce a novel approach to overcome this lack and investigate the feasibility of using Knowledge-Defined Networking (KDN) in HH identification. KDN, by using Machine Learning (ML), allows integrating behavioural models to detect patterns, like HHs, in SDN traffic. Our KDN-based approach mainly includes three modules: the HH Data Acquisition Module (HH-DAM), the HH Data ANalyser Module (HH-DANM), and the HH APplication Module (HH-APM). In HH-DAM, we present the flowRecorder tool for organising packets into flow records. In HH-DANM, we perform a cluster-based analysis to determine an optimal threshold for separating HHs and non-HHs. Finally, in HH-APM, we propose the use of MiceDCER for routing non-HHs efficiently. The per-module evaluation results corroborate the usefulness and feasibility of our approach for identifying HHs.


Introduction
Today's data centre networks (DCNs) have become an efficient and promising infrastructure to support a wide range of technologies, network services, and applications such as multimedia content delivery, search engines, e-mail, map-reduce computation, and virtual machine migration [1]. Despite the well-known capabilities of DCNs, scalability and traffic management are still challenges. Both are growing in complexity, especially due to the exponentially increasing volume and heterogeneity of network traffic. DCNs rely on networking paradigms such as Network Virtualization and Software-Defined Networking (SDN), which are promising solutions to address these challenges [1]. In particular, SDN separates the control plane from the data plane, enabling a logically centralised control of network devices [2,3]. By moving the control logic from the forwarding devices to a logically centralised device, known as the Controller, network devices become simple forwarders.

Background
In this section, we give a brief background on the main domains our approach covers. We start with the HH definition in Section 2.1. Then, Section 2.2 provides a brief overview of the SDN paradigm, its architecture, and main features. Section 2.3 presents a brief SDDCN overview, while Section 2.4 describes ML. Finally, in Section 2.5, we introduce KDN.

Heavy-Hitter Flows
A flow is defined as a set of packets passing an observation point during a specific time interval [26]. Packets sharing certain attributes belong to the same flow. Usually, such attributes are the source and destination IP addresses, the source and destination port numbers, and the protocol identifier. Studies about flow behaviour show that a very small percentage of flows carry the bulk of the traffic, especially in DCNs. These flows are often termed HH flows [27][28][29]. Generally, HHs can be classified using different metrics based on their duration, size, rate, and burstiness. Each flow can be classified into one of two groups: HH and non-HH. This dichotomy of flow types is achieved using a threshold, which varies depending on the classification metric used.
Lan and Heidemann [28] provide a definition of flow types within each category with a zoological flair. For instance, flows with a duration longer than a certain time period are tortoises; otherwise, they are termed dragonflies. Flows with a size larger than s bytes are elephants; in turn, mice are flows with a size less than or equal to s. Cheetahs are flows with a rate greater than r Bps, while snails are flows with a rate less than or equal to r. Flows with burstiness greater than b ms are called porcupines, while those with burstiness less than or equal to b are stingrays. Table 1 summarises the main characteristics of HH flows as described by Lan and Heidemann [28]. In general, tortoise flows do not consume a lot of bandwidth. Elephant flows are long-lived and have a large size, but they are neither fast nor bursty. Cheetahs are small and bursty. The occurrence of porcupine flows is very likely due to the increasing trends in downloading large files over fast links. The rest of the paper uses the flow size feature for defining HHs. Table 1. Taxonomy of Heavy-Hitter (HH) flows as per Lan and Heidemann [28].
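As an illustration of this taxonomy, the sketch below labels one flow along the four metrics. The threshold values (t, s, r, b) are illustrative placeholders, not values taken from Lan and Heidemann.

```python
# A minimal sketch of the zoological flow taxonomy. The default
# thresholds below are illustrative placeholders only.
def classify_flow(duration_s, size_bytes, rate_bps, burst_ms,
                  t=60.0, s=100_000, r=1_000_000, b=10.0):
    """Return the four taxonomy labels for one flow."""
    return {
        "duration": "tortoise" if duration_s > t else "dragonfly",
        "size": "elephant" if size_bytes > s else "mouse",
        "rate": "cheetah" if rate_bps > r else "snail",
        "burstiness": "porcupine" if burst_ms > b else "stingray",
    }

# A long, large, fast, non-bursty flow.
labels = classify_flow(duration_s=120, size_bytes=5_000_000,
                       rate_bps=8_000_000, burst_ms=2)
```

Note that the four labels are independent: the same flow can be an elephant by size and a stingray by burstiness, which is why the rest of the paper fixes one metric (flow size) for defining HHs.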

Software-Defined Networking
The SDN architecture comprises four planes [30][31][32]: the Data Plane, Control Plane, Application Plane, and Management Plane. The Data Plane includes the interconnected forwarding devices. These devices are typically composed of programmable forwarding hardware. Furthermore, they have no local knowledge of the network and rely on the Control Plane to populate their forwarding tables and update their configuration.
The Control Plane consists of one or more NorthBound Interfaces (NBIs), the SDN Controller, and one or more SouthBound Interfaces (SBIs). NBIs allow the Control Plane to communicate with the Application Plane and provide the abstract network view for expressing the network behaviour and requirements. The Controller is responsible for programming the forwarding elements via SBIs. An SBI allows communication between the Control Plane and the Data Plane by providing programmatic control of all forwarding operations, capabilities advertisement, and statistics reporting [33]. The Application Plane includes network programs that explicitly, directly, and programmatically communicate their requirements and desired network behaviour to the SDN Controller via NBIs. Finally, the Management Plane ensures accurate network monitoring to provide critical network analytics. For this purpose, the Management Plane collects telemetry information from the Data Plane while keeping a historical record of the network state and events [34].

Software-Defined Networking Data Centre Networks
A typical DCN comprises a conglomeration of network elements that ensures the exchange of traffic between machines/servers and the Internet. These network elements include servers that manage workloads and respond to different requests, switches that connect devices, routers that perform packet forwarding functions, and gateways that serve as the junctions between the DCN and the Internet [1]. Despite the importance of DCNs, their architecture is still far from being optimal. Traditionally, DCNs use dedicated servers to run applications, resulting in poor server utilisation and high operational cost. To overcome this situation, server virtualisation technologies emerged, allowing multiple virtual machines (VMs) to be allocated on a single physical machine. These technologies can provide performance isolation between collocated VMs to improve application performance and prevent interference attacks. However, server virtualisation by itself is insufficient to address all the limitations of scalability and of managing the growing traffic in DCNs [1,35].
Motivated by the aforementioned limitations, there is an emerging trend towards the use of networking paradigms such as SDN in DCNs, also known as SDDCNs. An SDDCN combines virtualised compute, storage, and networking resources with a standardised platform for managing the entire integrated environment. Following Faizul Bari et al. [1], the major foundations of an SDDCN are:

• Network virtualisation: combines network resources by splitting the available bandwidth into independent channels that can be assigned or reassigned to a particular server or device in real-time.

• Storage virtualisation: pools the physically available storage capacity from multiple network devices. The storage virtualisation is managed from a central console.

• Server virtualisation: masks server resources from server users. The intention is to spare users from managing complicated server-resource details. It also increases resource sharing and utilisation while keeping the ability to expand capacity.

Figure 1 shows an SDDCN with a conventional topology. In this SDDCN, the controller (or set of controllers) and the network applications running on it are responsible for handling the data plane. This plane includes a fat-tree topology composed of Top-of-Rack (ToR), Edge, Aggregation, and Core switches. Each ToR switch in the access layer provides connectivity to the servers mounted on its rack. Each aggregation switch in the aggregation layer (sometimes referred to as the distribution layer) forwards traffic from multiple access layer switches to the core layer. Every ToR switch is connected to multiple aggregation switches for redundancy. The core layer provides secure connectivity between aggregation switches and core routers connected to the Internet [6].

Machine Learning
ML includes a set of methods that can automatically detect patterns in data, aiming to use the uncovered patterns to predict future data and, consequently, to facilitate decision-making processes [36]. In the networking context, several possibilities arise from using ML: (i) forecasting network behaviour [37,38], (ii) anomaly detection [39,40], (iii) traffic identification and flow classification [17,41], and (iv) adaptive resource allocation [42,43]. For more information about the potential of ML in the networking field, we refer the reader to Ayoubi et al. [44] and Boutaba et al. [45].
Overall, ML can be divided into Supervised Learning (SL), Unsupervised Learning (UL), Semi-Supervised Learning (SSL), and Reinforcement Learning (RL). SL focuses on modelling input/output relationships through labelled training datasets. The training data consist of a set of attributes and an objective variable, also called the class [44]. Typically, SL is used to solve classification and regression problems that pertain to forecasting outcomes, for instance, the prediction of traffic [46], end-to-end path bandwidth [47], or link load [48]. Unlike SL, UL uses unlabelled training datasets to create models that can discriminate patterns in the data. UL can highlight correlations in the data that the administrator may be unaware of. This kind of learning is best suited for clustering problems, for instance, flow feature-based traffic classification [49], packet loss estimation [50], and resource allocation [51].
SSL occupies the middle ground between supervised learning (in which all training data are labelled) and unsupervised learning (in which no labelled data are given) [52]. Interest in SSL has increased in recent years, particularly because of application domains in which unlabelled data are plentiful, such as the classification of network data using very few labels [53], network traffic classification [54], and network verification [55]. RL is an iterative process in which an agent aims to discover which actions lead to an optimal configuration. To discover these actions, the agent observes the state of the environment and takes actions that produce state changes. For each action, the agent may or may not receive a reward, depending on how good the action taken was [56]. RL is suited for making cognitive choices, such as decision making, planning, and scheduling, for instance, routing schemes for delay-tolerant networks [50], as well as multicast routing and congestion control mechanisms [57].
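As a toy illustration of the supervised setting described above, the sketch below classifies flow records with a 1-nearest-neighbour rule. The flow features and training values are fabricated for illustration and are not taken from the paper.

```python
import math

# Toy supervised learning: 1-nearest-neighbour classification of flows
# described by (size_bytes, packet_count). The labelled training data
# below are fabricated for illustration only.
train = [
    ((500, 4), "non-HH"), ((900, 7), "non-HH"),
    ((50_000, 120), "HH"), ((200_000, 500), "HH"),
]

def predict(x):
    # Label a new flow with the class of its closest training example.
    return min(train, key=lambda p: math.dist(p[0], x))[1]

print(predict((150_000, 300)))  # close to the large flows -> "HH"
print(predict((700, 5)))        # close to the small flows -> "non-HH"
```

The unsupervised alternative, clustering the same records without labels, is what the paper's HH-DANM module later applies to the UNIV1 dataset.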

Knowledge-Defined Networking
In 2003, Clark et al. [58] suggested the addition of a Knowledge Plane (KP) to the traditional computer network architecture formed by the Control Plane and Data Plane [33]. KP adopts Artificial Intelligence to perform tasks characteristic of human intelligence, i.e., building systems with the ability to reason, discover meaning, generalise, or learn from past experiences [36]. To achieve these abilities, KP proposed the use of ML techniques that offer advantages to networking, such as process automation (recognise-act), recommendation systems (recognise-explain-suggest), and data prediction. These advantages bring the possibility of a smart network operation and management [34].
Nowadays, the possibility of improving the way to operate, optimise, and troubleshoot computer networks by using KP is nearer than it was fifteen years ago, for two main reasons. Firstly, SDN offers a full network view via a logically centralised Controller. Furthermore, SDN improves the network control functions that facilitate gathering information about the network state in real-time [34]. Secondly, the capabilities of network devices have significantly improved, facilitating the real-time gathering of information about packets, processing time, and flow granularity [33].
The addition of KP to the traditional SDN architecture is called KDN [34]. It comprises four planes: the Data Plane, Control Plane, Management Plane, and KP. In the Data Plane, the forwarding network devices generate metadata. The Control Plane provides the interfaces to receive instructions from the KP; the Controller then transmits these instructions to the forwarding devices. Furthermore, the Control Plane sends metadata about the network state to the Management Plane. In the Management Plane, the metadata from the Control and Data Planes are collected and stored. The Management Plane provides a basic analysis of per-flow and per-forwarding-device statistics to the KP. Finally, the KP sends to the Controller one or a set of instructions about what the network is supposed to do [59].
KDN works by employing a control loop. Formally, a control loop can be described as a system used to maintain a desired output in spite of environmental disturbances. Overall, the components of a control loop include a Data Acquisition Module (DAM), a Data Analyser Module (DANM), and an APplication Module (APM). In KDN, the DAM comprises the Data and Control Planes, the DANM contains both the Management Plane and the ML Module from the KP, and the APM includes the Decision Module from the KP [59].
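The control loop described above can be sketched as three cooperating components. All module interfaces below are hypothetical simplifications for illustration, not an actual KDN implementation.

```python
# A minimal sketch of the KDN control loop: the DAM collects metadata,
# the DANM analyses it into a decision, and the APM turns the decision
# into controller instructions. All interfaces here are illustrative.
class DAM:
    def collect(self):
        # In KDN this would pull telemetry from the Data/Control Planes.
        return [{"flow": "10.0.0.1->10.0.0.2", "bytes": 9_000}]

class DANM:
    def analyse(self, records, threshold=7_000):
        # The ML Module would sit here; a fixed byte threshold stands in.
        return [r for r in records if r["bytes"] > threshold]

class APM:
    def decide(self, heavy_hitters):
        # The Decision Module emits instructions for the Controller.
        return [f"reroute {hh['flow']}" for hh in heavy_hitters]

dam, danm, apm = DAM(), DANM(), APM()
instructions = apm.decide(danm.analyse(dam.collect()))
print(instructions)  # -> ['reroute 10.0.0.1->10.0.0.2']
```

The loop closes when the Controller applies these instructions to the forwarding devices, whose new behaviour is observed again by the DAM on the next iteration.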

Related Work
In SDN, HHs identification has been addressed from different network places: the Controller, the Switch, and the Host. The Controller-based approaches compare the flow size statistics with a static, predefined threshold to identify HHs. There are two approaches to obtain the flow size statistics: pulling and sampling. In pulling, the central Controller maintains statistics (e.g., packets, bytes, and duration time) provided by OpenFlow [15,19,60]. Also, the Controller maintains the network state by periodically sending Read-State messages to the Data Plane. Sampling reduces the original traffic data by extracting a representative part of it. The sampling of flows is a trade-off between data reduction and preserving the details of the original data [61].
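The data-reduction trade-off of sampling can be illustrated with a simple 1-in-N packet sampler. The traffic mix below is fabricated; the point is that size-agnostic sampling can badly distort per-flow byte estimates when a few large packets dominate.

```python
# 1-in-N packet sampling: keep every Nth packet and scale counts by N.
# Because sampling ignores packet size, the few large packets may be
# over- or under-represented, skewing per-flow byte estimates.
def sample_bytes(packet_sizes, n):
    kept = packet_sizes[::n]          # every nth packet
    return sum(kept) * n              # scale the sample back up

packets = [1500] * 2 + [64] * 98      # two large packets among many small
true_bytes = sum(packets)             # 9272 bytes in reality
estimate = sample_bytes(packets, 10)  # here the 1500 B packets dominate
```

In this particular trace the sampler happens to catch one of the two large packets and scales it by 10, so the estimate (20 760 bytes) more than doubles the true volume; shifting the large packets by one position would instead hide them entirely.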
In the literature, the switch-based approach moves the task of HHs identification from the Controller to the switches. This identification introduces new functionalities in switches to record flow size statistics. Then, the flow sizes are compared with a static, predefined threshold at the end-switch [2,62]. In host-based identification, when the measurement of a flow (e.g., socket buffer and flow size) exceeds a previously set threshold value, the detector determines whether the flow is a HH or not [24,35,63]. It is important to highlight that the use of a static, predefined threshold offers a rapid identification but low accuracy when the traffic is dynamic and grows suddenly. Table 2 summarises the main shortcomings of the different approaches [2,24,35,63]. In short, the HHs identification approaches that use per-flow statistics collection in SDDCNs [15][16][17][18][19] yield both relatively high overhead and low granularity of traffic flow semantics. To avoid network overload and latency, some approaches, such as [17,18,20], utilise sampling for data collection. Unfortunately, since sampling does not take the packet size into account, large packets can be missed, resulting in a large error in the HHs identification. Usually, the switch-based approaches can only be realised by hardware modifications [21]. This fact contradicts the softwarisation principle of SDN.
On the other hand, although host-based approaches reduce the overhead on the network, they require modification of the operating system in each host, leading to scalability issues [22][23][24].

KDN-Based Heavy-Hitter Identification Approach
This section details our approach. An overview is presented in Section 4.1. The architecture and modules are introduced and evaluated in Section 4.2.

System Overview
The current methods to identify HHs are threshold-based. However, such methods lack a smart system that efficiently identifies HHs according to the network behaviour. Hereinafter, we introduce an approach to overcome this lack and investigate the feasibility of using KDN to identify HHs. Figure 2 overviews our approach as a KDN control loop, including three modules: HH-DAM, HH-DANM, and HH-APM. Overall, HH-DAM sends the captured packets to HH-DANM, which is responsible for storing them and generating a network traffic state model. Finally, the Controller gets the awareness from the HHs Flag sub-module in HH-APM and instructs the new network configurations to the forwarding devices. At a high abstraction level, our approach operates as follows.

1. Forwarding Devices → Packet Observation. Packet Observation captures packets from Observation Points in the network devices, e.g., line cards or interfaces of packet forwarding devices. Before being sent to the Data Collector, the packets can be pre-processed through sampling and filtering rules.

2. Packet Observation → Data Collector. In the Data Collector, the packets provided by HH-DAM are organised and stored into flows. This Collector aims at gathering enough information to offer a global view of the network behaviour.

3. Data Collector → Heavy-Hitters Model [34]. For instance, the model can learn adaptively according to traffic changes and find the optimal configuration for routing HHs, thus avoiding congestion.
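The first two steps can be illustrated by how a Data Collector might key observed packets into flow records using the usual 5-tuple. The packet representation below is an assumption for illustration, not the format used in our implementation.

```python
from collections import defaultdict

# Group observed packets into flow records keyed by the canonical
# 5-tuple (src IP, dst IP, src port, dst port, protocol).
def flow_key(pkt):
    return (pkt["src_ip"], pkt["dst_ip"],
            pkt["src_port"], pkt["dst_port"], pkt["proto"])

flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
observed = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
     "src_port": 40000, "dst_port": 80, "proto": 6, "len": 1500},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
     "src_port": 40000, "dst_port": 80, "proto": 6, "len": 1500},
]
for pkt in observed:
    rec = flows[flow_key(pkt)]
    rec["packets"] += 1
    rec["bytes"] += pkt["len"]
# Both packets share the 5-tuple, so one flow record accumulates them.
```

Aggregating in this way is what gives the Data Collector the per-flow view (packet and byte counts) that the later analysis steps consume.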


Heavy-Hitters Data Acquisition Module
HH-DAM is responsible for performing two tasks. Firstly, it captures packets from some observation point in the network devices; that means a monitoring task needs to be carried out to collect and report network states in the data plane. Secondly, it generates a dataset for HH identification from the collected packets. In HH-DAM, since we do not implement the monitoring task, we build the HH-identification dataset from a publicly accessible traffic trace collected in a university DCN, named UNIV1 [64]. UNIV1 was processed and organised into flow records by using the flowRecorder tool with an expiration time (f_ito) set to 15 s and 150 s. We wrote this tool in Python to turn IP packets, either in the form of PCAP (Packet CAPture) files or sniffed live from a network interface, into flow records that are stored in a CSV (Comma-Separated Values) file. Our flowRecorder supports the measurement of flow features in both unidirectional and bidirectional modes. Depending on the properties of the observed (incoming) packets, either a new flow record is created or the features of an existing one are updated. Tables 3 and 4 show the flow size distributions of the UNIV1 dataset obtained using flowRecorder with f_ito = 15 s and f_ito = 150 s. For details on this tool, we refer the reader to [25]. Regarding the monitoring task, it is important to highlight that, to avoid overloading the SDN Controller with the traffic and processing caused by gathering statistics from a central point, and aiming at better performance in collecting and reporting the network flow state, we plan in future work to use In-band Network Telemetry (INT) [65] by means of the Programmable, Protocol-independent Packet Processor (P4). P4 was created as a common language to describe how packets should be processed by all manner of programmable packet-processing targets, from general-purpose CPUs and NPUs to FPGAs and high-performance programmable ASICs [59].
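The role of the idle expiration time (f_ito) can be sketched as follows: a flow record is closed once no packet has been seen for f_ito seconds. This is a simplified illustration of the timeout mechanism, not the actual flowRecorder code.

```python
# Sketch of flow expiry with an idle timeout (f_ito): a flow record is
# closed when no packet has been seen for f_ito seconds. This is an
# illustration only, not the flowRecorder implementation.
F_ITO = 15.0  # seconds, matching one of the settings used for UNIV1

def expire(flows, now):
    """Split flow records into (still_active, expired)."""
    active, expired = {}, []
    for key, rec in flows.items():
        if now - rec["last_seen"] > F_ITO:
            expired.append((key, rec))   # would be written to the CSV
        else:
            active[key] = rec
    return active, expired

flows = {"f1": {"last_seen": 100.0}, "f2": {"last_seen": 110.0}}
active, expired = expire(flows, now=116.0)
# f1 idle for 16 s > 15 s -> expired; f2 idle for 6 s -> still active
```

A larger f_ito (e.g., 150 s) merges bursts separated by long gaps into a single record, which is why the two settings yield different flow size distributions in Tables 3 and 4.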
The main advantages of P4 are: (i) protocol independence meaning that collecting network devices information should be protocol agnostic, (ii) target independence indicating programmers should be able to describe packet-processing functionality independently of the underlying hardware; and (iii) reconfigurability in the field highlighting programmers should be able to change the way switches process packets after their deployment in the network.

Heavy-Hitters Data Analyser Module
HH-DANM is in charge of storing and generating a network traffic state model targeted at identifying HHs smartly. To carry out this module, we performed a clustering-based analysis on the UNIV1 dataset [64], targeted at determining the optimal threshold that would separate the flows into HHs and non-HHs. We performed this analysis since there is no generally accepted and widely recognised uniform threshold for HHs detection. Indeed, different works, such as Xu et al., use different thresholds without detailed or systematic justification. Cluster analysis belongs to the unsupervised ML techniques. It examines unlabelled data by either constructing a hierarchical structure or forming a set of groups. Data points belonging to the same cluster exhibit features with similar characteristics. In HH-DANM, we decided to use K-means mainly because of its simplicity, speed, and accuracy. In addition, several research works report on its high efficacy when deployed on network traffic data [50].
There are several methods to determine the optimal number of clusters that K-means needs to operate, such as the V-measure, the Adjusted Rand Index, the V-score, and Homogeneity [72]. However, these methods are usually used with labelled datasets. Since our datasets are not labelled, we decided to use the Silhouette score, which does not require labelled data. In addition, the Silhouette method was also shown to be an effective approach for determining the number of clusters in data, as well as for validation, by Bishop [36] and Estrada-Solano [73]. In HH-DANM, we applied the Silhouette method by varying k from 2 to 15. This method uses a coefficient (S_i) that measures how well a point is clustered and estimates the average distance between clusters. The values of S_i range between -1 and 1; the closer the values are to 1, the better the points are clustered. As Figure 4 shows, in our analysis the S_i range is quite wide, with values that vary between 0.6 and the maximum, 0.99. For both f_ito = 15 s and f_ito = 150 s, the possible optimal k lies between k = 2 and k = 11.
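For reference, the Silhouette coefficient can be computed directly from its definition, s_i = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance from point i to its own cluster and b_i the mean distance to the nearest other cluster. The sketch below does this for two hand-picked one-dimensional clusters of flow sizes; it is a definitional illustration, not the K-means pipeline applied to UNIV1.

```python
# Silhouette score computed from its definition for pre-assigned
# 1-D clusters. Well-separated clusters score close to 1.
def silhouette(clusters):
    scores = []
    for ci, cluster in enumerate(clusters):
        for i, x in enumerate(cluster):
            same = [abs(x - y) for j, y in enumerate(cluster) if j != i]
            a = sum(same) / len(same) if same else 0.0
            b = min(sum(abs(x - y) for y in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated groups of flow sizes (bytes) score near 1,
# mirroring the high S_i values observed for small k in Figure 4.
small, large = [100, 120, 140], [90_000, 95_000, 100_000]
score = silhouette([small, large])
```

Note that a high score only certifies geometric separation; as discussed next, the k = 2 solution with the best score still leaves the two clusters severely imbalanced in size.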
Both UNIV1 datasets, with f_ito = 15 s and f_ito = 150 s, got the highest S_i for k = 2. With k = 2, there is a significant distance between the classes, as Figure 5b,f show through Principal Component Analysis (PCA), which performs a linear transformation resulting in principal components [36]. However, despite the promising S_i value and the evident distance between the classes, the imbalance between the clusters is also evident, i.e., one cluster contains a large number of flows while the other contains extremely few. The specific numbers of this distribution are provided in Table 5. This distribution can produce an important deterioration of the classifier performance, in particular with patterns belonging to the less represented classes.

Another promising S_i value is for k = 5, for which it is almost 1 in both datasets. We applied K-means clustering with k = 5; Figure 5c,g show the corresponding PCA. Our analysis focuses on the relationship between statistical features of the flows, in particular their size, number of packets, and duration; Table 5 summarises this relationship. While k = 5 provides better results than clustering with k = 2 in terms of flow distribution, the imbalance is still noticeable. The results show class I as the dominant class, while classes II to V seem to be outliers. This distance is evident from Figure 5c,g as well as from Table 6.

Table 6. Flow sizes and numbers of packets obtained using k = 5 with f_ito = 15 s and f_ito = 150 s.

We also performed the cluster analysis using k = 10. The new clusters seem to appear by splitting the old classes, in particular classes II to V. However, as Figure 5d,h show, the flows belonging to class I seem to keep the same shape and number of flows. This motivated us to analyse class I in detail. The analysis of the flows belonging to class I followed the same steps applied previously, and we decided to use k = 5. Table 7 summarises the results obtained. Overall, the results show there is no clear threshold that separates flows into HHs and non-HHs. This is because the flow sizes have a diverse character that leads to more than two natural clusters. We stress that the threshold selection should include a detailed analysis of the network and its traffic. However, a pattern in clusters I, II, IV, and V regarding flow size and number of packets is evident, as Table 7 shows. Therefore, we suggest the use of the following thresholds for HH identification in traffic similar to UNIV1 (Duque-Torres et al. [74]): flow size θ_s = 7 KB and packet count θ_pkt = 14. Aiming at corroborating the applicability of our proposal, we evaluated the time that HH-DANM spends to identify the flows in UNIV1 by using the above-defined thresholds. As such, we analysed the time required to get 14 packets with an accuracy of over 96%. Table 8 summarises the results obtained, provides statistical information about the flow duration for the first 14 packets, and reveals diverse facts. Firstly, the majority of flows can be identified in a short time; in particular, this time is less than 0.9 s for 80% of flows for both f_ito = 15 s and f_ito = 150 s. Secondly, approximately 95% of flows are classified in less than 6 s for both timeouts. Thirdly, for both timeouts, roughly only 4% of flows are identified in a time higher than 16 s.
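One reading of these thresholds as an online identification rule is sketched below: a flow is flagged as a HH as soon as it exceeds θ_s = 7 KB in size or θ_pkt = 14 packets, whichever happens first. The early-exit logic is our illustration, not the exact HH-DANM procedure.

```python
# Sketch of online HH identification with the thresholds suggested for
# UNIV1-like traffic: flow size theta_s = 7 KB, packet count
# theta_pkt = 14. The early-exit rule is illustrative.
THETA_S = 7 * 1024   # bytes
THETA_PKT = 14       # packets

def is_heavy_hitter(packet_sizes):
    total_bytes = 0
    for count, size in enumerate(packet_sizes, start=1):
        total_bytes += size
        if total_bytes > THETA_S or count > THETA_PKT:
            return True   # flagged before the flow finishes
    return False

is_heavy_hitter([1500] * 10)   # exceeds 7 KB on the 5th packet -> True
is_heavy_hitter([64] * 12)     # 768 B and 12 packets -> False
```

Checking the flow packet by packet is what makes the identification times in Table 8 possible: most flows reach one of the two thresholds, or terminate, within the first second.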
Fourthly, for both f_ito = 15 s and f_ito = 150 s, just 2% of flows are identified in a time higher than 23 s. Finally, some flows (worst cases) required up to 480 s to collect the required minimum volume of flow data and/or packets. Considering these facts, we argue that our proposal is applicable in real scenarios. Furthermore, it is possible to reduce this time by decreasing the accuracy, that is, by establishing a trade-off between accuracy and identification time. Finally, in the evaluation of HH-DANM, we compared its True Positive Rate (TPR) with that of the classification techniques used by Poupart et al. [75]. In particular, they used Neural Networks (NN), Gaussian Process Regression (GPR), and online Bayesian Moment Matching (oBMM) to classify HHs. Table 9 shows the comparison results, revealing that HH-DANM (using the suggested values, flow size θ_s = 7 KB and packet count θ_pkt = 14 [76]) achieves results similar to the approaches based on NN, GPR, and oBMM. However, there are some significant considerations. Firstly, unlike the GPR- and NN-based approaches, which do not hold their performance when the threshold changes, our approach achieves the same performance regardless of the threshold. Secondly, the approaches based on NN and oBMM tend to be affected by class imbalance more than the GPR-based ones, which explains why their accuracy often suffers as the classification threshold increases. In HH-DANM, class imbalance is not a concern since it uses the same number of packets to identify any flow.

Heavy-Hitters Application Module
HH-APM is responsible for sending instructions about what the network (i.e., the SDN controller) needs to do when an HH or non-HH is identified. Once the HHs are identified, the Controller instructs the forwarding devices to perform a variety of network management activities aiming at improving the overall network performance. To carry out HH-APM and route the identified non-HHs, we propose the use of MiceDCER (Mice Data Center Efficient Routing) [35].
MiceDCER is an algorithm, proposed by our research group in previous work [35], that efficiently routes non-HHs in SDDCNs by assigning internal Pseudo-MAC (PMAC) addresses to the ToR switches and hosts. MiceDCER generates the PMACs of the switches that receive the flows intercepted by the controller based on their position in the topology. PMACs are stored in a table that associates them with the corresponding actual MAC (Media Access Control) address of each switch. Also, MiceDCER reduces the number of routing rules by installing wildcard rules based on the information carried by ARP packets. At a high abstraction level, MiceDCER-based HH-APM performs three significant procedures: Generation of initial rules for the edge switches, Intercepted message management, and Generation of table entries. Figure 6 shows the flowchart of each procedure performed by MiceDCER. The Generation of initial rules for the edge switches is carried out by installing routing rules with the ARP field type on the edge switches. These initial rules allow the controller to intercept the ARP messages that arrive at a switch. To install rules in the switch tables, the algorithm performs the Intercepted message management procedure. In this procedure, the destination IP address of the intercepted ARP request message is verified. If the controller does not recognise this address, it instructs the switch that received the intercepted message, as well as the other edge switches, to flood the request. If the controller recognises the IP address, it sends an ARP reply message back to the source host. The Generation of table entries procedure has two major processes. Firstly, it verifies whether the source IP address of the intercepted message is stored in the host PMAC table.
Secondly, if the source IP (Internet Protocol) address is not stored, this procedure proceeds to generate the PMAC and save it into the table, associating it with the source IP address. Finally, it is important to mention that MiceDCER asks the controller to update the defined rules and generate new PMACs when topology modifications occur in the network.
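The Generation of table entries procedure can be sketched as follows. This is a rough illustration only: the positional PMAC layout (pod:position:port:host-id fields) is our assumption for the sketch, not MiceDCER's actual encoding, and the function names are hypothetical.

```python
# Rough sketch of the "Generation of table entries" procedure: on an
# intercepted ARP request, look up the source IP in the host-PMAC table;
# if absent, derive a PMAC from the host's topology position and store
# the (IP -> PMAC) association. The pod:position:port:host-id layout
# below is an illustrative assumption, not MiceDCER's actual format.

pmac_table = {}  # source IP -> pseudo-MAC

def generate_pmac(pod, position, port, host_id):
    """Encode the host's topology position into a 48-bit pseudo-MAC."""
    return "{:02x}:{:02x}:{:02x}:{:02x}:{:02x}:{:02x}".format(
        pod >> 8, pod & 0xFF, position, port, host_id >> 8, host_id & 0xFF)

def handle_arp_request(src_ip, pod, position, port, host_id):
    """Return the PMAC for src_ip, generating and storing it if unknown."""
    if src_ip not in pmac_table:   # first process: table lookup
        pmac_table[src_ip] = generate_pmac(pod, position, port, host_id)
    return pmac_table[src_ip]      # second process: reuse the stored entry
```

A repeated request from the same source IP hits the stored entry, so a PMAC is generated only once per host until a topology change forces regeneration.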
Aiming at evaluating our MiceDCER-based HH-APM, we compared MiceDCER with IP-based and MAC-based routing. The evaluation was performed on a FatTree topology that consists of p pods, which are management units interconnected by the (p/2)² switches that make up the core layer. Each pod consists of p/2 edge switches and p/2 aggregate switches, and each ToR switch connects to p/2 end hosts [63]. Therefore, a p-ary FatTree topology supports p³/4 hosts. Table 10 summarises the number of rules generated by MiceDCER, IP-based, and MAC-based routing when p is 16, 20, 24, 28, 32, 36, 40, and 48. The evaluation results reveal that, in the edge layer, MAC-based and IP-based routing install more rules than our HH-APM running MiceDCER. Considering these results, we can conclude that MiceDCER significantly reduces the number of rules per edge switch when compared with the other routing solutions. The results also reveal that, in the aggregate layer, MAC-based routing installs many more rules than MiceDCER, and IP-based routing installs approximately twice as many rules as MiceDCER. Thus, we can conclude that MiceDCER significantly reduces the number of rules per aggregate switch when compared with MAC-based and IP-based routing. In the core layer, the results again show that MAC-based routing installs more rules than MiceDCER, while IP-based routing installs about the same number of rules as MiceDCER. We can conclude that MiceDCER reduces, or at least generates the same number of, routing rules to install in the core switches. For more information about MiceDCER, its design, implementation, and evaluation, we refer the reader to Amezquita-Suarez [35].
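The FatTree sizes used in this evaluation follow directly from the standard p-ary FatTree formulas; a small helper (ours, for illustration) computes them for any even p:

```python
# Component counts of a p-ary FatTree, from the standard formulas:
# (p/2)^2 core switches; p pods, each with p/2 edge and p/2 aggregate
# switches; and p/2 hosts attached to each edge (ToR) switch, giving
# p^3/4 hosts in total.

def fattree_sizes(p):
    """Return the component counts of a p-ary FatTree (p must be even)."""
    assert p % 2 == 0, "p-ary FatTree requires an even p"
    half = p // 2
    return {
        "core_switches": half ** 2,
        "edge_switches": p * half,       # p pods x p/2 edge switches
        "aggregate_switches": p * half,  # p pods x p/2 aggregate switches
        "hosts": p ** 3 // 4,            # p/2 hosts x (p * p/2) edge switches
    }
```

For instance, fattree_sizes(16) yields 64 core switches, 128 edge switches, 128 aggregate switches, and 1024 hosts, matching the smallest configuration evaluated in Table 10.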

Conclusions
Considering the current trends in networking, a let-up in data expansion is unlikely. On the contrary, the changes in traffic patterns will be amplified. To adequately support the demand and scale of continuously increasing workloads, as well as business flexibility and agility, heavy-hitter traffic flow classification will continue to play a key role. However, as we discussed in the previous sections, HH detection in current SDDCN environments is still challenging, especially concerning traffic flow statistics collection and threshold estimation. Aiming at overcoming such challenges, we proposed a novel HH identification approach based on the KDN concept, which takes advantage of ML techniques in SDN. In particular, this paper, firstly, presents a clear understanding of the KDN concept. Secondly, it introduces the approach proposed for Heavy-Hitter identification and details and evaluates its modules. In the Heavy-Hitters Data Analyser module, we performed a cluster analysis using K-means for clustering the flows and employed Silhouette analysis to determine the optimal number of clusters. Based on the obtained results, there is no single consistent threshold that separates flows into HHs and non-HHs; the flow sizes have a diverse character that leads to more than two natural clusters. We stress that threshold selection must include a detailed analysis of the network and its traffic. In the Heavy-Hitters Application module, we presented MiceDCER, an algorithm that efficiently routes non-HHs by assigning internal PMAC addresses to the edge switches and hosts. Our evaluation reveals that MiceDCER significantly reduces the number of rules installed in the switches and, therefore, contributes to reducing the delay in SDDCNs. To sum up, the per-module evaluation results corroborated the usefulness and feasibility of our approach for identifying HHs.
As future work, we intend to perform non-threshold-based HH identification. In this sense, we plan to offer a solution based on the per-flow packet size distribution for predicting, at an early flow stage, whether a flow will be an HH.
Author Contributions: The authors contributed equally to this manuscript.

Funding:
The authors would like to thank the Universidad del Cauca, Fundación Universitaria de Popayán, and Universidad del Quindío.