Efficiency of Supervised Machine Learning Algorithms in Regular and Encrypted VoIP Classification within NFV Environment

Cloudification of all computing environments is an ongoing process. It has moved beyond the classical Virtual Machine (VM) and Software-Defined Networking (SDN) approach towards Docker containers, microservices, application functions, network functions, etc. 5G penetration is another trend, and it is built on such platforms. In this environment we investigate the efficiency of supervised machine learning algorithms for classifying regular and encrypted Voice over IP (VoIP) traffic, on which 5G relies, within a Network Functions Virtualization (NFV) environment carrying east-west network traffic. We use statistical methods to classify network packets without inspecting the payload data and without using the source, destination and port information of the packets. The efficiency is analyzed in terms of classification precision, but also in terms of time consumption, since adding delay to the original traffic may cause problems, especially within 5G environments where packet delay is crucial.


Introduction
VoIP communication is established as one of the most common services used by the general population. Multiple systems and applications provide peer-to-peer VoIP services, both as unencrypted and as encrypted traffic. As systems migrate towards cloud virtualization, using public (e.g. Amazon's AWS, Microsoft's Azure, Google Cloud), mixed or private clouds, classification and deep packet inspection (DPI) of traffic is a standard requirement for any organization. It enables strong security within the network, application and data monitoring, as well as network optimization and provision of Quality of Service (QoS). The trend in virtualization of machines and appliances has moved towards Docker instances, microservices, application functions and networking functions [1]. 5G technology relies on these services, and its data within the cloud moves in the east-west direction, not leaving the virtual layer. East-west traffic means that the network data moves inside the datacenter, usually managed by the cloud operator, between the servers and within the virtual layer. Applications such as Skype and Viber (and many more) use encryption, which makes DPI even more problematic. At the same time, the latency of the network packets must be minimal [2].
For these reasons, the paper explores the efficiency of established supervised machine learning algorithms in a scenario where classification performance and speed of the algorithm are equally important. The challenge is to monitor and capture the "hidden" east-west traffic [3], where NFV elements are already operating, and to classify it while minimizing the latency added by the classification algorithm. 5G networks use VoIP as a standard, and the 5G specification calls for user plane latency of just 1 ms for ultra-reliable low-latency communications (URLLC) [2]. The study provides a novel scenario that resembles the target architecture of systems that will be built upon 5G and NFV elements, and it is also novel in the variety of ML algorithms tested and in the evaluation of DPI efficiency. It is relevant from functional, security, control, QoS and management aspects.
Multiple works use a machine learning approach for packet inspection [4], [5], [6], [7]. ML algorithms have been used for DPI in traditional networks for a long time, and their behavior has been confirmed by many studies [8], [9], [10], as well as in practice. To the best of our knowledge, there is no research focused on encrypted and unencrypted VoIP within an NFV environment from the point of view of both precision and speed of supervised ML algorithms.
In particular, our paper focuses on 6 different ML algorithms: Bayes Net, Naive Bayes, J48 (an implementation of C4.5), K-Nearest Neighbors (K-NN), Decision Tree and AdaBoost. The classifications are made with Weka [10]. These algorithms were chosen because they are the most common ML algorithms used in traditional networks and have proven reliable in practice. Due to the encryption mechanisms used for VoIP network traffic and the specifics of east-west traffic, traditional DPI mechanisms cannot be used. In our analysis we do not use the payload data of the network packets, nor the communication ports, IP addresses and MAC addresses of the source and destination entities. We work with the statistical features of the packets and the packet flows in general to create a training set for the ML algorithms. After training, we test the precision as well as the speed of the algorithms on the testing set. We have created a testing environment in which the traffic is sniffed inside an Open vSwitch, listening directly on the east-west traffic without introducing an external probe or an SDN device to collect the data. All network traffic is observed as a whole, including the communication among network elements as well as management networking data, because this is a realistic scenario in practice. VoIP (both encrypted and unencrypted) is recognized successfully under these conditions. Because ML algorithms consume CPU within virtualized environments and add to the packet latency, the speed of the ML algorithms is a very important aspect of the overall efficiency.
The remainder of the paper is organized as follows: we briefly review the related work in Sec. 2, after which the experimental setup and the dataset creation are explained in Sec. 3. The results analysis and the conclusion follow in Sec. 4 and Sec. 5, respectively. Future work is outlined at the end.

Related Work
Deep Packet Inspection is a crucial service in digital environments. As the ambition of 5G is to unify many services on one platform, thus providing a basis for further development of systems and applications, it is highly important to have a valid DPI that provides viable results and at the same time does not increase the network latency. Some studies focus on DPI in SDN [11], [12], [13], while others focus on the security aspects of DPI [14], [15], often proposing the introduction of probes or SDN appliances within the network. Network traffic classification in traditional networks is studied in [16] and [17], but these works do not consider the use of ML algorithms in an NFV environment.
The authors of [18] propose a design of virtual network functions that flexibly select and apply the most suitable machine learning classifiers at run time. They analyze multiple ML algorithms, such as K-Nearest Neighbor, Support Vector Machine, Decision Tree, AdaBoost, Naive Bayes and Multi-Layer Perceptron. The experimental results show an improvement of 13% in the accuracy of the flow classification using the proposed NFV.
Vergara-Reyes et al. [4] introduced an NFV environment in which different types of TCP traffic are generated. Network packets are captured and analyzed using three different ML algorithms: J48, Naive Bayes and Bayes Net, to provide a benchmark on the performance of the algorithms. Statistical parameters of the individual packets are taken into consideration to prepare the training and testing sets for the ML algorithms. Three different datasets are created: traditional, virtual and combined to better characterize the traffic in the NFV based networks. On the other hand, in our research we are working with statistical parameters of packet flows within an NFV environment typical for cloud platforms, focusing on both encrypted and unencrypted VoIP traffic.
Alshammari et al. [5] cover the DPI of VoIP traffic within traditional networks using real data from existing network environments with different topologies (with and without firewalls) and different access methods (WiFi or Ethernet), to evaluate the precision of three ML algorithms: C5.0, AdaBoost and GP Classifier. A subset sampling technique and statistical tests of precision and false positive rates are used to evaluate the ML algorithms. Their research has shown that C5.0 achieved the best performance, with the highest precision and the lowest false positive rate. Our work is focused on a cloud-based environment, with an emphasis on NFV and supervised ML algorithms.
The work [19] proposes a runtime predictive analysis system that runs in parallel with the existing reactive monitoring systems of a network operator. A deep learning-based approach is used to identify anomalous events from NFV system logs, in order to detect faulty conditions and take the necessary proactive actions within the network.
In [20], machine learning based classification of multi-service internet traffic is evaluated in terms of CPU and memory consumption. Our paper complements this research, as we observe the time needed for various ML algorithms to perform the classification.
The authors of [21] are researching the effect of NFV elements placement on the network traffic, especially on the increase or decrease of the volume of the processed traffic. An algorithm that determines the flow path and then proposes a Least-First-Greatest-Last routing is developed.
The work of Bonfiglio et al. [22] is on Skype and its generated traffic specifics regardless of the underlying architecture. It deals with two different approaches for revealing Skype encrypted traffic in real time, based on the statistical parameters of the generated packets. DPI and flow correlation are used to assess the effectiveness of the proposed approaches.
In [23], ML algorithms, big data analytics platforms, SDN and NFV elements are used to build a comprehensive framework for developing future 5G Self-Organizing Network (SON) applications, as well as a framework for clustering, forecasting, and managing traffic behaviors for a huge number of base stations with different statistical traffic characteristics and different cell types (GSM, 3G, 4G). Traffic flows are analyzed and SDN-based QoS control is implemented to enable bandwidth guarantees for each application. In this case study, 5 different ML algorithms are used to accurately classify mobile applications. Different types of encrypted traffic were used for classification, showing that the Random Forest algorithm has the best overall performance relative to the other tested algorithms. Compared to this work, we focus on VoIP and on ML algorithm performance in terms of accuracy and speed in an environment where NFV elements are deployed, a scenario feasible for future 5G development.
In general, our work introduces a testing setup similar to [4] and [23], adding new elements to the testing environment, such as virtual hosts with internet access from which VoIP is generated. Encrypted and unencrypted UDP-based VoIP traffic, along with various random TCP and UDP traffic, is generated and classified using not only the packets themselves but also the packet flow statistics. Skype and Viber are chosen as the most widely used peer-to-peer VoIP clients that employ encryption. Furthermore, the paper proposes a novel testbed setup in the context of 5G, built upon a virtualized environment in which NFV elements are used. This is an expected setup for systems that work in the virtualized plane and use NFV and 5G communication. The data is analyzed directly in the network data flow between the NFV elements, without the introduction of physical or SDN probes. As shown in the following sections, both the precision and the speed of six different algorithms are evaluated to determine which algorithm performs best within the target scenario.

Experimental Environment and Dataset Creation
To simulate east-west traffic within a virtualized NFV-based network, we have created an experimental environment in which Oracle VirtualBox [24] is installed on a single physical host running Ubuntu 18.04 Server. Open vSwitch (OVS) [25], [26] is installed on the host for network communication, which makes it possible to intercept and capture all network packets passing through it. We use Wireshark and tshark [27] for network capture. Figure 1 shows the experimental environment. The Mininet [28] network simulator, running on 2 different VMs, is used to create 2 networks with multiple hosts, switches and links among them. The networks have private IP addresses and communicate with each other through GRE tunneling within the OVS. Some of the simulated hosts are NAT-ed and have internet access. The simulated networks are controlled by a Ryu Controller [29] in a dedicated controller VM. To generate TCP and UDP traffic, the Distributed Internet Traffic Generator (D-ITG) [30] is used on the hosts created in the Mininet networks. D-ITG produces traffic at packet level, replicating appropriate stochastic processes for both the IDT (Inter Departure Time) and PS (Packet Size) random variables.
Three additional VMs, also connected to the OVS, with Skype and Viber clients installed on them, simulate the encrypted UDP VoIP traffic in a peer-to-peer communication. Random audio calls are performed among clients with random duration.
The traffic goes inside the OVS in an east-west direction. Internet is needed for initial contact to Skype and Viber servers, after which the communication is entirely inside the OVS in the east-west direction.
We have simulated 50 different network traffic scenarios, generating TCP and UDP streams using D-ITG. Encrypted VoIP was generated using Skype and Viber clients. To classify the network traffic, 3 labels were used: VOIP for unencrypted VoIP, EncVOIP for encrypted VoIP and OTHER for all other network packets.
Each experiment lasted between 4 and 20 minutes and produced one dataset. In every experiment, a different network scenario was simulated, with different Mininet hosts used to generate and receive the network traffic. The average length of an experiment was 625 s. The average number of packets was 1,262,375 and the average number of flows was 4090. Skype and Viber calls were conducted randomly, with lengths from 10 seconds to 3 minutes.
When the traffic is observed, multiple flows can be seen within the OVS, originating from the traffic between virtual hosts and from the management traffic generated by the hypervisor and the controller. Features that are not valid in an NFV environment or in encryption scenarios are not used. Such features are the source and destination IP and MAC addresses and the communication port, which are not distinguishable when encryption is used. We chose to generate flow-based datasets. Similar to [5], we define a flow as a bidirectional connection between two hosts. TCP flows are ended either by flow time-out or by connection tear-down. Through observation of the traffic and from experience with traditional networks, the flow features explained in Tab. 1 were selected as attributes for the characterization of the flows. The main goal is to classify the network packet flows based on the statistical characteristics of these attributes, and to make the classification fast enough that only minimal packet delay is introduced.
Weka [10], [32] was used for training and testing on the prepared datasets. We used a 2/3 vs. 1/3 split of each dataset into training and testing sets, respectively. To select the relevant attributes within the dataset, we used the AttributeSelectedClassifier with Ranker as the search method and InfoGainAttributeEval as the evaluator, which determines the information gain that each feature carries with respect to our classification. In this way, only the attributes that carry more information are selected, which reduces the entropy in the dataset. With the prepared datasets, training and testing of the above-mentioned ML algorithms was performed.
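The flow-based dataset construction can be illustrated with a minimal Python sketch. It is not the authors' tooling: the packet dictionary keys and the five statistics computed here are hypothetical stand-ins for the attributes of Tab. 1, and flow time-out handling is omitted. The endpoint fields are assumed to be used only to group packets into bidirectional flows; they are not emitted as classification features.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flow_key(pkt):
    """Direction-independent key, so both directions map to the same flow."""
    a = (pkt["src"], pkt["sport"])
    b = (pkt["dst"], pkt["dport"])
    return (pkt["proto"],) + tuple(sorted((a, b)))


def flow_features(packets):
    """Group packets into bidirectional flows and compute per-flow statistics.

    Each packet is a dict with hypothetical keys: ts (seconds), size (bytes),
    proto, src, sport, dst, dport. The returned feature dicts contain only
    statistical attributes, no addresses or ports.
    """
    flows = defaultdict(list)
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        flows[flow_key(pkt)].append(pkt)

    features = []
    for pkts in flows.values():
        sizes = [p["size"] for p in pkts]
        times = [p["ts"] for p in pkts]
        # Inter-arrival times between consecutive packets of the flow.
        gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
        features.append({
            "packets": len(pkts),
            "duration": times[-1] - times[0],
            "mean_size": mean(sizes),
            "std_size": pstdev(sizes),
            "mean_iat": mean(gaps) if gaps else 0.0,
        })
    return features
```

Each returned dictionary corresponds to one flow instance that would then be labeled VOIP, EncVOIP or OTHER before training.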
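The idea behind ranking attributes by information gain can be sketched in a few lines of Python. This is an illustrative re-implementation, not Weka's code: it assumes that attribute values are already discretized (numeric flow statistics would first have to be binned), and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def info_gain(values, labels):
    """Information gain of one discrete attribute with respect to the class."""
    total = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    # Expected class entropy after splitting on the attribute.
    conditional = sum(len(ys) / total * entropy(ys) for ys in by_value.values())
    return entropy(labels) - conditional


def rank_attributes(rows, labels):
    """Return (attribute, gain) pairs sorted from most to least informative."""
    gains = {name: info_gain([r[name] for r in rows], labels)
             for name in rows[0]}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
```

An attribute whose values separate the classes perfectly receives a gain equal to the class entropy, while an attribute that is independent of the class receives a gain of zero and ends up at the bottom of the ranking.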
The next section explains the experimental results and analyzes their meaning.

Results and Analysis
As explained in the previous section, we have prepared 50 experimental datasets and tested 6 different ML algorithms on them. The final performance metrics are the mean value and the standard deviation of the precision of each algorithm, together with the True Positive Rate (TP Rate) and False Positive Rate (FP Rate) of the classification, which combined give the classification performance of the algorithms.
True Positive (TP) is the number of instances correctly identified as belonging to a class.
False Positive (FP) is the number of instances incorrectly identified as belonging to a class.
True Negative (TN) is the number of instances correctly identified as not belonging to a class.
False Negative (FN) is the number of instances incorrectly identified as not belonging to a class.
Precision of the algorithm [32] is defined as the number of instances correctly identified as belonging to a class, divided by the total number of instances classified as that class.

Precision = TP / (TP + FP).    (1)

TP Rate is the number of instances correctly classified as a class, divided by the total number of instances truly of that class.

TP Rate = TP / (TP + FN).    (2)

FP Rate is the number of instances wrongly classified as a class, divided by the total number of instances that are not of that class.

FP Rate = FP / (FP + TN).    (3)

The results are visually represented in Fig. 2. The best classification is performed by the algorithms that have higher Precision and TP Rate, and lower FP Rate. Figures 3, 4 and 5 show them individually for the ML algorithms in focus. Table 2 shows the results for the 6 ML algorithms. The mean value and the standard deviation of the Precision, TP Rate and FP Rate for the three classes in the 50 datasets were calculated.
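For a three-class problem, Precision, TP Rate and FP Rate are computed per class in a one-vs-rest fashion. The following Python sketch shows this for a list of actual and predicted labels; the function name and the convention of returning 0.0 for an empty denominator are assumptions made for illustration.

```python
def class_metrics(actual, predicted, target):
    """One-vs-rest Precision, TP Rate and FP Rate for the class `target`."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == target and a == target:
            tp += 1      # predicted as target, truly target
        elif p == target:
            fp += 1      # predicted as target, truly another class
        elif a == target:
            fn += 1      # truly target, predicted as another class
        else:
            tn += 1      # neither predicted nor truly target

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "tp_rate": ratio(tp, tp + fn),
        "fp_rate": ratio(fp, fp + tn),
    }
```

Calling the function once for each of VOIP, EncVOIP and OTHER and averaging over the datasets would give per-class means of the kind reported in Table 2.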
When comparing the results for Encrypted VoIP and VoIP traffic, one can see that the Decision Tree and Bayes Net algorithms show the best results, with the highest precision, high TP rate and low FP rate, followed by J48 and K-Nearest Neighbor. Naive Bayes and AdaBoost perform less well, especially in the False Positive Rate, which shows that these algorithms classify other traffic as VoIP and EncVoIP.
The comparison of the mean values of the precision in Tab. 2 shows that Bayes Net has the greatest overall precision, 1.35% higher than J48 and 1.65% higher than Decision Tree. At the same time, Decision Tree has a 0.24% better TP Rate than Bayes Net and a 0.73% better TP Rate than J48. Even more important, Decision Tree has the lowest FP Rate, which is 1.54% lower than that of both of them.
AdaBoost performs the worst in terms of the FP rate, with a value 87.7% higher than that of Decision Tree. Naive Bayes has a 51.2% higher FP rate than Decision Tree.
The K-Nearest Neighbor algorithm is in the middle, with 4.78% lower precision, 1.25% lower TP rate and 4.2% higher FP rate than Decision Tree.
The percentages have been calculated on the mean values of the precision, TP Rate and FP Rate of the three classes explored.
The second characteristic that is important for the overall efficiency is the time needed for the algorithms to perform the classification. The two metrics combined give the whole picture needed to evaluate the algorithms. The measured time is specific to our experimental environment, but the comparison is meaningful because all measurements were made in the same environment and on the same datasets for every algorithm. Using faster or multiple machines for classification can significantly reduce the time needed, but the ratios between the algorithms are expected to stay the same. Because the algorithms consume resources in this way, efficiency in terms of classification time is very important. Table 3 shows that AdaBoost is the fastest algorithm, taking only 0.8% of the time spent by K-Nearest Neighbor. Due to the poor classification performance of AdaBoost, the second fastest algorithm, Decision Tree, has the best overall efficiency for classifying encrypted and unencrypted east-west VoIP traffic within an NFV-based environment, taking only 1.66% of the time needed by K-NN. It is followed by Bayes Net and J48, which also have a good average classification time. Because these two algorithms also classify well, their overall performance is satisfactory. This is shown visually in Fig. 6.
The K-Nearest Neighbor algorithm has good classification performance, but the time needed for classification is very high. The results shown here were obtained using 1 nearest neighbor; experiments using more (2 and 3) nearest neighbors have shown similar or worse performance.
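A relative comparison of this kind, where each algorithm's time is expressed as a percentage of the slowest one, can be obtained with a simple timing harness. The sketch below is an assumption about how such a measurement could be made in Python, not a description of how the times in Table 3 were collected; `classify` stands for any hypothetical function that labels a single flow instance.

```python
import time


def time_classifier(classify, instances, repeats=3):
    """Best-of-N wall-clock time (seconds) needed to classify all instances."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for x in instances:
            classify(x)
        best = min(best, time.perf_counter() - start)
    return best


def relative_times(timings):
    """Express each algorithm's time as a percentage of the slowest one."""
    slowest = max(timings.values())
    return {name: 100.0 * t / slowest for name, t in timings.items()}
```

Because every classifier is timed on the same instances and on the same machine, the resulting percentages remain comparable even though the absolute times depend on the hardware.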
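One likely contributor to the high classification time of K-NN is that a straightforward nearest-neighbor search compares every new flow with every training instance. The source does not analyze the cause, so the Python sketch below is only an illustration of brute-force 1-NN with Euclidean distance, both of which are assumptions.

```python
import math


def nearest_neighbor(train, query):
    """Brute-force 1-NN: return the label of the closest training instance.

    `train` is a list of (feature_vector, label) pairs. Every query scans the
    whole training set, so the cost per classified flow grows linearly with
    the number of training instances.
    """
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

Tree-based classifiers, in contrast, evaluate only a short sequence of attribute tests per instance, which is consistent with the large gap in classification time reported above.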
Naive Bayes has shown a poor classification performance and when compared relatively to the other algorithms, it requires a longer time for classification.
In the context of the strict 5G requirements for low latency, closer observation of the results shows that K-NN and Naive Bayes need more time for classification than the rest of the ML algorithms. AdaBoost performs badly in terms of False Positive instances. The others have satisfactory classification performance, so speed is the decisive element for their usage within a 5G scenario.

Conclusion and Future Work
The paper compares the efficiency of six different supervised machine learning algorithms in classifying VoIP and encrypted VoIP network traffic flowing inside a virtualized environment where NFV elements are placed. This scenario has two main constraints: 1. intercepting the network traffic inside the virtual layer without the need to introduce additional external network probes or SDN elements that would convert the east-west traffic into north-south traffic; 2. classifying the traffic with minimal consumption of resources, since resource consumption would increase the latency of the packets.
Due to this, we have defined the efficiency of each algorithm as an optimal balance between the classification performance and the time consumed by it. We have built an experimental environment and we have conducted multiple tests, generating various network traffic sets from which we have extracted the network data flows. The most relevant statistical features of the flows were selected as the attributes of the datasets. The source and destination IPs and MAC addresses, as well as the communication ports were not taken into consideration because they are not relevant in a virtualized scenario in which encryption is applied.
The results reveal that the Decision Tree and Bayes Net algorithms have the best efficiency, with J48 following just behind them. K-Nearest Neighbor (with k = 1) has shown good classification results, but has spent more time performing the classification. Naive Bayes and AdaBoost have good classification speeds, but show high False Positive rates for the VoIP traffic.
The benefit of this analysis lies in its practical use in systems that are built largely on cloud platforms, where NFV elements are an integral part of the solution. 5G connectivity to such systems is likely to be used, and even the 5G infrastructure itself relies on cloud services. Efficient network traffic classification to identify VoIP traffic is a necessity for enabling QoS, data security, network and application management, monitoring and control.
For future work we plan to expand our experimental environment to larger virtualized environments with multiple hosts, based on various platforms (Hyper-V, VMware, XenServer), as well as to validate the tests in a real TelCo environment.