Edge Computing Intelligence Using Robust Feature Selection for Network Traffic Classification in Internet-of-Things

Internet-of-Things (IoT) devices are massively interconnected and generate a massive amount of network traffic. The concept of edge computing brings a new paradigm to monitor and manage network traffic at the network’s edge. Network traffic classification is a critical task to monitor and identify Internet traffic. Recent traffic classification works suggested using statistical flow features to classify network traffic accurately using machine learning techniques. The selected classification features must be stable and work across different spatial and temporal heterogeneity. This paper proposes a feature selection mechanism called Ensemble Weight Approach (EWA) for selecting significant features for Internet traffic classification based on multi-criterion ranking and selection mechanisms. Extensive simulations have been conducted using publicly-available traces from the University of Cambridge. The simulation results demonstrate that EWA is capable of identifying a stable feature subset for Internet traffic identification. EWA-selected features improve the mean accuracy by up to 1.3% and reduce RMSE while using fewer features than other feature selection methods. The smaller number of features directly contributes to shorter classification time. Furthermore, the selected features can train stable traffic classification generative models irrespective of the dataset’s spatial and temporal differences, with consistent accuracy of up to 97%. The overall performance indicates that EWA-selected statistical flow features can improve the overall traffic classification.


I. INTRODUCTION
The introduction of the Internet-of-Things (IoT) has benefited numerous sectors such as healthcare, manufacturing, finance, and entertainment. The massive interconnectivity of IoT devices raises serious concerns because it results in high network traffic. Monitoring and managing network traffic, especially at the network's edge, requires accurate and efficient network traffic classification. One of the factors for efficient and accurate network traffic classification is the selection of classification features that are stable and work across different spatial and temporal heterogeneity.
The associate editor coordinating the review of this manuscript and approving it for publication was Sherali Zeadally.
Traffic application identification is a fundamental and critical task in network traffic management [1]. The limitations of port-based strategies, e.g., [2]-[4], and payload-based strategies, e.g., [3], [5]-[8], prompt the use of statistical flow features, e.g., [9], [10], for traffic classification. Statistical flow features provide more flexibility to identify network traffic than port-based and signature-based strategies, since this type of traffic identifier is not affected by detection avoidance mechanisms such as non-static port numbers and payload encryption.
Identifying the classes of Internet traffic using statistical flow features is non-trivial because of the high dimensionality of the traffic features used for traffic classification. In principle, using many features would boost the ability to differentiate Internet traffic [11], [12]. Nonetheless, it is not always so in practice because not every feature is informative and useful. Some statistical flow features may be irrelevant and uninformative, while others may be highly correlated with other features and thus redundant [13], [14]. The use of less significant traffic features affects the efficiency and accuracy of network traffic classification [13], [15]-[17]. Different feature selection (FS) techniques have been suggested by researchers [15], [17]-[20] to enhance classification performance and accuracy by discarding irrelevant attributes. Nevertheless, these studies did not consider the stability of the selected features when applied in a situation with different location and time heterogeneity. Moreover, for real-time traffic identification at the network edge, a minimal number of features must be used to improve the classification throughput on edge devices such as middleboxes.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Thus, this work proposes a feature selection method named Ensemble Weight Approach (EWA) for selecting robust statistical flow features for Internet traffic classification. The proposed method first generates candidate features using conventional feature selection methods, then ranks each feature, and finally searches for the best feature combination. Extensive simulations have been conducted using publicly-available traces from the University of Cambridge to evaluate the proposed EWA feature selection. EWA selects fewer features for machine learning classification of Internet traffic, and these features are stable irrespective of the dataset's spatial and temporal differences, improving the overall traffic classification.
The remainder of this paper is organized as follows. Section II explains similar feature selection methods, particularly for network traffic classification. Section III presents the proposed FS method. Section IV describes the experimental setup, while Section V discusses the results. Section VI concludes the paper and recommends future work.

II. RELATED WORK
This section discusses similar feature selection methods, particularly for network traffic classification. We also present a comprehensive review of state-of-the-art feature selection techniques for network traffic classification.

A. ML TRAFFIC CLASSIFICATION
ML is one of the techniques that can be used for IoT. ML is a group of robust strategies for data mining and knowledge discovery [21], [22]. The first work using this technique was [23]. The conventional structure for creating ML models involves sampling the training dataset, extracting features, selecting informative features, and creating the generative model. Once the generative model has been generated, network traffic can be classified based on the preset classes defined during training.
Feature extraction is a method of extracting features that can distinguish a data class over the others. In the case of network traffic classification, distinct attributes such as port [4] and packet inter-arrival time and flow statistics [24] can be used as the classification features. However, the cardinality of possible features can be huge. While classifier training can be done offline, many features will result in a large generative model and require a big memory footprint. Furthermore, extracting a large number of features in real-time classification is not realistic. Hence, feature selection (FS) is required to boost both effectiveness and efficiency since it discards less informative or irrelevant features that benefit both the training and classification phases.

B. THE USE OF FEATURE SELECTION
In machine learning, FS is a commonly used data preprocessing technique. FS methods aim to identify and choose a subset of features that describes the data concept effectively. Simultaneously, FS can reduce the effects of noise and unrelated attributes to yield a good prediction of the data class [17], [25], [26]. Traffic identification can greatly benefit in terms of accuracy and other performance metrics by utilizing the most significant features [27]. The selection of relevant features for network traffic identification is non-trivial for the following reasons:
• It requires a good understanding of the traffic engineering domain to identify which features are relevant.
• Datasets may contain uninformative features that can considerably reduce classification accuracy.
• The efficiency of the identifiers decreases when a huge number of attributes is selected. The storage requirement increases, and the time taken for training and testing the model also increases [28].
Recently, FS strategies have been extensively deployed in many applications, such as identifying informative genes [29], bioinformatics [30], and text categorization [31]. The objectives of the algorithms used for extracting features may differ. However, they share many similarities [32]:
• Finding the minimal-size feature subset that is necessary and sufficient for the target concept [33].
• Choosing a subset of features from a large collection such that the criterion value is optimized over every subset [34].
• Choosing the right feature subset to increase identification accuracy, or reducing the size of the chosen feature set without degrading the prediction accuracy of the built model [35].
• Selecting a small subset such that the class distribution, given only the values of the selected features, closely represents the original distribution [35].
Furthermore, the FS process can be described in four basic stages: subset creation, subset assessment, termination criterion, and result validation [36]. The process starts with subset creation, which employs a particular search approach to yield candidate feature subsets. Subsequently, every candidate subset is assessed using a specific evaluation criterion and compared with the previous best result. The candidate becomes the new best result if it outperforms the previous best. Subset creation and assessment continue until the termination condition is fulfilled. Lastly, the chosen best feature subset is validated using prior knowledge or test data. The search approach and the assessment criterion are the two vital factors in the study of FS.
Subset creation starts from a search point, which could be an empty set, the whole set, or a randomly created subset. From the starting point, feature subsets can be searched in different directions. In forward search, features are added one at a time, whereas in backward search, the least significant feature is removed based on the evaluation criterion. Random search adds or removes features at random to avoid being trapped in a local maximum.

C. FEATURE SELECTION MODELS
FS processes can be divided into two main methods: filter and wrapper methods [37]-[39]. Filter approaches, also known as feature ranking methods, can use the wrapper approach to rank features too. A filter-based FS can also return a subset of features, e.g., the Correlation-based FS method (CFS). The attractiveness of these techniques lies in their simplicity, scalability, and good empirical success [14]. Feature ranking is efficient because it involves only the computation and sorting of scores. Subsets of the main features can be chosen based on feature ranks to create a classifier. Some filter techniques employ ranking conditions based on information-theoretic criteria, including information gain (IG) [40], GainRatio (GR) [27], mutual information [14], and entropy-based measures [41], whereas others use statistics such as Chi-squared statistics [42], T-statistics [43], F-statistics [44], MIT correlation [45], and the Fisher criterion [46].
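As an illustration of why filter-style ranking reduces to computing and sorting scores, the following minimal sketch ranks discrete features by information gain and keeps the top entries up to a cutpoint. The function names, the toy flows, and the feature names `server_port` and `pkt_size_bin` are hypothetical and are not taken from the works cited above; continuous flow statistics would need to be discretized first.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """IG of one discrete feature: H(class) - H(class | feature)."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def rank_features(columns, labels, cutpoint):
    """Return the names of the `cutpoint` best features, highest IG first."""
    scores = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:cutpoint]

# Hypothetical toy flows: discretized feature values and application classes.
labels = ["web", "web", "mail", "mail"]
columns = {
    "server_port": [80, 80, 25, 25],        # separates the two classes
    "pkt_size_bin": ["s", "l", "s", "l"],   # carries no class information
}
top = rank_features(columns, labels, cutpoint=1)
print(top)
```

Any of the other filter criteria listed above could replace `information_gain` without changing the ranking step.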
The wrapper approaches [47], [48] rely on identifying informative features for obtaining a feature subset. Wrappers exploit the performance of the learning machine to appraise the value of feature subsets. The wrapper FS techniques can produce high identification accuracy for a specific identifier at the expense of high computational complexity and less generalization of the selected features to other identifiers. Wrapper techniques commonly surpass filter techniques with regard to the accuracy of the learning machine, and they can be categorized as sequential selection algorithms (sequential forward selection (SFS), sequential backward selection (SBFS), and sequential forward floating selection (SFFS)) and heuristic search algorithms (e.g., genetic algorithm [49]).
The other group of FS is hybrid methods. Every feature evaluation measure (EM) has distinct advantages and disadvantages. Some hybrid FS techniques combine filter and wrapper methods [39], [50]. Lately, the hybrid approach has been widely explored for FS due to its global optimization abilities [51]. The hybrid method proposed in [29] applied rank grouping to combine various FS approaches. The features were combined using a weighted sum of the component rankings acquired from each distinct FS mechanism. This shows that a combination scheme performs better than individual FS techniques. Moreover, Rogati and Yang [52] demonstrated that an increase in performance was achieved by merging different feature selectors.
Moreover, all these methods can be represented in the space of features according to the evaluation measures (EM), generation of successor (GoS), and search organization standards. GoS and Search organization are grouped as generation procedure. These three categories are explained as follows.
• EM is a function applied to evaluate the generated successor.
• GoS is a technique that suggests a successor of the current hypothesis. Several operators can be considered to generate a successor: Backward, Compound, Forward, Random, and Weighting.
• Search technique is applied to drive the FS process using one of these strategies: sequential, exponential, or random.
Moore and Zuev [53] used the Fast Correlation Based Filter (FCBF) feature selection technique for feature reduction and the Naive Bayes algorithm to measure the significance of the feature reduction. The overall classification accuracy based on feature subsets is 84.06%, obtained by using all features. Jun et al. [54] used two feature subsets to classify traffic. The work applied the flow feature subsets to a Support Vector Machine (SVM). Training time was reported at 40 seconds, while the classifier accuracy is 70%. The work in [55] classified traffic using SVM and a random search algorithm for feature reduction. The proposed method did not use UDP traffic, even though network traffic is composed of TCP and UDP packets.
Zhang et al. [16] proposed the WSU_AUC and SRSF FS algorithms. WSU_AUC was employed to choose features from high-dimensional imbalanced data. This work used ten Cambridge datasets together with the UNIBS and CAIDA datasets, and applied the C4.5 decision tree and NBK machine learning algorithms (batch learning methods) to evaluate the proposed FS algorithms. The method computes the WSU value between each feature and the classes and removes redundant features depending on a specific threshold. It also uses the SRSF method to select robust features based on frequency weight. This work selected three features (the server port, the minimum segment size observed, and the total number of bytes sent in the initial window) and achieved an accuracy of more than 94%.

D. CHALLENGES IN FEATURE SELECTION FOR TRAFFIC CLASSIFICATION
The key challenge in selecting features is preserving the appropriate feature subset for accurate traffic identification. Traffic classification accuracy is associated with a small number of appropriate features [13], [15]-[17]. Different FS methods select different sets of significant features, and they do not always select the same number of significant features. This is challenging for the following reasons:
• The representational power of a specific FS approach may limit its search space, which hinders achieving an optimal subset.
• Different FS approaches may produce feature subsets that are only locally optimal in the space of feature subsets.
• A collective method can give an improved approximation to the ranking or the optimal subset of features, which is often not achievable with a single feature selection technique.
Moreover, a broad analysis is required to provide knowledge of the main factors affecting the robustness of FS procedures. Al Harthi et al. [56] proposed an approach named global optimization algorithm (GOA) that focuses on the stability issue. This approach depends only on the frequency of the selected features (ignoring their robustness) and considers Round-Trip Time (RTT) features as part of the selected feature subset, although these features depend on location [57]. Nevertheless, it would be ideal to ensure the robustness of the feature subset, meaning that it remains accurate regardless of location and time heterogeneity and contains only a small number of relevant features. This is important for building traffic identification.

III. PROPOSED ENSEMBLE WEIGHT APPROACH (EWA) FEATURE SELECTION
The conceptual illustration of traffic classification is shown in Figure 1. This framework comprises the learning model that learns from the sampled datasets and the classifier model that classifies incoming traffic based on the learned model. A traffic instance (flow or packet) is represented by various features that measure different aspects of such an instance. A flow refers to a group of packets sharing the same 5-tuple (transport protocol, destination and source IP, destination and source port). A flow can consist of TCP or UDP packets.
Generally, datasets (for example, in the pcap format) are used as the classifier's training sample. Then, FS selects the feature subsets relevant to the target task (in this case, network traffic classification). The learning model is then trained on the selected feature subsets of all training instances. As previously mentioned, the hybrid method combines features based on a weighted sum of the component rankings acquired from distinct FS mechanisms. This approach has been shown to perform better than individual FS techniques. The EWA method consists of three main stages: evaluation of individual FS methods and feature pool generation, weighted ranking of features, and search for an optimal feature subset, as shown in Figure 2. The first stage involves feature extraction and the formation of a feature pool from the outputs of individual FS methods (wrapper and filter FS methods). A cutpoint of twenty features is used as the stopping criterion; this value is chosen because EWA aims to select the fewest possible traffic classification features, and it can be changed as needed. In the second stage, the selected features are ranked, and features observed in different datasets are given higher ranks. In the third stage, EWA applies a widely used sequential search strategy, Sequential Forward Selection (SFS) [58], to remove both irrelevant and redundant features from the initially selected features pool.
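To make the flow definition concrete, the following minimal sketch groups packets by their 5-tuple and derives a few per-flow statistics. The `Packet` type, its field names, and the example addresses are hypothetical; the sketch treats each direction as a separate flow, which is a simplifying assumption rather than the procedure used to build the datasets in this paper.

```python
from collections import defaultdict
from typing import NamedTuple

class Packet(NamedTuple):
    proto: str      # transport protocol, e.g. "TCP" or "UDP"
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    size: int       # packet size in bytes

def group_flows(packets):
    """Group packets by 5-tuple; every distinct key identifies one flow."""
    flows = defaultdict(list)
    for p in packets:
        key = (p.proto, p.src_ip, p.src_port, p.dst_ip, p.dst_port)
        flows[key].append(p)
    return flows

def flow_stats(flow):
    """A few simple statistical flow features computed from packet sizes."""
    sizes = [p.size for p in flow]
    return {
        "packets": len(sizes),
        "bytes": sum(sizes),
        "mean_size": sum(sizes) / len(sizes),
    }

packets = [
    Packet("TCP", "10.0.0.1", 50000, "10.0.0.2", 80, 60),
    Packet("TCP", "10.0.0.1", 50000, "10.0.0.2", 80, 1500),
    Packet("UDP", "10.0.0.3", 40000, "10.0.0.4", 53, 80),
]
flows = group_flows(packets)
stats = {key: flow_stats(flow) for key, flow in flows.items()}
```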

A. STAGE 1: FEATURE POOL GENERATION
This stage evaluates the stability of each feature subset generated by the respective FS technique. Each FS technique generates non-unique feature subsets when applied to the different training datasets. Note that each FS technique uses a distinct method to create feature subsets. The selected features are then evaluated using an ML classifier. Here, the Naive Bayes classifier is applied to evaluate the accuracy on each dataset. Selecting optimal features across different locality and time heterogeneity is difficult. Hence, to make the best of the various FS methods, EWA uses multiple feature selection methods on multiple datasets to create the initial pool of feature subsets. Accuracy and Stability are used as the criteria to select the candidate FS methods. The selected feature subsets are used to create the initial features pool. Features not selected by any of the FS techniques are removed.
Assume a set of training datasets $D = \{D_1, D_2, \ldots, D_{|D|}\}$, let $k$ be the number of candidate FS methods $FS = \{FS_1, FS_2, \ldots, FS_k\}$, and let $F = \{f_1, f_2, \ldots, f_{|F|}\}$ be the potential features that can be used for traffic classification. Moore et al. proposed 248 potential features that can be used for network traffic classification [19]. Let $P_{pool} = \emptyset$ be the initial features pool and $P_k$ the $X$ best-ranked features for $FS_k$, where $X$ is the feature cutpoint. Each $P_k$ is evaluated using an ML classifier, in this work the Naive Bayes classifier, to obtain its accuracy $Ac_k$ and stability $St_k$. Algorithm 1 shows the generation of the initial features pool.
Applying cross-validation, the accuracy $Ac_{k,i}$ due to the feature subset selected by $FS_k$ on dataset $D_i$ is given as

$$Ac_{k,i} = \frac{t_p + t_n}{t_p + t_n + f_p + f_n}$$

where $t_p$, $t_n$, $f_p$, and $f_n$ respectively represent true positives, true negatives, false positives, and false negatives. Accuracy takes values $Ac \in [0, 1]$, where $Ac_k \to 1$ shows accurate traffic classification whereas $Ac_k \to 0$ indicates inaccurate traffic classification.

Algorithm 1 Initial features pool
    P_pool ← ∅
    for i = 1 to |D| do
        for j = 1 to k do
            Generate P_k
            Evaluate Ac_k and St_k
            Select first X features in ranking
            P_pool ← P_pool ∪ P_k
        end for
    end for
    return candidate features as P_pool

Stability $St$ is a measurement that indicates the robustness of the selected features regardless of traffic data variations. A given FS method may generate different feature sets on datasets collected in different periods or localities due to concept drift. Therefore, it is critical to select features that yield high prediction Accuracy and better relative Stability over different samples. This study employed the stability measure suggested by [20] to evaluate the distinct feature selection methods.
An FS method may generate feature subsets $P_a$ and $P_b$ from datasets $D_a$ and $D_b$, respectively, and the two subsets may not be identical. Let $P_k = P_a \cup P_b$. The stability $St_k$ of the features selected by $FS_k$ over the two datasets is estimated according to the measure in [20], in which $|F|$ is the total number of features and $n_{ij}$ is the frequency of a specific feature $f_i$ observed across different datasets $D_j$.
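The pool construction of this stage can be sketched as follows: for every dataset and every FS method, the first X ranked features are kept and merged into one pool. This is a minimal sketch under stated assumptions: the rankings are taken as precomputed lists (best feature first), the dataset names, method names, and feature identifiers in the example are hypothetical, and the accuracy and stability evaluation of each subset is omitted.

```python
def build_feature_pool(rankings, cutpoint):
    """Union of the top-`cutpoint` features over all datasets and FS methods.

    rankings[dataset][method] is a list of feature names, best first
    (assumed to be produced beforehand by the individual FS methods).
    """
    pool = set()
    for per_method in rankings.values():
        for ranked in per_method.values():
            pool |= set(ranked[:cutpoint])   # P_pool <- P_pool U P_k
    return pool

# Hypothetical rankings for two datasets and two FS methods.
rankings = {
    "D1": {"IG": ["f1", "f95", "f3"], "GR": ["f95", "f1", "f7"]},
    "D2": {"IG": ["f1", "f96", "f95"], "GR": ["f96", "f1", "f2"]},
}
pool = build_feature_pool(rankings, cutpoint=2)
print(sorted(pool))
```

Features that never appear within the cutpoint of any ranking (here `f2`, `f3`, and `f7`) are left out of the pool, which mirrors the removal of features not selected by any FS technique.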

B. STAGE 2: WEIGHTED RANKING OF FEATURES
EWA is based on a weighted ranking measure to select robust features using multiple individual FS methods on different traffic datasets. The idea is that an ensemble is superior to individual FS methods, because the most significant features for network traffic classification are likely to be endorsed by most FS methods.
The weighted ranking measure for each feature $f_i$ is $R_{f_i}$, which is the likelihood that $f_i$ is selected by multiple FS methods in different traffic datasets (or by none at all), as shown in Equation (2). The mean value of $R_{f_i}$ in Equation (5) shows high optimality when $avg(R_{f_i}) \to 1$, whereas $avg(R_{f_i}) \to 0$ indicates low optimality.
Let $|D|$ denote the cardinality of the traffic datasets $D$, and let $k$ represent the total number of FS methods used on a single dataset. The weighted ranking of each feature $f_i$ is defined in terms of $O_{i,j,z}$, the weight of feature $f_i$, which depends on its location $L_{i,j,z}$ with respect to the cutpoint value $X$ for each $D_j$ and $FS_z$. A lower value of $L_{i,j,z}$ indicates higher significance. An optimal threshold value is needed for selecting features that are stable and have high weighted ranks, and that are sufficiently unique and reliable. For example, a feature with a high average ranking weight is considered sufficiently reliable. The threshold $B = R_{f_i} - avg(R_{f_i})$ is determined through experimentation. A higher value of $B$ does not necessarily result in higher accuracy, as too few features may remain to classify network traffic.
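The following sketch illustrates one possible reading of the weighted ranking and threshold step. The weight function is an assumption made only for this example: a feature at position L within the cutpoint X receives (X - L + 1) / X, a feature outside the cutpoint receives 0, and the rank of a feature is the mean of its weights over all dataset and FS-method pairs. The selection rule keeps features whose rank exceeds the mean rank by at least the threshold. The function names and the toy rankings are hypothetical.

```python
def position_weight(position, cutpoint):
    """Assumed weight: 1.0 at rank 1, decreasing linearly, 0.0 past the cutpoint."""
    if position is None or position > cutpoint:
        return 0.0
    return (cutpoint - position + 1) / cutpoint

def weighted_ranks(rankings, features, cutpoint):
    """Mean position weight of each feature over all (dataset, method) runs."""
    runs = [ranked for per_method in rankings.values() for ranked in per_method.values()]
    ranks = {}
    for f in features:
        total = 0.0
        for ranked in runs:
            position = ranked.index(f) + 1 if f in ranked else None
            total += position_weight(position, cutpoint)
        ranks[f] = total / len(runs)
    return ranks

def select_stable(ranks, threshold):
    """Keep features whose rank exceeds the mean rank by at least `threshold`."""
    mean_rank = sum(ranks.values()) / len(ranks)
    return sorted(f for f, r in ranks.items() if r - mean_rank >= threshold)

# Hypothetical rankings for two datasets and two FS methods (best feature first).
rankings = {
    "D1": {"IG": ["f1", "f95", "f3"], "GR": ["f95", "f1", "f7"]},
    "D2": {"IG": ["f1", "f96", "f95"], "GR": ["f96", "f1", "f2"]},
}
features = ["f1", "f2", "f3", "f7", "f95", "f96"]
ranks = weighted_ranks(rankings, features, cutpoint=2)
stable = select_stable(ranks, threshold=0.2)
```

With this weighting, a feature that is ranked near the top by most methods on most datasets approaches a rank of 1, while a feature that is rarely selected stays close to 0.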
In the second algorithm, the average weight measures of the features $f_i$ are computed first. Then, each feature whose average weight measure is greater than or equal to the threshold is selected and kept in the set of the best stable features, $P_{\text{ranked subset}}$.
Finally, in the third stage, the important features containing indispensable information about the original features are selected. In this stage, we apply the wrapper approach as a search technique to identify the best candidate features. Search techniques are generally classified into three groups: randomized, exponential, and sequential. This research uses the widely used Sequential Forward Selection (SFS), a sequential search strategy [58]. SFS selects the best combination of features. The selection process begins with an empty set and repeatedly adds a single feature from the superset to the subset as long as the accuracy increases.
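A minimal sketch of a sequential forward search is given below. It starts from an empty set, tries each remaining candidate, adds the one that gives the highest score, and stops when no candidate improves the score. The `score` callback stands in for the classifier accuracy (Naive Bayes in this work); the `toy_score` function, its gain values, and the redundancy penalty are hypothetical and exist only to make the example runnable.

```python
def sequential_forward_selection(candidates, score):
    """Greedy forward search: add one feature at a time while the score improves."""
    selected = []
    best = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every subset obtained by adding one remaining feature.
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] <= best:
            break                      # no single addition improves the score
        selected.append(best_feature)
        best = trial_scores[best_feature]
        remaining.remove(best_feature)
    return selected, best

# Hypothetical stand-in for classifier accuracy on a feature subset.
GAIN = {"f1": 0.60, "f95": 0.25, "f96": 0.20, "f180": 0.05}

def toy_score(subset):
    score = sum(GAIN[f] for f in subset) - 0.06 * len(subset)  # cost per feature
    if {"f95", "f96"} <= set(subset):
        score -= 0.30                                          # redundant pair
    return score

selected, best = sequential_forward_selection(GAIN, toy_score)
print(selected)
```

In this toy setting the search keeps `f1` and `f95` and rejects `f96`, because adding a feature that is redundant with an already selected one lowers the score.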

IV. EXPERIMENTAL SETUP
This section describes the validation of EWA compared to other feature selection methods.

A. VALIDATION PROCEDURE
The validation procedure involves evaluating the proposed EWA feature selection against the IG [59], FCBF [53], and GOA [56] methods in terms of Accuracy (Ac), Stability (St), and Root Mean Squared Error (RMSE).
The following software and tools were used to achieve the set objectives of this work:
• Batch learning algorithms are frameworks that facilitate the selection of the appropriate attributes for the identification of Internet traffic. Naive Bayes (NB) was used as the classifier. This classifier has been applied successfully in various works on traffic classification [60]. It was executed on the Weka open-source platform [61].
• Weka [61], a data mining software, was used to implement the selection of suitable traffic features.
• A laptop with an Intel Core i7-5500U processor, 8 GB RAM, and 1 TB HDD was used for validation purposes.

B. DATASET
EWA was evaluated using the widely accepted traffic datasets from the University of Cambridge [19] (datasets $D_1$ to $D_{10}$). This dataset is among the largest publicly-available network traffic traces and was assembled by a high-performance network monitor over different periods from two different network sites. The sites are designated as Site A and Site B, each hosting about 1,000 Internet-connected users through a full-duplex Gigabit Ethernet link. A high-performance network monitor captures the full-duplex traffic for each traffic set on this connection. Table 2 summarizes the datasets. For the implementation, we used the Weka data mining tool [61]. In the case of the Cambridge dataset, the early-stage packet statistics are not available without access to all raw packets. Hence, the complete flow statistics are used. To give an impartial assessment of all datasets, the Cambridge dataset's mean attributes were recomputed to obtain the total attributes.

C. EVALUATION METRICS
Primarily, the proposed EWA is evaluated in terms of Accuracy and Stability as described in Section III. To measure EWA in relation to other similar works, Similarity ($Si$) and Root Mean Square Error (RMSE) are also used. This research evaluates the similarity with respect to the accuracy of FS techniques when classifying traffic datasets collected at different times and locations. The similarity ($Si$) in terms of accuracy between two candidate datasets, $D_a$ and $D_b$, is defined over $C$, the set of ML classifiers used to evaluate the datasets. The Similarity measure takes values $Si \in [0, 1]$, where a value close to zero indicates low similarity in accuracy across multiple datasets and ML classifiers, while a value approaching 1 indicates high similarity in terms of accuracy. Root Mean Square Error (RMSE), a quadratic scoring rule, is used to measure the average magnitude of error. RMSE gives a relatively high weight to large errors. Hence, RMSE is most useful when large errors are particularly undesirable.
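For reference, the sketch below computes accuracy from confusion counts as (tp + tn) / (tp + tn + fp + fn) and RMSE as the square root of the mean squared difference between predicted and actual values. The function names and all numeric inputs are hypothetical; the two RMSE calls use error vectors with the same total absolute error to illustrate why RMSE penalizes a single large error more than several small ones.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)

def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

acc = accuracy(tp=45, tn=40, fp=5, fn=10)

# Same total absolute error (1.0), distributed differently.
one_large_error = rmse([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
many_small_errors = rmse([0.25, 0.25, 0.25, 0.25], [0.0, 0.0, 0.0, 0.0])
print(acc, one_large_error, many_small_errors)
```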

V. RESULTS AND DISCUSSION
Based on the EWA stages described in Section III, this section explains the results from EWA:
• We evaluated seven FS algorithms in order to choose the top methods.
• The selected FS methods are applied to generate the features pool.
• The weight of the features in the features pool is used to select the best features depending on the threshold.
• The Sequential Forward Selection method and the Naive Bayes classifier are applied to select the best combination of features.
• Lastly, we compare the EWA method with other FS methods: IG [59], FCBF [53], and the GOA method proposed in [56].

A. STAGE 1 RESULT
The cutpoint $X = 20$ is applied for the ranking method. After that, the FS methods that achieved higher mean accuracy were selected. Table 3 presents a comparison of classification accuracy for seven (7) FS methods on four Cambridge datasets ($D_1$, $D_3$, $D_6$, and $D_{10}$). An FS strategy with high mean accuracy is preferred. In descending order of accuracy, the FS methods rank as follows: Chi-square, PC, GR, IG, CSE, CAE, and CV AE, as presented in Table 3. As a result, we select the Chi-square, PC, GR, and IG methods for Stage 1 of EWA.
The selected FS techniques (Chi-square, PC, GR, and IG) are compared based on their accuracy and stability (see Figure 3). None of the FS methods outperformed the others in most cases, as no available FS technique satisfies both criteria (stability and accuracy). For instance, Chi-square FS performed well on the accuracy metric but poorly on the stability metric. Meanwhile, PC performed poorly on both metrics, while IG performed well on stability but poorly on accuracy.
Therefore, it is concluded that each of the evaluated FS methods has its advantages and disadvantages when measured in terms of accuracy and stability. Our motivation for proposing a ranking method based on multi-criterion methods is to identify a stable and optimal subset of features that helps traffic classifiers perform well across different times and locations. In this stage, we also evaluated 248 features (see [19]) using four FS techniques (GR, Chi-square, IG, and PC). The experiment used the ten Cambridge datasets $D_1$ to $D_{10}$ with a cutpoint of twenty, so the best 20 features in each ranking are selected.

B. STAGE 2 RESULT
In this stage, we compute the mean ranking weight of all 248 features $F = \{f_1, f_2, \ldots, f_{248}\}$ by using four FS techniques on the ten Cambridge datasets and filter out all features that have an average weight $R_{f_i} \le 0.005$, which reduces the number of features from 248 to 32, as tabulated in Table 4. Features with a higher mean weight are desired. The features $f_1$, $f_{95}$, $f_{96}$, $f_{180}$, and $f_{187}$ achieve higher mean weights. Table 4 shows the threshold value ($B$) for all 32 selected features. Here, we set $B$ to select the best features according to the best result obtained during evaluation using the Naive Bayes classifier and the Cambridge datasets. Table 5 tabulates the threshold $B$ values, the number of features, and their respective accuracy for each range of $B$. The results show that $B \ge 0.054$ gives better accuracy than the other values of $B$.

C. STAGE 3 RESULT
In this stage, the Sequential Forward Selection (SFS) method and the Naive Bayes classifier are applied to select the best feature combinations as the classification features. The SFS method begins with an empty set and repeatedly adds a single feature at a time until all possible combinations are tested. Table 6 explains the selection of these features.

D. PERFORMANCE COMPARISON WITH OTHER FS METHODS
To avoid bias toward the proposed metrics, EWA is compared with the full feature set (FF) and with baseline FS methods: IG [59], Fast Correlation-Based Filter (FCBF) [53], and the GOA method proposed in [56]. The proposed method was tested and validated using the same metrics as these baselines. Table 7 presents the results of this comparison. EWA improves mean accuracy by up to 4.2% using Naive Bayes on the ten Cambridge datasets while using the smallest number of features (five) among the compared methods. Figure 4 shows that EWA's accuracy is slightly better than that of the GOA method, while the full feature set performs poorly. Table 8 compares EWA and GOA in terms of RMSE. The results indicate that EWA achieves a slight overall improvement over the other FS methods on the ten Cambridge datasets, as shown in Table 8, while the full feature set performs poorly. In the RMSE comparison between EWA and GOA, EWA achieves a slight improvement over GOA, as shown in Figure 5. Figure 6 compares EWA, full features (FF), and the other FS techniques (IG, FCBF, and GOA) in terms of accuracy and stability. The full feature set performs very well on stability but fares poorly on accuracy, because it contains many redundant and irrelevant features. Among the other FS methods, IG performs poorly on accuracy but well on stability, while FCBF performs poorly on both metrics. GOA and EWA outperform the other FS techniques (i.e., IG and FCBF) on both the stability and accuracy metrics, as various FS techniques are incorporated in GOA and EWA to select different groups of relevant features.
Conventional FS methods may not agree on the same relevant features for these reasons:
• Different FS methods may select feature subsets that are locally optimal in the space of feature subsets.
• The search space of any FS technique may be restricted by the technique's representational power such that it may be impossible to reach the optimal subset.
• The combination of more than one approach can produce a better ranking of features or a better approximation to the optimal subset.
In most cases, EWA outperforms all methods in terms of stability and accuracy. Although GOA and EWA have similar stability, EWA outperforms GOA because EWA is based on a weighted ranking measure that allows the selection of robust features from multiple FS techniques on different traffic datasets. Figure 7 compares the full feature set (FF), the FS techniques IG [59] and FCBF [53], and the GOA method [56] with the proposed EWA method in terms of RMSE and the time to build the model (runtime, in seconds). The full feature set has very high RMSE and runtime. FCBF yields high RMSE and low runtime, while IG performs equally poorly on both. GOA and EWA outperform the full feature set and the selected FS techniques (IG, FCBF) on both RMSE and runtime. In most cases, EWA performs better than GOA in terms of RMSE because GOA depends only on the frequency with which a feature is selected. EWA and GOA have similar runtimes because both select only five features for classification.
Tables 9 and 10 compare EWA with GOA in terms of similarity and accuracy. The Naive Bayes and J48 decision tree classifiers are applied to Cambridge datasets D1 and D2 collected at different times (Table 9) and to datasets D1 and D2 collected from different locations (Table 10). The results show that EWA performs better than GOA in both similarity and accuracy because EWA is based on a weighted ranking measure, which allows it to select features chosen by multiple FS techniques across traffic datasets with time and location heterogeneity.
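The difference between frequency-based and rank-weighted aggregation can be illustrated with a small sketch. This is not the EWA or GOA formula: the position-based score, the per-list weights, and the feature names f1 to f4 are assumptions chosen only to show why the two aggregation styles can disagree.

```python
from collections import Counter, defaultdict

def frequency_vote(rankings, k):
    """Keep the k features that appear in the most ranked lists (position ignored)."""
    counts = Counter(f for ranked in rankings for f in ranked)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f for f, _ in ordered[:k]]

def weighted_rank_vote(rankings, weights, k):
    """Keep the k features with the highest weighted, position-aware score."""
    scores = defaultdict(float)
    for ranked, w in zip(rankings, weights):
        n = len(ranked)
        for pos, f in enumerate(ranked):
            # Assumed scoring: top position earns 1.0, last position earns 1/n.
            scores[f] += w * (n - pos) / n
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f for f, _ in ordered[:k]]

# One ranked list per (FS technique, dataset) pair; names are hypothetical.
rankings = [
    ["f1", "f2", "f3"],
    ["f1", "f4", "f3"],
    ["f2", "f1", "f3"],
]

by_frequency = frequency_vote(rankings, k=2)
by_weighted_rank = weighted_rank_vote(rankings, weights=[1.0, 1.0, 1.0], k=2)
```

Here f3 appears in every list but always in last place, so a pure frequency count keeps it, whereas the position-aware score prefers f2, which is ranked first or second wherever it appears.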
The simulation results indicate that EWA can select stable features that remain applicable across time and location heterogeneity. However, in some practical traffic classification use cases that require modularity and scalability, such as hierarchical classification [62], time and location heterogeneity are undesirable. EWA can still be used in these cases because the selected feature subsets depend on the datasets used: by categorizing the training datasets, different feature subsets for hierarchical traffic classification can be obtained.

VI. CONCLUSION
This paper contributes to the selection of robust feature subsets for the identification of Internet traffic. The Ensemble Weight Approach (EWA) feature selection method was proposed to select robust feature subsets for Internet traffic identification. The experimental results showed that no single feature selection technique performs well on all datasets. Based on this observation, we proposed a method that combines the strengths of the individual FS methods to obtain a robust one. The simulation results on real datasets illustrate EWA's capability to identify robust feature subsets for Internet traffic identification. Our findings also show that EWA improves mean accuracy by up to 1.3% and, at the same time, reduces RMSE by up to 0.016 while using a smaller number of features, which directly contributes to improving runtime by up to 0.003 seconds. The selected features can build stable traffic identification models that remain accurate regardless of location and time heterogeneity, with high similarity above 97%.
For future work, we plan to further analyze EWA for the early estimation of statistical flow features. This is important for real-time traffic identification, as only certain features can be extracted on the wire with limited flow or packet observability. We also plan to enhance ML classification with incremental learning, as a forgetting mechanism is needed to enhance traffic classification accuracy over time by removing uninformative features when concept drift occurs. In addition, a real-time traffic detection system can be integrated with any network traffic management system.