Ensemble Method for Anomaly Detection on the Internet of Things

The growth in the number of applications, devices, and protocols connected to the Internet of Things (IoT) generates highly heterogeneous data, while traffic volumes and data imbalance continue to increase. At the same time, advances in technology and knowledge make new kinds of attacks on IoT networks possible. Given the substantial volume of data traffic, a detection mechanism must be able to discern various forms of attack and, ideally, remain reliable when the data distribution is unbalanced. Chi-square feature selection was chosen to deal with the high dimensionality of the data. To enhance intrusion detection on imbalanced data, this study proposes an ensemble method that combines several classification methods, namely Bayes Network (BN), Naive Bayes (NB), REPTree, and J48. The study used the CICIDS-2017 dataset because it has been tested and is frequently used in IDS research. The performance evaluation shows that Ensemble-3 is superior to the other approaches and to previous studies.


INTRODUCTION
The transition from referring to the Internet as a general concept to the more specific term "Internet of Things" has significant implications for the network's continued growth of data [...]

IJCCS ISSN (print): 1978-1520, ISSN (online): 2460-7258. Ensemble Method for Anomaly Detection on the Internet of Things (Kurniabudi)

[...] implemented the Chi-Square technique and ensemble method on the CICIDS-2017 dataset for detecting Benign traffic, Infiltration, DDoS, Web Attack XSS, Port Scan, Web Attack Brute Force, Bot, DoS Slowloris, and Web Attack SQL Injection. The CICIDS-2017 dataset was chosen because it is reliable [14] and has been widely used in IDS research. This research contributes a reliable detection method that can detect attacks on high-dimensional and unbalanced data.

This study aims to develop an ensemble method to detect attacks on the IoT. To this end, several steps were taken. First, chi-square was implemented as a feature selection technique to deal with high-dimensional data. Second, classification algorithms such as J48, NB, REPTree, and BN were tested and evaluated on how well they detect attacks on unbalanced data. Third, an approach that combines several classification algorithms was proposed, and its detection performance was evaluated. Finally, the proposed method's performance was compared with that of previous studies in order to assess its dependability.

METHODS
This study proposes a reliable detection method for high-dimensional and unbalanced data. The research was conducted in five stages, as presented in Figure 1.

Data Preparation
Data preparation is carried out to eliminate unused and redundant features and to handle missing values. The data used in this study is the CICIDS2017 dataset from ISCX UNB [15]. This dataset is used because it accurately depicts the complexity of actual network traffic. CICIDS2017 consists of 2,830,743 records collected over six days during eight observation sessions in various scenarios. The dataset includes both normal (benign) data and attack data. The attack types include Heartbleed, Brute Force, DoS, DDoS, Web Attack, Infiltration, Bot, and Port Scan. This research uses only 30% of the Machine Learning CSV version of the CICIDS-2017 dataset, for a total of 849,223 records, because of the limited computational resources available to the researchers. The tests were run on a 2.70 GHz Intel Core i7 processor with 8 GB of RAM under the Windows 10 operating system. The analysis tool was WEKA 3.8.5, configured with a heap size of 3072 MB. Nevertheless, the portion of the data used was sufficient for the requirements of the experiment.
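The preparation step can be illustrated with a minimal sketch. The paper does not state how the 30% subset was drawn, so simple random sampling is assumed here; the record layout (a dict with a numeric `features` list), the function name `clean_and_sample`, and the seed are all hypothetical.

```python
import math
import random

def clean_and_sample(records, fraction=0.30, seed=42):
    """Drop records with missing or non-finite values, then draw a random subset.

    Assumption: each record is a dict whose "features" entry is a list of numbers.
    """
    cleaned = [
        r for r in records
        if all(v is not None and math.isfinite(v) for v in r["features"])
    ]
    rng = random.Random(seed)           # fixed seed so the subset is reproducible
    k = round(fraction * len(cleaned))  # number of records to keep
    return rng.sample(cleaned, k)

# Size check against the figures quoted in the text: 30% of 2,830,743 records.
print(round(0.30 * 2_830_743))  # 849223
```

Because the classes are imbalanced, a stratified sample (sampling each class separately) may preserve rare attack types better than the plain random sample assumed above.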

Feature Selection
By eliminating irrelevant features, feature selection reduces data dimensions. Feature selection has been demonstrated to be an effective and efficient method for preparing high-dimensional data for various machine learning problems [16]. High-dimensional data can affect the computation of machine learning algorithms, and selecting features in datasets with many dimensions is an excellent way to eliminate redundant information and unnecessary details [17]. Feature selection addresses the "curse of dimensionality" problem by eliminating irrelevant features, thereby reducing the computing time of the detection system.
The feature selection approach employed in this study combines the chi-square attribute evaluator with the ranker search method. The chi-square test is used to eliminate features that are statistically insignificant for the model. This method measures the strength of the dependency between features and classes [22]. The chi-square value is calculated using equation (1).

$$\chi^2(t, c) = \frac{N\,(WZ - XY)^2}{(W + X)(W + Y)(X + Z)(Y + Z)} \qquad (1)$$

Here, W is the number of times the feature 't' and the class label 'c' occur together in the dataset. X is the frequency of occurrence of 't' in the absence of 'c', whereas Y is the frequency of 'c' in the absence of 't'. Z is the frequency of records in which neither 'c' nor 't' occurs. Finally, N is the total number of records in the dataset. This study uses the chi-square method to select features from the CICIDS 2017 dataset. The feature selection process with chi-square is shown in the following pseudocode.
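As a rough illustration of the chi-square score and the ranker step, the sketch below computes the score from the four counts W, X, Y, and Z defined above and keeps the top-ranked features. It is not the authors' pseudocode or the WEKA implementation; the function names, feature names, and counts are hypothetical, and each feature is assumed to be reduced to a binary present/absent indicator for one class.

```python
def chi_square(w, x, y, z):
    """Chi-square score for one feature/class pair from its 2x2 contingency counts.

    w: feature and class occur together   x: feature without class
    y: class without feature              z: neither occurs
    """
    n = w + x + y + z
    denom = (w + x) * (w + y) * (x + z) * (y + z)
    if denom == 0:
        return 0.0  # degenerate table: the feature or the class never varies
    return n * (w * z - x * y) ** 2 / denom

def rank_features(counts, top_k):
    """Sort features by descending chi-square score and keep the first top_k."""
    scored = sorted(
        ((chi_square(*c), name) for name, c in counts.items()), reverse=True
    )
    return [name for _, name in scored[:top_k]]

# Hypothetical (W, X, Y, Z) counts for three made-up features.
counts = {
    "flow_duration": (40, 10, 10, 40),
    "dst_port": (25, 25, 25, 25),
    "pkt_len_mean": (45, 5, 5, 45),
}
print(rank_features(counts, top_k=2))  # ['pkt_len_mean', 'flow_duration']
```

A feature that is independent of the class, such as `dst_port` in this toy example, scores 0 and is ranked last, which is the behavior the ranking relies on.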

Anomaly Detection
At this stage, anomaly (attack) detection is tested using classification algorithms. In IDS research, machine learning is used to detect attacks in data traffic, and several studies have applied NB, BN, J48, and REPTree as classifier algorithms for anomaly detection. The following is a brief review of the classification algorithms used in this study:
− Naive Bayes (NB) uses Bayes' theorem to calculate the probability that a given data point belongs to a particular class. It is called "naive" because it assumes that the features are independent of one another given the class. Due to its simplicity and effectiveness, this method has found widespread application in various settings [13].
− A Bayes Network (BN) is a probabilistic graphical model that represents the relationships among the variables of interest. Its accuracy depends on assumptions about the fundamental behavior of the target system model, and detection accuracy can decrease if these assumptions do not hold [20].
− The Reduced Error Pruning Tree (REPTree) is a decision tree (DT) technique that applies regression tree logic and builds multiple trees over several iterations. It then selects the tree that best represents the data. The measure used for pruning the tree is the mean squared error of the tree's predictions [21].
− J48, an implementation of C4.5, is a popular decision tree algorithm. It builds a DT from a training dataset using the concept of entropy [22]. One notable difference from ID3 is that J48 (C4.5) can process both continuous and categorical attributes.
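The entropy idea behind J48 (C4.5) can be illustrated with a minimal sketch that computes the information gain of a categorical attribute. The attribute values and labels are made up for the example. C4.5 itself normalizes this quantity into a gain ratio and also handles continuous attributes, both of which this sketch omits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a categorical attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ["attack", "attack", "benign", "benign"]
protocol = ["tcp", "tcp", "udp", "udp"]   # separates the classes perfectly
flag = ["a", "b", "a", "b"]               # carries no information about the class
print(information_gain(protocol, labels), information_gain(flag, labels))
```

A tree learner of this kind splits on the attribute with the highest score, so `protocol` would be chosen over `flag` in this toy example.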

Creating Ensemble Method
The ensemble method proposed in this research uses the majority voting technique to combine several classification algorithms. During training, the input data is processed by each of the algorithms. At the end of the process, a vote is taken on their classification results, and the class that receives the most votes is given as the output. This study employed and evaluated NB, BN, J48, and REPTree as traffic classification techniques. Three ensembles were proposed, named Ensemble-1, Ensemble-2, and Ensemble-3, with the following configurations:
− Ensemble-1 (E1): Majority Vote with the Naïve Bayes, J48, and REPTree algorithms.
− Ensemble-2 (E2): Majority Vote with the BN, J48, and REPTree algorithms.
− Ensemble-3 (E3): Majority Vote with the Naïve Bayes, BN, J48, and REPTree algorithms.
The ensembles were tested using the training set mode, which means that all of the input data is used for evaluation. See the following pseudocode.
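The voting step can be sketched as follows. This is an illustrative reading of majority voting over the three configurations, not the authors' pseudocode or WEKA's Vote implementation. The base classifier outputs are hypothetical, and the tie-breaking rule (first label seen wins) is an assumption that matters for the four-member E3.

```python
from collections import Counter

# Ensemble configurations as listed in the text.
ENSEMBLES = {
    "E1": ("NB", "J48", "REPTree"),
    "E2": ("BN", "J48", "REPTree"),
    "E3": ("NB", "BN", "J48", "REPTree"),
}

def majority_vote(predictions):
    """Return the most frequent label; on a tie, the label seen first wins (assumption)."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

def ensemble_predict(name, base_predictions):
    """Combine the predictions of the members of ensemble `name` for one record."""
    return majority_vote([base_predictions[m] for m in ENSEMBLES[name]])

# Hypothetical predictions of the four base classifiers for a single record.
base = {"NB": "BENIGN", "BN": "DDoS", "J48": "DDoS", "REPTree": "DDoS"}
print(ensemble_predict("E1", base))  # DDoS
```

With an even number of members, as in E3, a two-against-two split is possible, so any real implementation needs an explicit tie-breaking policy.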

Comparing Detection Performance
This study evaluates the efficacy of the proposed ensemble method for anomaly detection in the IoT context. The confusion matrix is a fundamental tool for evaluating performance in IDS research. Based on the quantities defined by the confusion matrix, IDS performance can be measured by the True Positive Rate (TPR), False Positive Rate (FPR), Precision, F-Measure (F1-Score), and Accuracy.
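The metrics listed above follow from the confusion matrix by their standard definitions. The sketch below derives them per class in a one-vs-rest fashion; the function name and the two-class example matrix are hypothetical and do not reproduce any result from this study.

```python
def per_class_metrics(matrix, labels):
    """Compute TPR, FPR, Precision and F-Measure per class, plus overall accuracy.

    matrix[i][j] is the number of records of true class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    results = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # true class k, predicted otherwise
        fp = sum(row[k] for row in matrix) - tp  # predicted k, true class differs
        tn = total - tp - fn - fp
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
        results[label] = {"TPR": tpr, "FPR": fpr, "Precision": precision, "F-Measure": f1}
    accuracy = sum(matrix[k][k] for k in range(len(labels))) / total
    return results, accuracy

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
metrics, accuracy = per_class_metrics([[90, 10], [5, 95]], ["BENIGN", "DDoS"])
print(metrics["BENIGN"], accuracy)
```

On imbalanced data, accuracy is dominated by the majority class, which is why per-class TPR, FPR, Precision, and F-Measure are reported alongside it.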

RESULTS AND DISCUSSION
This section presents the outcomes of the experiments. The discussion covers the feature selection results, the performance of the individual machine learning algorithms, and the performance of the ensemble methods.

Feature Selection Result
This section presents the results of the feature selection test. The CICIDS-2017 dataset has 79 network traffic features, not all of which are used to recognize attacks. Apart from reducing data dimensions, feature selection is carried out to select relevant features. As explained in the preceding section, the chi-square approach was used for feature selection in this study. The chi-square test ranks traffic features and identifies those that are statistically significant for both benign and attack traffic. Table 1 displays the chi-square-selected features.
Table 1 The selected features using Chi-square Techniques
By applying the chi-square feature selection technique, 23 of the 79 features were selected. These selected features are then used to detect anomalies with the ensemble method.

Performance Comparison
In addition, this study compared the performance of the proposed method with that of conventional approaches. NB, BN, J48, and REPTree are employed as individual classification methods, and the performance of the proposed ensemble method is compared against them. The objective of this comparison is to assess the reliability of the proposed methodology. The TPR, FPR, Precision, and F-Measure values are used for the comparison.
The TPR values for each classification method are shown in Table 2. These values indicate how well each method detects attacks in the CICIDS 2017 dataset. The TPR values show that the Ensemble-3 approach is superior to the other approaches at detecting attack traffic. With TPR values greater than 0.970, Ensemble-3 correctly identifies almost all types of traffic, except for traffic caused by Web [...]
Description: E1=Vote(NB+J48+REPTree). E2=Vote(BN+J48+REPTree). E3=Vote(NB+BN+J48+REPTree)
Table 3 presents the FPR values for each method. The lowest average FPR is 0.000 and the highest is 0.298, so almost all algorithms have good FPR values.
The precision values for each classification algorithm are presented in Table 4. Ensemble-1 has an average precision of 0.931, Ensemble-2 of 0.956, and Ensemble-3 of 0.979. The J48 classifier has an average precision of 0.961, which exceeds that of Ensemble-1 and Ensemble-2. Overall, Ensemble-3 has the highest average precision. Table 5 presents the F-Measure values of each classification algorithm. Based on the F-Measure value for each traffic class, Ensemble-3 performs better than the other methods.

Accuracy
The accuracy of the classification algorithms and ensemble approaches tested in this study is presented in Figure 5. The ensemble techniques achieve accuracy equal to or higher than that of the NB, BN, J48, and REPTree algorithms. As Figure 5 shows, J48, E1, and E2 each achieve an accuracy of 99.88%, which is very good compared with NB, BN, and REPTree. E3 achieves an accuracy of 99.93%, surpassing NB, BN, J48, REPTree, E1, and E2. Therefore, the proposed approach, E3 in particular, exhibits enhanced performance in anomaly detection. As mentioned previously, a confusion matrix is used to evaluate the efficacy of the detection methods, and the TPR, FPR, Precision, F-Measure, and Accuracy values are all determined from it. Based on the test results for each method, taking into account the TPR, FPR, F-Measure, and Accuracy values, Ensemble-3 outperforms the other methods used in this study.

Comparing with previous work
To establish the reliability of the proposed methodology, it is compared with previous studies. Table 6 compares the accuracy of the proposed ensemble method with that of the methods proposed by prior researchers. The proposed methodology achieves higher accuracy than the prior studies.

CONCLUSIONS
This research aims to improve attack detection on IoT networks characterized by large data traffic volumes. Chi-square feature selection is used to overcome the challenges of high-dimensional data. By ranking features by weight, it indicates which features are essential and relevant, so the dimensionality of the data can be reduced. In this study, a total of 23 features were selected to differentiate effectively between normal network traffic and malicious behavior. An ensemble method is proposed to improve attack detection on high-dimensional, imbalanced data. The proposed ensemble method combines well-established classification algorithms, namely Bayes Network, Naïve Bayes, J48, and REPTree. The proposed ensemble methods, especially Ensemble-3, outperform the other classification algorithms used in this study. The comparison with previous research shows that the accuracy of the proposed method is superior.
Although this research has produced an ensemble method with outstanding performance, several weaknesses should be addressed in future research. The tests were run in WEKA using the Use Training Set mode. Future work needs to test other modes, such as k-fold cross-validation and percentage split. The method can also be developed further to improve the detection of Web Attack SQL Injection, Web Attack XSS, and Bot attacks by using more effective feature selection techniques and other classification algorithms. Because this is initial research, only 30% of the CICIDS2017 dataset was used. Future research will test the method on the full CICIDS2017 dataset and on other relevant, more recent IDS datasets.
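To make the difference between the test modes concrete, the sketch below generates the train/test index splits that k-fold cross-validation would use. Unlike the training set mode, every record is tested by a model that did not see it during training. The function name is hypothetical, and the sketch assumes the records have already been shuffled (ideally stratified by class).

```python
def k_fold_indices(n_records, k):
    """Yield (train, test) index lists for k non-overlapping folds of near-equal size."""
    indices = list(range(n_records))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(10, 3):
    print(len(train), test)
```

Averaging the metrics over the k test folds gives an estimate of performance on unseen traffic, which evaluation on the training set alone cannot provide.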
The five research stages shown in Figure 1 are: Step 1, data preparation; Step 2, feature selection; Step 3, anomaly detection; Step 4, creating ensemble methods; and Step 5, comparing detection performance. Each step is explained in the Methods section.

Figure 1 Research Framework

[...] presents the result of the ensemble-2 test. Based on the TPR, Precision, and F-Measure values, Ensemble-2 is able to detect Benign traffic, DoS_GoldenEye, PortScan, DDoS, Bot, HeartBleed, FTP_Patator, SSH_Patator, Infiltration, DoS_Slowloris, DoS_httptest, DoS_Hulk, and Web_Attack_BruteForce. The test results also show increased precision for the Web_Attack_Sql_Injection and Web_Attack-XSS attacks.

Figure 5. Accuracy of the Proposed Method

# find ideal value of weighted_normal_matrix
ideal ← fin_ideal(wnormal)
# compute chi-square distance between weighted_normal_matrix and ideal
Table 3
Comparison of FPR Values

Table 4
Comparison of Precision Values

Table 5
Comparison of F-Measure Values

Table 6
Comparison with previous research