A Filter Feature Selection Algorithm Based on Mutual Information for Intrusion Detection

Feature selection is used to improve intrusion detection efficiency in the face of a large number of network attacks. This paper proposes a new mutual information algorithm based on the redundant penalty between features (RPFMI), which is able to select optimal features. Three factors are considered in the new algorithm: the redundancy between features, the impact between selected features and classes, and the relationship between candidate features and classes. The proposed algorithm is evaluated for intrusion detection on the KDD Cup 99 intrusion dataset and the Kyoto 2006+ dataset. Compared with other algorithms, the proposed algorithm has a higher accuracy rate (99.772%) on the DOS data and achieves better performance on remote-to-login (R2L) data and user-to-root (U2R) data. On the Kyoto 2006+ dataset, the proposed algorithm has the highest accuracy rate (97.749%) of all the algorithms compared. The experimental results demonstrate that the proposed algorithm is a highly effective feature selection method for intrusion detection.

Unlike feature extraction, which creates new features from the original data features, feature selection selects the best and most relevant subset of the original features. Feature selection methods fall into three categories: the filter method [11,12], the wrapper method [13,14] and the embedded method. The filter method selects the most useful features from the original features and does not depend on the model type. In contrast, the wrapper method evaluates feature subsets using a model or learning algorithm as part of the selection process. The embedded method combines the filter method and the wrapper method.
In the filter methods, feature selection is always related to a certain evaluation function. According to different evaluation functions, the filter methods are divided into five categories: distance, information (or uncertainty), dependence, consistency and the classifier error rate [15].
In recent years, many feature selection algorithms have been proposed [16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35]. When applied properly, feature selection can significantly improve classification processing time and performance. A feature selection method based on deep learning is presented for intrusion detection in Reference [6]; it can improve the detection rate and reduce the false positive rate, but it requires additional time for parameter design and does not achieve good results on small samples. Although the best individual feature model (BIF) is the simplest method based on mutual information (MI), it does not consider the redundancy between features [18]. In the mutual information feature selection (MIFS) method proposed by Battiti [19], the MI between features is used to evaluate the feature correlation; however, the impact between candidate features and classes is not taken into consideration. The minimal-redundancy and maximal-relevance (mRMR) method described by Peng et al. [20] likewise neglects the impact between candidate features and classes. The improved mutual information feature selection (MIFS-U) method described in Reference [21] does not consider the redundancy between candidate features and classes. The conditional mutual information-based feature selection (mMIFS-U or CIMI) method proposed in Reference [22] uses the conditional MI to evaluate feature importance and assumes that the class has no effect on the feature, which does not hold in feature selection. The modified mutual information-based feature selection (MMIFS) method [32] ignores the impact between features and classes in the penalty, and the flexible mutual information-based feature selection (FMIFS) method [33] does not consider the relationship between candidate features and classes.
In References [34,35], a unifying viewpoint on the existing information-theoretic feature ranking literature was presented, but it considers neither the relationship between candidate features and classes nor the impact between selected features and classes in the penalty.
Three penalty factors affect feature subset selection: the MI between features (which reduces redundancy between features), the MI between selected features and classes, and the MI between candidate features and classes. A weakness of the previously reported feature selection algorithms is that they consider only some of these penalty factors, which limits the quality of the selected feature subset and the intrusion detection efficiency. To address the shortcomings of the above algorithms, a new filter-based feature selection algorithm is proposed in this paper. The new algorithm considers all three penalty factors to maximize relevancy and minimize redundancy between features. The feature subset selected by the proposed algorithm is superior to those selected by the above algorithms, and the intrusion detection performance is effectively improved, as shown in Section 4.
The paper is structured as follows. In Section 2, we introduce some concepts about MI. Section 3 reports the proposed new filter-based feature selection algorithm. Experimental results with this feature selection algorithm are shown in Section 4. Finally, conclusions with a discussion on future work are presented in Section 5.

Related Technologies
In this section, we describe some basic concepts about information theory and feature selection, which are used in the proposed feature selection algorithm.
In the classification process, relevant features, which contain important information about the classification result, are useful, whereas irrelevant features (also referred to here as redundant features) are not preferred because they carry little useful information about the classification result. Therefore, the purpose of feature selection is to select as many relevant features as possible and to avoid irrelevant ones. Tools that measure whether a feature is related to the classification result are therefore required; information theory provides such a measure of correlation through MI [36,37].
The entropy, as a function of probability, describes the uncertainty of a random variable. Given two continuous variables $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, where n is the total number of features, the entropy and MI [19][20][21][22] are defined as

$$H(X) = -\int p(x)\,\log p(x)\,dx$$

$$H(Y) = -\int p(y)\,\log p(y)\,dy$$

$$H(X;Y) = -\iint p(x,y)\,\log p(x,y)\,dx\,dy$$

$$I(X;Y) = H(X) + H(Y) - H(X;Y)$$

where H(X) and H(Y) are the information entropy of variables X and Y, respectively [32]; H(X; Y) is the joint entropy of variables X and Y [33]; p(x), p(y) and p(x, y) are the probability density functions. The MI of variables X and Y can be written as [38,39]:

$$I(X;Y) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$$

It can be seen that I(X; Y) is symmetrical [22]. In discrete form, given two discrete variables $F = \{f_1, f_2, \ldots, f_n\}$ and $C = \{c_1, c_2, \ldots, c_n\}$, where n is the total number of features or classes, the entropy and MI are defined as [18,19]:

$$H(F) = -\sum_{f \in F} P(f)\,\log P(f)$$

$$H(C) = -\sum_{c \in C} P(c)\,\log P(c)$$

$$H(F;C) = -\sum_{f \in F}\sum_{c \in C} P(f,c)\,\log P(f,c)$$

$$I(F;C) = \sum_{f \in F}\sum_{c \in C} P(f,c)\,\log\frac{P(f,c)}{P(f)\,P(c)}$$

where F is the original feature set of the input data; C is the class set of the output; P(f), P(c) and P(f, c) represent the probability functions.
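The discrete definitions above can be estimated directly from data by counting. The following is a minimal sketch, assuming the feature and the class are given as equal-length sequences of discrete values; the function names and the choice of base-2 logarithms are illustrative assumptions, not part of the original method.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())


def joint_entropy(xs, ys):
    """Empirical joint entropy of two aligned sequences."""
    return entropy(list(zip(xs, ys)))


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X;Y), estimated from counts."""
    return entropy(xs) + entropy(ys) - joint_entropy(xs, ys)


# A feature identical to the class carries all of its information;
# an independent feature carries none.
labels = [0, 0, 1, 1]
print(mutual_information([0, 0, 1, 1], labels))
print(mutual_information([0, 1, 0, 1], labels))
```

Continuous features would first need to be discretized (or a density estimator used) before counts like these are meaningful.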

Filter Feature Selection Algorithm
This section consists of two parts. The first part introduces a new filter feature selection algorithm. The second part presents the theoretical analysis of the new algorithm and the comparison between the new algorithm and other algorithms.
In the process of feature selection, many types of features containing sufficient information are selected from the original data to determine the output class. In MI-based feature selection proposed by Battiti [19], the task maximizes the relevance between the selected features from the original data and the output class and minimizes the redundancy of the selected features.
To maximize the relevance between the selected features from the original data and the output class, $I(C, f_i)$ is computed. To minimize the redundancy of the selected features, the function of the redundant penalty between features (RPF) is calculated from three quantities: $I(C, f_i)$, the MI of class C and candidate feature $f_i$; $I(C, s_j)$, the MI of class C and selected feature $s_j$; and $I(f_i, s_j)$, the MI of candidate feature $f_i$ and selected feature $s_j$. The feature selection criterion is defined over $F = \{f_1, f_2, \ldots, f_n\}$, the original feature set of the input data; $S = \{s_1, s_2, \ldots, s_k\}$, the selected feature set drawn from the original feature set F; |S|, the number of selected features in S; and $C = \{c_1, c_2, \ldots, c_n\}$, the class set of the output. Our algorithm, named the redundant penalty between features mutual information (RPFMI) algorithm, is described in Algorithm 1.

Algorithm 1. The Redundant Penalty Between Features Mutual Information (RPFMI) Algorithm

Input: F ← {initial set of all n original features}, S ← {empty set}
Output: S ← S ∪ {f+}
for i = 1; i ≤ n; i++ do
    calculate H(C), H(f_i), H(C, f_i) and I(C, f_i)
end for
select the feature f* ∈ F that maximizes I(C, f_i)
F ← F \ {f*} and S ← {f*}
for f_i ∈ F and s_j ∈ S do
    compute I(f_i, s_j) and I(C, s_j)
end for
while F ≠ ∅ do
    select the feature f+ ∈ F using Equation (10)
    F ← F \ {f+} and S ← S ∪ {f+}
    for f_i ∈ F and s_j ∈ S do
        compute I(f_i, s_j) and I(C, s_j)
    end for
end while

Because it is driven by the importance of features and the relevance between features, the proposed feature selection algorithm can only rank the features; it cannot by itself determine the optimal subset of the selected features. Therefore, we start with the best feature and incrementally add features to the classifier one by one. The final optimal feature subset is the one selected on the training data [32].
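The greedy structure of Algorithm 1 can be sketched in a few lines. The sketch below is an illustration only: it does not reproduce the RPF penalty of Equation (10). Instead, the penalty is a pluggable function that receives the three MI quantities named in the text, and a MIFS-style stand-in (which uses only the feature-to-feature MI, with a hypothetical weight `beta`) is supplied as the default. Averaging the penalty over the selected set is also an assumption. All function and parameter names are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical MI (bits) between two aligned sequences of discrete values."""
    def h(values):
        n = len(values)
        return -sum((k / n) * log2(k / n) for k in Counter(values).values())
    return h(xs) + h(ys) - h(list(zip(xs, ys)))


def mifs_style_penalty(mi_fs, mi_cs, mi_cf, beta=0.3):
    """Stand-in penalty, NOT the RPF: it uses only I(f_i, s_j).

    mi_fs = I(f_i, s_j), mi_cs = I(C, s_j), mi_cf = I(C, f_i).
    A three-factor penalty would combine all three arguments.
    """
    return beta * mi_fs


def greedy_rank(features, labels, penalty=mifs_style_penalty):
    """Rank features greedily; `features` maps a name to its column of values."""
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    remaining = set(features)
    # Start with the single most relevant feature (ties broken by name).
    first = max(sorted(remaining), key=relevance.get)
    selected = [first]
    remaining.remove(first)
    while remaining:
        def score(name):
            total = sum(
                penalty(mutual_information(features[name], features[s]),
                        relevance[s], relevance[name])
                for s in selected)
            # Relevance minus the mean penalty over the selected set (assumption).
            return relevance[name] - total / len(selected)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


labels = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],  # high bit of the class
    "b": [0, 0, 0, 0, 1, 1, 1, 1],  # exact duplicate of "a"
    "c": [0, 0, 1, 1, 0, 0, 1, 1],  # low bit of the class
}
print(greedy_rank(features, labels))  # ['a', 'c', 'b']
```

In this toy example all three features are equally relevant on their own, but the duplicate "b" is ranked last because it is fully redundant with the already selected "a". The resulting ranking would then be fed to the classifier one feature at a time, as described above, to pick the final subset.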
On the one hand, like the MIFS [19], mRMR [20], MIFS-U [21], CIMI [22], MMIFS [32] and FMIFS [33] methods, the proposed RPFMI algorithm is based on MI. It therefore rests on the same theoretical foundation as those algorithms: to the extent that they are valid, so is the RPFMI algorithm. The effectiveness of the proposed algorithm will be demonstrated with experimental results in Section 4.
On the other hand, the proposed RPFMI algorithm differs from other algorithms in its penalty factors: it considers not only the relationship between features but also the relationship between selected features and classes. Moreover, the RPFMI algorithm takes into account the relationship between candidate features and classes. Considering all three relationships is what makes the proposed algorithm more effective than the other algorithms.
To analyze the complexity of the proposed filter feature selection algorithm, the training data are described as follows: M is the number of features per sample; N is the number of samples; and |S| is the number of selected features. In the first three phases of the proposed algorithm,

Experiment and Results
In this section, the proposed algorithm is applied to intrusion detection. Firstly, the datasets used in the experiments are introduced and the intrusion detection data preprocessing is described. Secondly, performance metrics are defined to evaluate the proposed algorithm. Thirdly, the proposed algorithm is tested on the datasets and compared with other methods.

Data Set
The datasets used in this paper are the Knowledge Discovery and Data Mining (KDD) Cup 1999 [23][24][25] and Kyoto 2006+ datasets [33,40]. Although the KDD Cup 1999 dataset has existed for a long time, it is still the standard tagged dataset [27][28][29][30] and has been widely used in the research and evaluation of network intrusion detection. In comparison, the Kyoto 2006+ dataset is relatively new. It is built on real 10-year traffic data from 2006 to 2015, which are obtained from diverse types of honeypots. Researchers specialized in intrusion detection may take advantage of the Kyoto 2006+ dataset to obtain more practical, useful and accurate evaluation results. It can also be used as a public dataset for verifying network intrusion detection algorithms.
Although the KDD Cup 99 dataset has some shortcomings, it is widely used as a benchmark for IDS evaluation. It comprises three independent datasets: the entire KDD training data, 10% KDD training data and KDD correct data. In the KDD Cup 99 dataset, every network connection represents a data record that consists of 41 features and a label specifying the status of this record.
In the 10% KDD training data, the label includes the normal class and 22 attack types [26,27]. The 22 attack types are divided into four groups: the remote-to-login (R2L), the denial-of-service (DOS), the user-to-root (U2R) and the Probe.
In the KDD correct data, the label includes the normal class and 37 attack types. The 37 attack types are also divided into four groups: the R2L, the DOS, the U2R and the Probe. In the 37 attack types, there are 17 new attack types, which are not present in the 10% KDD training data.
As in many other studies, the size of the dataset is reduced by random selection in our experiments. The data used are the 10% KDD training data and the KDD correct data. The distributions of these data are shown in Tables 1 and 2. As mentioned above, in the KDD Cup 1999 dataset, each record contains 41 features: 3 nonnumeric features and 38 numeric features. The nonnumeric features are the protocol type, service and flag, and they must be transformed into numeric data. The protocol type takes one of three values: tcp, udp and icmp. Based on these values, the "protocol type" feature is transformed into three features. Because the "service" feature takes 70 different values and would heavily increase the dimensionality, it is not used in our experiments. The nonnumeric feature conversion is shown in Table 3. The Kyoto 2006+ dataset is described for evaluating the performance of intrusion detection in References [40,41]. In Section 4.3.5, the 2015 portion of the Kyoto 2006+ dataset is used. Each record in the Kyoto 2006+ dataset has 24 features [40,42]: 4 nonnumeric features and 20 numeric features. The nonnumeric features need to be converted to numeric ones.
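One common way to turn the three-valued "protocol type" feature into three numeric features is a one-hot encoding. The sketch below assumes this is the intended conversion; the column order and the function name are illustrative choices, since the conversion table itself is not reproduced here.

```python
# Assumed column order for the three derived features.
PROTOCOLS = ("tcp", "udp", "icmp")


def one_hot_protocol(value):
    """Map a protocol name to three 0/1 features, one per protocol."""
    if value not in PROTOCOLS:
        raise ValueError(f"unknown protocol type: {value!r}")
    return [1 if value == p else 0 for p in PROTOCOLS]


records = ["tcp", "icmp", "udp"]
encoded = [one_hot_protocol(r) for r in records]
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

The same approach applied to a feature with 70 distinct values would add 70 columns, which illustrates why the "service" feature inflates the dimensionality.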
Data normalization is a process of scaling the value of each feature into a well-proportioned range so that the bias in favor of features with greater values is eliminated from the dataset [33]. In the detection process, the test data are normalized by the Min-Max standardized method. The data conversion is as follows:

$$f' = \frac{f - \mathrm{Min}(f)}{\mathrm{Max}(f) - \mathrm{Min}(f)}$$

where f is the value of a particular feature to be normalized and f' is its normalized value; Min(f) is the smallest value in the feature column; and Max(f) is the largest value in the feature column [3]. After normalization, every feature falls into the same range, [0, 1].
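As a small worked illustration of Min-Max normalization, the following sketch scales one feature column into [0, 1]. The handling of a constant column (where the maximum equals the minimum) is an assumption added to avoid division by zero; the text does not say how that case is treated.

```python
def min_max_normalize(column):
    """Scale a list of numeric values into [0, 1] using Min-Max normalization."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


print(min_max_normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]
```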

Performance Metrics
In order to quantify the detection performance and effectiveness of the proposed algorithm, four performance metrics are applied, which are the detection rate (DR, also known as the true positive rate), precision rate (PR), false positive rate (FPR) and accuracy (ACC) [28]. They can be calculated using the confusion matrix in Table 4. In Table 4, True Positive (TP) is the number of attack samples correctly predicted as attacks; False Positive (FP) is the number of normal samples incorrectly predicted as attacks; True Negative (TN) is the number of normal samples correctly predicted as normal; False Negative (FN) is the number of attack samples incorrectly predicted as normal [43,44]. According to Table 4, these four performance metrics are defined as follows.
The DR is the proportion of attack samples that are correctly predicted as attacks in the test dataset; it is an important metric reflecting the attack detection model's ability to identify attack samples and can be written as:

$$DR = \frac{TP}{TP + FN}$$

The PR is the ratio of the number of attack samples correctly predicted as attacks to the number of all samples predicted as attacks in the test dataset [45]. It measures the number of correct classifications penalized by the number of incorrect classifications [3] and is described as:

$$PR = \frac{TP}{TP + FP}$$

The FPR is the ratio of the number of normal samples that are incorrectly predicted as attacks to the number of all normal samples in the test dataset [46]; it is a metric that reflects the ability to identify normal samples and is defined as:

$$FPR = \frac{FP}{FP + TN}$$

The ACC is the ratio of the number of samples correctly predicted in the test dataset to the total number of samples [43,44]; it is an overall evaluation metric that reflects the ability of the detection model to distinguish between normal and attack samples and is written as:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
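The four metrics follow directly from the confusion-matrix counts. The sketch below computes them with the FPR taken in its standard form, FP / (FP + TN), which is stated here as an assumption. The counts in the usage example are hypothetical and are not results from this paper, and the function assumes every denominator is non-zero.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return DR, PR, FPR and ACC as fractions from confusion-matrix counts."""
    return {
        "DR": tp / (tp + fn),                    # detection rate (true positive rate)
        "PR": tp / (tp + fp),                    # precision rate
        "FPR": fp / (fp + tn),                   # false positive rate (standard form)
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # overall accuracy
    }


# Hypothetical counts for a test set of 2000 attack and 2000 normal samples.
metrics = detection_metrics(tp=1990, fp=5, tn=1995, fn=10)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.3f}%")
```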

Experiment and Analysis
The experiments are run on a computer with 3 GB of memory, a 500 GB hard disk, a 2.93 GHz CPU and the Windows 2007 operating system. To evaluate the detection accuracy of the proposed feature selection algorithm, a Support Vector Machine (SVM) classifier with a radial basis function kernel is used [29][30][31]. After the feature selection, the selected features are evaluated with this classifier. Because three types of attack data in the KDD Cup 99 dataset [23][24][25] are used in the experiments, three SVM classifiers are needed. Each SVM classifier is trained and tested on two types of data: normal data and attack data. According to the distributions of the training and test data, DOS records are more numerous than U2R and R2L records. Therefore, the ratio of normal data to attack data is 1:1 in the DOS experiments and 9:1 in the U2R and R2L experiments. The ratio of normal data to attack data is also 1:1 in the experiments on the Kyoto 2006+ dataset [33,40]. Every reported result is the mean over 100 experiments performed on the KDD Cup 99 and Kyoto 2006+ datasets using the proposed feature selection algorithm.
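The class ratios described above can be produced by drawing a fixed number of normal and attack records for each experiment. The following is a rough sketch of such a sampler, not the authors' procedure: the function names are hypothetical, and falling back to sampling with replacement when a class has too few records is one possible reading of the repeated sampling mentioned for the small attack classes.

```python
import random


def draw(pool, k, rng):
    """Draw k records; sample with replacement only if the pool is too small."""
    if k <= len(pool):
        return rng.sample(pool, k)
    return [rng.choice(pool) for _ in range(k)]


def build_split(normal, attack, n_normal, n_attack, seed=0):
    """Return a shuffled list of (record, label) pairs; label 1 marks an attack."""
    rng = random.Random(seed)
    data = [(r, 0) for r in draw(normal, n_normal, rng)]
    data += [(r, 1) for r in draw(attack, n_attack, rng)]
    rng.shuffle(data)
    return data


# Placeholder records: 5000 normal connections and only 50 attack connections.
normal_pool = [f"normal-{i}" for i in range(5000)]
attack_pool = [f"attack-{i}" for i in range(50)]

# A 9:1 split with 1800 normal and 200 attack records.
split = build_split(normal_pool, attack_pool, n_normal=1800, n_attack=200)
print(len(split), sum(label for _, label in split))  # 2000 200
```

Repeating this with different seeds and averaging the metrics would correspond to reporting the mean over many experiments.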

Denial-of-Service Test Experiment
This section presents the detection performance of the proposed feature selection model RPFMI and other models for DOS and normal samples in the KDD Cup 99 test set. In the training phase, the numbers of normal and DOS samples are 10,000 and 10,000, respectively, while in the test phase, the numbers of normal and DOS samples are 2000 and 2000, respectively.
The features selected by the different feature selection methods with the SVM classifier are summarized in Table 5. The MMIFS [32] method selects the smallest number of features. The numbers of selected features in the MIFS-U (β = 0.3) [21] and RPFMI methods are the same. Table 6 shows the confusion matrices for the normal and DOS data in the different models. The sum of the FP and FN in the RPFMI + SVM model is the smallest. The models are then compared in terms of the ACC, DR, FPR and PR. The proposed algorithm obtains the highest ACC (99.772%) of all the feature selection algorithms (Figure 1). Figure 2 shows that the RPFMI + SVM model has the same DR as the other feature selection methods. The FPR of RPFMI + SVM is 0.003%, the second lowest, behind only the CIMI model [22], as shown in Figure 3. The PR of the RPFMI model ranks second (Figure 4), so it is higher than that of all but one of the other feature selection methods. Therefore, the RPFMI algorithm is advantageous over most of the other feature selection methods in improving both the ACC and PR and in reducing the FPR.

User-to-Root Test Experiment
In this section, the detection performances of different models are shown for the U2R and normal samples in the test set. In the training phase, the numbers of normal samples and U2R samples are 1800 and 200, respectively, while in the testing phase, the numbers of normal samples and U2R samples are 1800 and 200, respectively.
In the U2R test data, due to the small amount of data, repeated sampling is used. The features selected by the different feature selection methods are summarized in Table 7 and are used to detect U2R. The numbers of selected features with the MIFS (β = 0.3) [19], mRMR [20], MMIFS [32] and RPFMI algorithms are almost the same. Table 8 shows the confusion matrices for the normal and U2R data in the different models. Among these 10 models, the RPFMI + SVM model has the fourth largest TN and the second highest TP. The ACC, DR, FPR and PR of these models are illustrated in Figures 5 to 8. The highest ACC (96.19%) is obtained with the RPFMI algorithm (Figure 5). Compared with the other models, the RPFMI model has the second highest DR (Figure 6), and its FPR is the third lowest, behind only the MIFS [19] and FMIFS [33] models (Figure 7). The RPFMI model ranks second in terms of the PR (Figure 8). Therefore, the proposed RPFMI algorithm demonstrates its ability to improve both the ACC and the PR.

Remote-to-Login Test Experiment
This section presents the detection performance of different models for the R2L and normal samples in the test set. In the training phase, the numbers of normal and R2L samples are 1800 and 200, respectively, while in the testing phase, the numbers of normal and R2L samples are 1800 and 200, respectively.
Table 9 summarizes the features selected by the different feature selection methods, which are used to detect R2L. The number of selected features with the RPFMI algorithm is the smallest. Table 10 shows the confusion matrices for the normal and R2L data in the different models. Among these methods, the RPFMI algorithm has the smallest FP and the largest TN. The methods are then compared in terms of the ACC, DR, FPR and PR. The ACC of the RPFMI algorithm (91.077%) is the highest (Figure 9). The DR of the RPFMI algorithm ranks fourth (Figure 10), whereas its FPR is lower than those of all the algorithms except MIFS [19], mRMR [20] and CIMI [22] (Figure 11). The RPFMI algorithm also achieves the highest PR (99.403%) (Figure 12). In summary, the proposed RPFMI algorithm can improve the ACC and PR.
Figure 9. Accuracy on the R2L test data.

Kyoto 2006+ Test Experiment
This section presents the detection performances of different models for the attack and normal samples in the Kyoto 2006+ dataset of which the dataset in 2015 is used. In the training phase, the numbers of normal and attack samples are 10,000 and 10,000, respectively, while in the testing phase, the numbers of normal and attack samples are 2000 and 2000, respectively.
Table 11 summarizes the features selected by the different feature selection methods, which are used to detect attacks. Table 12 shows the confusion matrices for the normal and attack data in the different models. The FP of the RPFMI + SVM model is the third smallest, exceeding only those of the CIMI [22] and FMIFS [33] models. The ACC, DR, FPR and PR of the various models are presented in Figures 13 to 16. The ACC of the RPFMI algorithm (97.749%) is the highest (Figure 13). The DR of the RPFMI algorithm is higher than those of the MIFS-U [21], CIMI [22], MMIFS [32] and FMIFS [33] algorithms (Figure 14), and its FPR is the third lowest, behind only the CIMI [22] and FMIFS [33] algorithms (Figure 15). In terms of the PR, the RPFMI algorithm ranks third among all the methods (Figure 16). These results verify the ability of the proposed RPFMI algorithm to enhance the ACC and decrease the FPR.

Conclusions and Future Work
In this paper, a new filter-based feature selection algorithm based on MI, called the RPFMI algorithm, has been proposed. In this algorithm, three factors are considered: the redundancy between features, the impact between selected features and classes, and the relationship between candidate features and classes. Through the proposed RPFMI algorithm, a good subset of features is selected to improve the accuracy of intrusion detection. Moreover, the experiments show that the proposed RPFMI algorithm can be applied well to both large and small samples. For large samples, the proposed RPFMI algorithm improves the ACC. For small samples, it improves the ACC and also achieves a good PR while maintaining a high ACC.
In future work, we will use the proposed RPFMI feature selection algorithm for anomaly detection with Byzantine fault tolerance [47,48]. In addition, the data distribution can impact the performance of the feature selection algorithm and needs to be considered.
Author Contributions: F.Z. proposed the idea and conceptualization. F.Z. performed experiments, data analysis and scientific discussions and wrote the article. J.Z., X.N. and S.L. revised the clarity of the work as well as helping to write and organize the paper. Finally, Y.X. assisted in the proper preparations, English corrections and submission of the paper.