Research on Detection and Recognition of Abnormal Data caused by Network Intrusion using Deep Learning

Abstract: Based on deep learning, this study combined a sparse autoencoder (SAE) with an extreme learning machine (ELM) to design an SAE-ELM method that reduces the dimension of data features and classifies different types of data. Experiments were carried out on the NSL-KDD and UNSW-NB2015 data sets. The results showed that the proposed method performed better than the K-means algorithm and the SVM algorithm. On the NSL-KDD data set, the average accuracy rate of the SAE-ELM method was 98.93%, the false alarm rate was 0.17%, and the missing report rate was 5.36%. On the UNSW-NB2015 data set, the accuracy rate of the SAE-ELM method was 98.88%, the false alarm rate was 0.12%, and the missing report rate was 4.31%. The results show that the SAE-ELM method is effective in the detection and recognition of abnormal data and can be popularized and applied.


Introduction
With the expansion of networks and the increasing volume of data [1], traditional methods are increasingly unable to meet the needs of detecting and identifying abnormal data and cannot defend the network effectively. The detection and recognition of abnormal data can be regarded as a classification problem. Methods such as machine learning have been widely used in the detection and recognition of abnormal data [2] and have achieved good results. Mitchell et al. [3] applied a behavior-based method to intrusion detection in medical cyber-physical systems. Through experiments, they found that the method could deal with more covert attacks with a high detection rate. Hosseini et al. [4] designed a method based on multi-criteria linear programming and particle swarm optimization, performed experiments on KDD CUP 99, and found that it had obvious advantages in accuracy and computing time. Wei et al. [5] used different neural networks to obtain the characteristics of the data for detection and carried out experiments on DARPA 1998 and ISCX2012. The results showed that the method had a good detection rate. Dubey et al. [6] designed a hybrid method based on K-means, naive Bayes, and a back-propagation (BP) neural network and carried out experiments on KDD CUP 99 to verify its performance.
At present, in the face of massive data, the performance of detection and recognition is not good enough and is greatly affected by the size of the data. Intelligent methods such as deep learning have good detection ability for multi-dimensional dynamic network data; therefore, this paper used deep learning to detect and recognize abnormal data and verified the reliability of the method. This work contributes to further improving the ability to detect and recognize abnormal data and to realizing network security.

Feature extraction based on sparse autoencoder
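An autoencoder (AE) [7] is a deep learning network structure. Let the input of the encoder be $I$, the middle layer be $Z$, and the output be $O$. The purpose of the AE is to make $I \approx O$. In this process, the output of the encoder can be written as:
$$Z = f_I(W I + b_I).$$
The output of the decoder can be written as:
$$O = g_Z(W^{T} Z + b_Z).$$
Here $f_I$ and $g_Z$ are activation functions, $W$ is the weight, and $b_I$ and $b_Z$ are the forward and reverse biases. The AE is trained by minimizing the reconstruction error over these parameters:
$$\{W, b_I, b_Z\} = \arg\min J(W, b_I, b_Z),$$
where $J$ refers to the reconstruction error function. This study uses the mean square error loss function:
$$J(W, b_I, b_Z) = \frac{1}{2m} \sum_{i=1}^{m} \left\| O^{(i)} - I^{(i)} \right\|^2.$$
A sparse autoencoder (SAE) [8] is obtained by adding a sparsity constraint to the AE, which pushes the outputs of the middle-layer nodes as close to 0 as possible and thereby lets the network learn deeper features. The mean activation of node $j$ in the middle layer is:
$$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j^{(2)}\left(x^{(i)}\right),$$
where $m$ is the number of samples and $a_j^{(2)}(x)$ is the output activation value of node $j$ for input $x$. To make $\hat{\rho}_j$ as close as possible to 0, a small constant $\rho$ close to 0 is introduced as the sparsity parameter, and the Kullback-Leibler (KL) divergence is used as a regularization constraint on the network. The global loss function of the network is written as:
$$J_{sparse}(W, b_I, b_Z) = J(W, b_I, b_Z) + \sum_{j=1}^{s_2} \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right), \quad \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}_j},$$
where $s_2$ refers to the number of neurons in the middle layer.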
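As a rough illustration of the loss described in this section, the following Python (NumPy) sketch evaluates a sparse-autoencoder objective for one batch of data. It is not the authors' MATLAB implementation. The sigmoid activations, the tied decoder weights (`W` reused in transposed form), the unit weight on the sparsity penalty, the default `rho=0.05`, and all function names are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_bernoulli(rho, rho_hat):
    """Elementwise KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))

def sparse_ae_loss(X, W, b_I, b_Z, rho=0.05):
    """Return (loss, rho_hat) for inputs X of shape (m, n).

    W has shape (s2, n), b_I has shape (s2,), b_Z has shape (n,).
    Assumes sigmoid activations and tied weights (illustrative choices).
    """
    Z = sigmoid(X @ W.T + b_I)                     # encoder output, shape (m, s2)
    O = sigmoid(Z @ W + b_Z)                       # decoder output, shape (m, n)
    mse = np.sum((O - X) ** 2) / (2 * X.shape[0])  # mean square reconstruction error
    rho_hat = Z.mean(axis=0)                       # mean activation of each middle node
    sparsity = kl_bernoulli(rho, rho_hat).sum()    # KL penalty summed over s2 nodes
    return mse + sparsity, rho_hat

rng = np.random.default_rng(0)
X = rng.random((8, 5))                    # 8 toy samples with 5 features in [0, 1)
W = rng.normal(scale=0.1, size=(3, 5))    # 3 middle-layer nodes
loss, rho_hat = sparse_ae_loss(X, W, np.zeros(3), np.zeros(5))
print(loss, rho_hat)
```

In practice the loss would be minimized with a gradient-based optimizer; the sketch only shows how the reconstruction term and the KL sparsity term combine.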

Detection and recognition based on extreme learning machine
In the learning process, an extreme learning machine (ELM) [9] only needs to calculate the output weights to achieve the desired effect, which gives it a high learning speed [10]. For a given training set $\{x_i, y_i\}_{i=1}^{N}$ and a hidden layer with $L$ nodes, the output of the network is
$$o_j = \sum_{i=1}^{L} \beta_i \, g(W_i \cdot x_j + b_i), \quad j = 1, \dots, N,$$
where $g(x)$ is an activation function, $W_i$ is an input weight, $\beta_i$ is an output weight, and $b_i$ is a bias.
The objective of the network is to minimize the output error:
$$\min_{\beta} \sum_{j=1}^{N} \left\| o_j - y_j \right\|.$$
In matrix form, this can be expressed as $H\beta = T$, where $H$ is the output matrix of the hidden layer, $T$ is the expected output, and $\beta$ is the output weight. The solution is:
$$\hat{\beta} = H^{+} T,$$
where $H^{+}$ is the Moore-Penrose generalized inverse of $H$ [11].
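The closed-form solution above can be sketched in a few lines of Python (NumPy). This is a minimal illustration, not the authors' MATLAB code: the `tanh` activation, the Gaussian initialization of the random input weights and biases, the toy two-class data, and the names `elm_fit` and `elm_predict` are all assumptions made for the example.

```python
import numpy as np

def elm_fit(X, T, L, rng):
    """Train an ELM with L hidden nodes; only the output weights are solved for."""
    W = rng.normal(size=(X.shape[1], L))   # random input weights, never trained
    b = rng.normal(size=L)                 # random hidden biases, never trained
    H = np.tanh(X @ W + b)                 # hidden-layer output matrix, shape (N, L)
    beta = np.linalg.pinv(H) @ T           # beta = H^+ T (Moore-Penrose pseudoinverse)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Return the network output for each row of X."""
    return np.tanh(X @ W + b) @ beta

# Toy data: two well-separated clusters standing in for normal and abnormal traffic.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
labels = np.repeat([0, 1], 50)
T = np.eye(2)[labels]                      # one-hot expected output

W, b, beta = elm_fit(X, T, L=20, rng=rng)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
accuracy = (pred == labels).mean()
print(accuracy)
```

Because the hidden layer is fixed after random initialization, training reduces to one pseudoinverse computation, which is the source of the high learning speed mentioned above.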
In the SAE-ELM method designed in this paper, the dimension of the features is first reduced by the SAE. In a given sample set $\{(X_1, Y_1), (X_2, Y_2), \cdots, (X_i, Y_i)\}$, $X_i$ is the feature vector and $Y_i$ is the label vector. After dimensionality reduction, a new set $\{X_i, Y_i\}$ is obtained, which is then classified by the ELM.

Experimental setup
The experimental platform was MATLAB R2014a, running on 64-bit Windows 10.
The experimental data sets used were NSL-KDD and UNSW-NB2015. NSL-KDD is a benchmark data set [12,13] used to evaluate the classification of network traffic behavior. Each record has 41 features; there is one class of normal data and four classes of abnormal data: DoS, Probe, R2L, and U2R. Experiments were carried out with the 125973 records in KDDTrain, as shown in Table 1.
UNSW-NB2015 is a relatively new data set [14] that records the normal activities and attack behaviors of real modern networks [15]. Its categories are as follows: (1) normal: normal data; (2) fuzzers: attempts to suspend a program or network by feeding it randomly generated data; (3) analysis: attacks including port scanning and spam; (4) backdoors: access to the computer by bypassing the system security mechanism; (5) DoS: attacks that prevent users from using the server or network resources; (6) exploits: attacks on the host through vulnerabilities; (7) generic: attacks against cryptographic ciphers that work regardless of the cipher's structure; (8) reconnaissance: collection of information about the victim's host in order to attack it; (9) shellcode: attacks on the computer through software vulnerabilities; (10) worms: malicious code that copies itself and propagates to other computers. In the experiment, 219160 records from one subset were used, as shown in Table 2.
The evaluation indicators were the accuracy ($A_c$), the false alarm rate ($F_A$), and the missing report rate ($M_A$):
$$A_c = \frac{T_P + T_N}{T_P + T_N + F_P + F_N}, \quad F_A = \frac{F_P}{F_P + T_N}, \quad M_A = \frac{F_N}{T_P + F_N},$$
where $T_P$ refers to the number of abnormal data that are classified as abnormal, $T_N$ refers to the number of normal data that are classified as normal, $F_P$ refers to the number of normal data that are classified as abnormal, and $F_N$ refers to the number of abnormal data that are classified as normal.
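The following Python sketch shows how the three evaluation indicators can be computed from the four counts defined above. It assumes the usual definitions, accuracy = (TP + TN) / (TP + TN + FP + FN), false alarm rate = FP / (FP + TN), and missing report rate = FN / (TP + FN); the function name `detection_metrics` and the toy labels are hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """Compute accuracy, false alarm rate, and missing report rate.

    y_true and y_pred are sequences of 0 (normal) and 1 (abnormal).
    Both classes are assumed to be present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # abnormal classified as abnormal
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal classified as normal
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal classified as abnormal
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # abnormal classified as normal
    return {
        "Ac": (tp + tn) / (tp + tn + fp + fn),
        "FA": fp / (fp + tn),
        "MA": fn / (tp + fn),
    }

# Toy example: 4 abnormal and 6 normal records.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = detection_metrics(y_true, y_pred)
print(metrics)
```

In this toy case one abnormal record is missed and one normal record raises a false alarm, so the accuracy is 8/10, the false alarm rate is 1/6, and the missing report rate is 1/4.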

Experimental results
Firstly, the binary classification experiment was carried out on NSL-KDD, and the results were compared with the support vector machine (SVM) algorithm [16] and the K-means algorithm [17], as shown in Figure 1.
Figure 1 shows that the SAE-ELM method performed best in detecting and recognizing abnormal data. The accuracy $A_c$ of the K-means, SVM, and SAE-ELM algorithms was 74.64%, 86.48%, and 95.64%, respectively; the $A_c$ of the SAE-ELM algorithm was 21.02% higher than that of the K-means algorithm and 9.16% higher than that of the SVM algorithm. The $F_A$ of the K-means, SVM, and SAE-ELM algorithms was 4.67%, 1.89%, and 0.45%, respectively; the $F_A$ of the SAE-ELM algorithm was 4.22% lower than that of the K-means algorithm and 1.44% lower than that of the SVM algorithm. The $M_A$ of the SAE-ELM algorithm was 7.41% lower than that of the K-means algorithm and 4.84% lower than that of the SVM algorithm. These results verified that the SAE-ELM algorithm was reliable.
Then, a five-classification experiment was carried out on the NSL-KDD data set, as shown in Table 3. Table 3 shows that the SAE-ELM algorithm performed best in detecting and recognizing normal data but performed poorly on U2R. U2R had the fewest samples among the different kinds of data, so the algorithm was insufficiently trained on this class. The amount of normal data was the largest; thus, the accuracy of the detection and recognition of normal data was the highest (99.67%). The average $A_c$, $F_A$, and $M_A$ of the SAE-ELM algorithm were 98.93%, 0.17%, and 5.36%, respectively.
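The text does not state how the averages over the five classes were computed. One plausible reading, sketched below in Python, treats each class in turn as "abnormal" against all other classes (one-vs-rest), computes the three indicators per class, and takes their unweighted mean. The averaging scheme, the function names, and the toy labels are assumptions for illustration and are not confirmed by the source.

```python
def one_vs_rest_metrics(y_true, y_pred, classes):
    """Per-class accuracy, false alarm rate, and missing report rate (one-vs-rest)."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        tn = n - tp - fn - fp
        per_class[c] = {
            "Ac": (tp + tn) / n,
            "FA": fp / (fp + tn) if fp + tn else 0.0,  # guard against empty denominators
            "MA": fn / (tp + fn) if tp + fn else 0.0,
        }
    return per_class

def macro_average(per_class):
    """Unweighted mean of each indicator over the classes."""
    keys = ("Ac", "FA", "MA")
    return {k: sum(m[k] for m in per_class.values()) / len(per_class) for k in keys}

# Toy example with three of the NSL-KDD class names.
classes = ["Normal", "DoS", "Probe"]
y_true = ["Normal", "Normal", "DoS", "DoS", "Probe", "Probe"]
y_pred = ["Normal", "DoS", "DoS", "DoS", "Probe", "Normal"]
per_class = one_vs_rest_metrics(y_true, y_pred, classes)
average = macro_average(per_class)
print(per_class)
print(average)
```

An unweighted mean gives a rare class such as U2R the same influence as the normal class; a sample-weighted mean would instead be dominated by the largest classes.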
A binary classification experiment was carried out on UNSW-NB2015 and compared with SVM and K-means algorithms, as shown in Figure 2.
Figure 2 shows that the SAE-ELM method also performed best on the UNSW-NB2015 data set. The $A_c$ of the K-means, SVM, and SAE-ELM methods was 80.27%, 92.36%, and 99.42%, respectively. The $A_c$ of the SAE-ELM method was 19.15% higher than that of the K-means method and 7.06% higher than that of the SVM method. The $F_A$ of the SAE-ELM algorithm was 2.85% lower than that of the K-means algorithm and 0.95% lower than that of the SVM algorithm. The $M_A$ of the SAE-ELM method was 6.65% lower than that of the K-means algorithm and 4.06% lower than that of the SVM algorithm.
Finally, a multi-classification experiment was carried out on the UNSW-NB2015 data set using the SAE-ELM algorithm, as shown in Table 4. Table 4 shows that, as on the NSL-KDD data set, the SAE-ELM method detected and recognized the categories with more samples better. For attack types with fewer samples, $A_c$ was lower but still above 95%. On the UNSW-NB2015 data set, the average $A_c$ of the SAE-ELM algorithm was 98.88%, the average $F_A$ was 0.12%, and the average $M_A$ was 4.31%, showing that the SAE-ELM algorithm performed well.

Discussion
With the development of society, network security has received more and more attention [18].
As the data in networks become more massive, high-dimensional, and changeable, traditional detection and protection methods can no longer meet current network security needs [19]. Therefore, it is of great significance to find effective detection and identification methods for abnormal data [20]. Deep learning methods have been widely used in image recognition [21], speech recognition [22], intelligent translation [23], etc., and can achieve high classification accuracy on large databases. Therefore, this paper analyzed the application of deep learning to the detection and recognition of abnormal data to determine whether it can detect and recognize abnormal data quickly and accurately.
The experiments on the NSL-KDD and UNSW-NB2015 data sets showed that the $A_c$, $F_A$, and $M_A$ of the SAE-ELM method were better than those of the K-means and SVM algorithms. For the detection and recognition of abnormal data, only a high $A_c$ combined with a low $F_A$ and a low $M_A$ can meet actual needs.
First, in the binary classification experiments, the $A_c$ of the SAE-ELM method was above 95% on both data sets (95.64% on NSL-KDD and 99.42% on UNSW-NB2015), and the $F_A$ and $M_A$ were small. In the multi-classification experiment on the NSL-KDD data set, the average $A_c$, $F_A$, and $M_A$ of the SAE-ELM method were 98.93%, 0.17%, and 5.36%, respectively. On the UNSW-NB2015 data set, the average $A_c$, $F_A$, and $M_A$ of the SAE-ELM method were 98.88%, 0.12%, and 4.31%, respectively. Both sets of experiments showed that the SAE-ELM method performed well.
Although some results have been achieved in the recognition and detection of abnormal data, more research is still needed: (1) the usability of more deep learning methods should be studied; (2) actual network operation data should be collected for detection and identification.

Conclusion
Based on deep learning, this paper analyzed the detection and recognition of abnormal data, designed an SAE-ELM method, and carried out experiments on the NSL-KDD and UNSW-NB2015 data sets. The SAE-ELM method was found to have high accuracy and good performance in detecting and recognizing abnormal data, and it can be further promoted and applied in practice.
where $f_I$ and $g_Z$ are activation functions, $W$ is an initial weight, $b_I$ is a forward bias, and $b_Z$ is a reverse bias. AE minimizes reconstruction error by training $\{W, b_I, b_Z\}$:

Table 3
Results of the five-classification experiment on the NSL-KDD data set

Table 4
Results of the multi-classification experiment on the UNSW-NB2015 data set