DDoS Attack Detection via Multi-Scale Convolutional Neural Network

Abstract: Distributed Denial-of-Service (DDoS) attacks have caused great damage to networks in the big data environment. Existing detection methods suffer from low computational efficiency, a high false alarm rate and a high missing alarm rate. In this paper, we propose a DDoS attack detection method based on the network flow grayscale matrix feature (GMF) and a multi-scale convolutional neural network (CNN). According to the different characteristics of attack flows and normal flows in the IP protocol, a seven-tuple is defined to describe the network flow characteristics and is converted into a grayscale feature through binary conversion. Based on the GMF, convolution kernels of different spatial scales are used to improve the accuracy of feature segmentation, and global and local features of the network flow are extracted. A DDoS attack classifier based on the multi-scale convolutional neural network is then constructed. Experiments show that, compared with related methods, this method improves the robustness of the classifier and reduces the false alarm rate and the missing alarm rate.

algorithm to detect DDoS attacks accurately and effectively. The attack characteristics and detection optimization of the algorithm are analyzed.

DDoS attack feature extraction

3.1 Analysis of DDoS attack characteristics
A study of typical DDoS attack cases shows that DDoS attacks are characterized by widely distributed and strongly concealed attack sources [Petkovic, Basicevic, Kukolj et al. (2018)]. The specific characteristics are as follows.

Wide distribution of attack sources. The source IP addresses and the destination IP address of an attack have a "many-to-one" relationship. When a DDoS attack occurs, a large number of compromised hosts are controlled and simultaneously attack the specified target [Yuan, Li and Li (2017)]. The attacker can forge the source IP address of the attack packets continuously or randomly, which makes the distribution of source IP addresses more dispersed [Yu, Hu and Wang (2018); Cheng, Xu, Tang et al. (2018)]. As a result, the distributions of source IP addresses, source ports and destination ports all become more dispersed. During an attack, most of the packets sent by the attacker are not fragmented, so the number of packets with the IP flag 0x4000 (Don't Fragment) increases significantly.

Strong attack power. Because DDoS attacks use multiple attack sources to launch attacks at the same time, the traffic generated by each attack source is aggregated into a huge volume of attack traffic [Gao, Cheng, He et al. (2018)]. This exceeds the processing capacity of the attacked target within a short period of time and paralyzes the target system [Mamolar, Pervez, Calero et al. (2018); Cheng, Zhang, Tang et al. (2018); Mirkovic and Reiher (2004)]. In a SYN Flood, the attacker sends a large number of SYN requests, that is, packets with TCP flags 0x02 (SYN), and the server consumes a large amount of resources retransmitting SYN+ACK replies. As server resources are exhausted, the number of returned SYN+ACK packets (TCP flags 0x12, i.e. SYN 0x02 together with ACK 0x10) gradually decreases as the attack intensity increases [Cheng, Liu and Tang (2018); Wang, Ma, Zhang et al. (2016)].
During an attack, the distribution of TCP flags changes significantly. This paper proposes a grayscale matrix feature (GMF) obtained by analyzing network traffic. Based on this feature, a convolutional neural network classifier for DDoS detection is constructed, and the attack feature representation and detection performance of the algorithm are optimized.
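The flag values discussed in this subsection can be checked with simple bit masks. The sketch below is only an illustration: the input format, a list of `(tcp_flags, ip_flags)` integer pairs, and the function name `flag_summary` are assumptions made here, not part of the original method. It relies on the standard header layout, in which SYN is bit 0x02 and ACK is bit 0x10 of the TCP flags (so a SYN+ACK segment carries 0x12), and Don't Fragment corresponds to 0x4000 in the 16-bit IP flags and fragment offset field.

```python
# Standard TCP control bits and the IPv4 Don't Fragment bit.
TCP_SYN = 0x02
TCP_ACK = 0x10
IP_DF = 0x4000  # 16-bit flags + fragment offset field with only DF set


def flag_summary(packets):
    """Count SYN-only, SYN+ACK and Don't Fragment packets.

    `packets` is a hypothetical list of (tcp_flags, ip_flags) integer pairs.
    """
    summary = {"syn_only": 0, "syn_ack": 0, "dont_fragment": 0}
    for tcp_flags, ip_flags in packets:
        syn = bool(tcp_flags & TCP_SYN)
        ack = bool(tcp_flags & TCP_ACK)
        if syn and not ack:
            summary["syn_only"] += 1
        elif syn and ack:
            summary["syn_ack"] += 1
        if ip_flags & IP_DF:
            summary["dont_fragment"] += 1
    return summary


# Two SYN requests, one SYN+ACK reply and one plain ACK.
sample = [(0x02, 0x4000), (0x02, 0x4000), (0x12, 0x0000), (0x10, 0x4000)]
print(flag_summary(sample))
```

Tracking how these counts evolve per sampling window gives a rough view of the shift in flag distribution that the text attributes to SYN Flood traffic.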

Feature extraction rule
Given a network flow $F$ with $n$ sampled IP packets, we define each IP packet as $\langle t_i, s_i, d_i, sp_i, dp_i, size_i, tf_i, if_i \rangle$, where $t_i$ denotes the arrival time of packet $i$, $s_i$ and $d_i$ denote its source IP address and destination IP address, $sp_i$ and $dp_i$ denote its source port and destination port, $size_i$ denotes its packet size, and $tf_i$ and $if_i$ denote its TCP flags and IP flags respectively. The following rules are applied to these $n$ packets.

(1) Binary conversion. In order to preserve each original attribute of the network flow $F$, we perform number system conversion on $s_i$, $d_i$, $sp_i$, $dp_i$, $size_i$, $tf_i$ and $if_i$. For the hexadecimal conversion, the bit-weight conversion method [Suzuki and Murayama (1985)] is used: any hexadecimal number can be written as a sum of its digits weighted by powers of 16. For example, a hexadecimal integer $N$ with digits $a_{m-1} \ldots a_1 a_0$ can be expressed by the following formula:

$$N = \sum_{k=0}^{m-1} a_k \times 16^{k} \qquad (1)$$

According to formula (1), we convert $s_i$, $d_i$, $sp_i$, $dp_i$, $size_i$, $tf_i$ and $if_i$ to binary data respectively.
(2) Formal conversion. Because the fields of the network flow have inconsistent numbers of bits after being converted into binary, the digits are formally converted to a fixed length $L$. Here $L$ is the threshold, i.e. the target length of $(data_i)_2$, where $(data_i)_2$ represents the binary form of $data_i$ and $\mathrm{len}[(data_i)_2]$ represents the number of digits of the binary data. IP addresses and port numbers are represented as 32-bit binary values, so the threshold is set to $L=32$ and the source IP address $s_i$, destination IP address $d_i$, source port $sp_i$ and destination port $dp_i$ of each packet are converted to 32-bit binary data. Statistically, the packet length is less than 4096 bytes ($2^{12}$ bytes), so the threshold is set to $L=12$ and the packet size $size_i$ is converted to 12-bit binary data. Because the IP flags and TCP flags are hexadecimal data in the given dataset, the threshold is set to $L=16$ and the TCP flags $tf_i$ and IP flags $if_i$ are converted to 16-bit binary data.
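The binary and formal conversion rules can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary keys, the helper names and the field order (taken from the tuple definition) are assumptions, and shorter values are assumed to be left-padded with zeros up to the threshold $L$. With the thresholds stated in the text, each packet becomes a row of 4 × 32 + 12 + 2 × 16 = 172 bits.

```python
# Bit widths from the text: 32 for IPs and ports, 12 for size, 16 for flags.
FIELD_WIDTHS = [("s", 32), ("d", 32), ("sp", 32), ("dp", 32),
                ("size", 12), ("tf", 16), ("if", 16)]


def to_fixed_bits(value, width):
    """Return `value` as a list of `width` bits, left-padded with zeros."""
    if value < 0 or value >= 1 << width:
        raise ValueError(f"{value} does not fit in {width} bits")
    return [int(bit) for bit in format(value, f"0{width}b")]


def ip_to_int(dotted):
    """Convert dotted-quad IPv4 text to an integer (helper for the example)."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def packet_to_row(packet):
    """Concatenate the seven padded fields into one row of bits."""
    row = []
    for name, width in FIELD_WIDTHS:
        row.extend(to_fixed_bits(packet[name], width))
    return row


# Hypothetical packet used only to exercise the functions.
packet = {"s": ip_to_int("192.168.0.1"), "d": ip_to_int("10.0.0.5"),
          "sp": 51234, "dp": 80, "size": 60, "tf": 0x02, "if": 0x4000}
row = packet_to_row(packet)
print(len(row))
```

The arrival time $t_i$ is deliberately left out of the row, since the text converts only the seven remaining fields and uses time for sampling.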
(3) Sampling by time. The number system conversion above yields the binary form of the network flow, $F' = \langle t_i, s_i, d_i, sp_i, dp_i, size_i, tf_i, if_i \rangle$, $i = 1, 2, \ldots, n$.

Definition 1. Based on the binary representation, the network flow data is sampled by unit time $\Delta t$, with packet sampling time $T \in (0, N)$ and $\Delta t$ = 0.01 s, 0.05 s or 0.1 s, and we extract the packet set (PS).

To analyze the state characteristics of the PS more efficiently, statistics on the network flow $F'$ per unit time are collected and the law of network traffic generation is analyzed. A DDoS attack is a process in which an attacker uses a large number of forged source IP addresses to send useless packets to the victim host, consuming the resources of the target host. Therefore, when a DDoS attack occurs, a large number of false source IP addresses is generated per unit time: the number of source IP addresses increases, while the destination IP addresses are concentrated on a few targets. The number of distinct destination ports increases abnormally when useless packets are sent from the attack source IPs to multiple ports of the target host within a unit time.
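Since the formal definition of the packet set is not reproduced here, the following sketch shows one plausible reading of sampling by unit time: packets are grouped into consecutive windows of length $\Delta t$ measured from the first arrival. The function name, the `(arrival_time, payload)` input format and the choice to skip empty windows are assumptions made for illustration.

```python
from collections import defaultdict


def split_into_packet_sets(packets, delta_t):
    """Group (arrival_time, payload) pairs into windows of length delta_t.

    Windows are counted from the earliest arrival time; windows that
    contain no packet are skipped in the returned list.
    """
    if delta_t <= 0:
        raise ValueError("delta_t must be positive")
    if not packets:
        return []
    start = min(t for t, _ in packets)
    windows = defaultdict(list)
    for t, payload in packets:
        index = int((t - start) / delta_t)  # window number of this packet
        windows[index].append(payload)
    return [windows[k] for k in sorted(windows)]


# Hypothetical arrival times in seconds, sampled with delta_t = 0.05 s.
packets = [(0.00, "a"), (0.01, "b"), (0.04, "c"), (0.06, "d"), (0.12, "e")]
packet_sets = split_into_packet_sets(packets, 0.05)
print(packet_sets)
```

Timestamps that fall exactly on a window boundary are sensitive to floating-point rounding, so a production version would more safely work with integer microseconds.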
In TCP transmission, attackers forge addresses to send SYN requests to servers, so a large number of packets with TCP flags 0x02 (SYN = 1) appear; at the same time, the server replies to the requester with a SYN+ACK packet (TCP flags 0x12, SYN = 1 and ACK = 1) for confirmation. Because the requester is a forged address, the server will not receive a response. As the attack traffic increases, the server consumes a large amount of resources handling these half-open connections, which eventually leads to a server crash. Statistics show that during an attack the proportion of packets with the IP flag 0x4000 (Don't Fragment) sent by the attacker increases significantly.

We extract the GMF based on the above rules, and the network traffic eigenvalues are obtained for the corresponding sampling time $\Delta t$. Because convolutional neural networks require input data of a consistent size, we traverse the PS matrix and apply grayscale encoding to the network flow features, where $a_i$ is a network flow eigenvalue component. According to the above feature extraction rules, we extract the GMF.

As shown in Fig. 1, we can see the network flow distribution of normal flow. The horizontal axis of the matrix covers $s_i$, $sp_i$, $d_i$, $dp_i$, $size_i$, $tf_i$ and $if_i$, and the vertical axis represents the number of packets. There are more "one-to-one" resource access modes. The normal flow rate is relatively low, and source IP addresses are relatively concentrated. The source port and destination port numbers are also relatively concentrated. The value of the TCP flags is relatively stable, and the IP flags vary: the network flow contains a certain number of fragmented packets as well as a certain number of unfragmented packets. As shown in Fig. 2, the distribution of the GMF changes under a DDoS attack. There are many "many-to-one" resource access modes.
Under attack, the network flow exhibits a high traffic volume, and the number of packets collected within the current sampling time is large. Source IP addresses and source ports are scattered, destination IP addresses are concentrated and destination ports are scattered. The packet size distribution changes significantly. The proportion of unfragmented packets increases significantly. The distribution of TCP flags changes once the attack starts. The GMF is obtained from the original data by binary conversion. Representing the network flow attributes in matrix form reflects their distribution in time and space more accurately, and expresses the spatial and temporal relationships between packets more comprehensively.
Existing DDoS attack detection methods generally use statistical methods to extract network flow features. By analyzing the state changes of normal flow and attack flow, the characteristic sequence of the network flow is extracted by computing statistics over the related attributes of network packets. Statistics-based network flow features often lose information to a certain extent in the statistical process, so they cannot fully and accurately reflect the characteristics of the network flow. In summary, the proposed GMF reflects the distribution of packets and the spatial relationship between packets more accurately. Compared with statistical network flow feature sequences, the GMF has stronger expressive ability for spatial and temporal relationships.
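The grayscale encoding of a packet set can be illustrated with a short sketch. The encoding formula itself is not reproduced in this section, so the mapping used below (bit 0 to gray level 0, bit 1 to gray level 255) is an assumption, as are the function and parameter names.

```python
def encode_grayscale(bit_rows, on_value=255, off_value=0):
    """Map a packet set given as rows of bits to rows of gray levels.

    Each row is one packet in binary form; the 0 -> 0 and 1 -> 255
    mapping is an assumed encoding, not taken from the source.
    """
    if not bit_rows:
        return []
    width = len(bit_rows[0])
    matrix = []
    for row in bit_rows:
        if len(row) != width:
            raise ValueError("all packet rows must have the same bit length")
        if any(bit not in (0, 1) for bit in row):
            raise ValueError("rows must contain only 0/1 bits")
        matrix.append([on_value if bit else off_value for bit in row])
    return matrix


# Two hypothetical 4-bit packets, shortened for readability.
packet_set = [[0, 1, 1, 0], [1, 0, 0, 1]]
gmf = encode_grayscale(packet_set)
print(gmf)
```

Because every packet keeps its own row, no per-window aggregation takes place, which is the property the text contrasts with statistics-based features.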

Multi-scale convolutional neural network classifier

4.1 Matrix normalization
Since the number of packets sampled during each sampling time differs, we use a grayscale mapping method to map the grayscale features to a fixed size. In formula (5), width is the original width of the grayscale matrix, height is the original height of the grayscale image, and W and H are the thresholds of width and height respectively. According to statistics, the number of packets collected does not exceed 300 when the sampling time is 0.01 s, 800 when it is 0.05 s, and 1500 when it is 0.1 s. We therefore set the width threshold to W=172 for all three sampling times, which equals the total bit length of the seven converted fields (4 × 32 + 12 + 2 × 16), and set the height threshold to H=300, H=800 and H=1500 for sampling times of 0.01 s, 0.05 s and 0.1 s respectively. This yields the grayscale network flow features.

We split the data into a training set, a validation set and a test set in the proportions 0.8, 0.1 and 0.1. The training set is used for model fitting, the validation set is used to adjust the hyperparameters of the model and to evaluate its capability preliminarily, and the test set is used to evaluate the generalization ability of the model.

Based on the features extracted by the above methods, this paper uses a CNN to build the detection model. CNNs have become a research hotspot in the image field. Their weight-sharing structure makes them more similar to biological neural networks and reduces the complexity of the network model and the number of weights. This advantage is more obvious when the input of the network is a multi-dimensional image or matrix.
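The normalization to a fixed W × H size and the 0.8 / 0.1 / 0.1 split described in this subsection can be sketched as below. Formula (5) is not reproduced in the text, so zero-padding of missing rows and columns (with cropping as a safeguard) is an assumption, and the function names are hypothetical.

```python
def normalize_matrix(matrix, width=172, height=300, fill=0):
    """Crop or pad a 2D list of gray levels to exactly height x width."""
    rows = []
    for row in matrix[:height]:
        clipped = list(row[:width])
        rows.append(clipped + [fill] * (width - len(clipped)))
    # Append all-fill rows when fewer than `height` packets were sampled.
    rows.extend([fill] * width for _ in range(height - len(rows)))
    return rows


def split_dataset(samples, train=0.8, val=0.1):
    """Split samples into train / validation / test parts in order."""
    n_train = int(len(samples) * train)
    n_val = int(len(samples) * val)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])


# Three sampled packets padded to the 0.01 s thresholds W=172, H=300.
small = [[255] * 172 for _ in range(3)]
fixed = normalize_matrix(small, width=172, height=300)
print(len(fixed), len(fixed[0]))

train_set, val_set, test_set = split_dataset(list(range(10)))
print(len(train_set), len(val_set), len(test_set))
```

In practice the samples would usually be shuffled before splitting; the ordered split here only keeps the example deterministic.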
Images and matrices can be used directly as the input of the network, which avoids the complicated feature extraction and data reconstruction process of traditional recognition algorithms. A convolutional network is a multi-layer perceptron specially designed to recognize two-dimensional shapes, and its structure has some invariance to translation, scaling and other forms of deformation. A typical CNN alternates convolutional layers and pooling layers, and the last few layers near the output are usually fully connected. Training a convolutional neural network learns the network parameters, such as the convolution kernel parameters and the interconnection weights of the convolution layers. Prediction computes the category label from the input image and the network parameters.

The GMF proposed here is an arrangement of high-dimensional matrices, and convolutional neural networks perform well on high-dimensional matrix processing, so we use the GMF to train the CNN model. Different convolution kernel sizes are determined according to the bit lengths of the feature components in the GMF: because IP addresses and port numbers are 32-bit binary data, the packet length is 12-bit binary data, and the TCP flags and IP flags are 16-bit binary data, we use convolution kernels of different scales for feature extraction. We put the matrices into the convolution layer:

$$x_j^l = f(u_j^l) \qquad (6)$$

$$u_j^l = \sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l \qquad (7)$$

In formula (6) and formula (7), $u_j^l$ is the net activation of the $j$-th channel of convolution layer $l$, obtained by convolving the output feature maps $x_i^{l-1}$ of the previous layer, summing the results and adding an offset; $x_j^l$ is the output of the $j$-th channel of convolution layer $l$; and $f(\cdot)$ is an activation function such as the sigmoid or tanh function.
$M_j$ represents the subset of input feature maps used to calculate $u_j^l$, $k_{ij}^l$ is the convolution kernel matrix, and $b_j^l$ is the bias added to the convolution feature map. For an output feature map $x_j^l$, the convolution kernel $k_{ij}^l$ corresponding to each input feature map $x_i^{l-1}$ may be different. "*" is the convolution operator.
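The computation of one output channel of a convolution layer, as described above, can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, no padding and a stride of 1 are assumed, and the sliding-window sum is implemented without flipping the kernel, as is common in CNN implementations.

```python
import math


def sigmoid(z):
    """Logistic activation, one of the choices for f mentioned in the text."""
    return 1.0 / (1.0 + math.exp(-z))


def correlate2d_valid(x, k):
    """Slide kernel k over x (no padding, stride 1) and sum the products."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + a][c + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv_channel(inputs, kernels, bias, f=sigmoid):
    """Output channel j: f(sum over input maps of x_i * k_ij, plus b_j)."""
    maps = [correlate2d_valid(x, k) for x, k in zip(inputs, kernels)]
    out_h, out_w = len(maps[0]), len(maps[0][0])
    return [[f(sum(m[r][c] for m in maps) + bias) for c in range(out_w)]
            for r in range(out_h)]


# One 3x3 input map, one 2x2 kernel, identity activation for readability.
x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
k = [[1, 0], [0, 1]]
print(conv_channel([x], [k], bias=1, f=lambda z: z))
```

Passing several input maps and kernels to `conv_channel` corresponds to summing over the subset $M_j$ of input feature maps.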
Then we put the output into the pooling layer:

$$x_j^l = f(u_j^l) \qquad (8)$$

$$u_j^l = \beta_j^l \, \mathrm{down}(x_j^{l-1}) + b_j^l \qquad (9)$$

In formula (8) and formula (9), $u_j^l$ is the net activation of the $j$-th channel of pooling layer $l$, obtained by pooling the output feature map $x_j^{l-1}$ of the previous layer and adding an offset; $\beta_j^l$ is the weighting factor of the pooling layer, $b_j^l$ is the offset of the pooling layer, and $\mathrm{down}(\cdot)$ is the pooling function. The pooling function divides the input feature map $x_j^{l-1}$ into multiple non-overlapping $n \times n$ image blocks with a sliding window. The pixels in each image block are then summed, averaged or maximized, so the output image is reduced by a factor of $n$ in both dimensions.
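The pooling step can be sketched in the same style. The function names are hypothetical, and rows or columns that do not fill a complete n × n block are simply dropped, which is an assumption of this sketch.

```python
def down(x, n, mode="max"):
    """Pool non-overlapping n x n blocks of x by max, mean or sum."""
    reducers = {"max": max,
                "mean": lambda block: sum(block) / len(block),
                "sum": sum}
    if mode not in reducers:
        raise ValueError(f"unknown pooling mode: {mode}")
    reduce_block = reducers[mode]
    out = []
    for r in range(0, len(x) - n + 1, n):
        row = []
        for c in range(0, len(x[0]) - n + 1, n):
            block = [x[r + a][c + b] for a in range(n) for b in range(n)]
            row.append(reduce_block(block))
        out.append(row)
    return out


def pool_channel(x, n, beta=1.0, bias=0.0, f=lambda z: z, mode="max"):
    """Pooling output for one channel: f(beta * down(x) + bias)."""
    return [[f(beta * v + bias) for v in row] for row in down(x, n, mode)]


x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(down(x, 2, "max"))
print(down(x, 2, "mean"))
print(pool_channel(x, 2, beta=2.0, bias=1.0))
```

With n = 2, a 4 × 4 map becomes 2 × 2, matching the statement that the output shrinks by a factor of n in both dimensions.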
In a fully connected network, the feature maps of all 2D images are concatenated into one-dimensional features that serve as the input, and the output of the fully connected layer $l$ is obtained by weighting the inputs and passing the result through the activation function.
$$x^l = f(u^l) \qquad (10)$$

$$u^l = w^l x^{l-1} + b^l \qquad (11)$$

In formula (10) and formula (11), $u^l$ is the net activation of the fully connected layer $l$, obtained by weighting the output $x^{l-1}$ of the previous layer and adding an offset; $w^l$ is the weight coefficient of the fully connected network and $b^l$ is the offset of the fully connected layer $l$. The convolution model used in this paper consists of two convolutional layers, two pooling layers, two local layers and a softmax layer.
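Formulas (10) and (11), together with the flattening step and the final softmax layer, can be sketched as follows. The function names and the tiny weight values are illustrative assumptions; the sketch shows a single fully connected layer rather than the complete model.

```python
import math


def flatten(feature_maps):
    """Concatenate all 2D feature maps into one 1D feature vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def dense(x, w, b, f):
    """Fully connected layer: f(w x + b), with w given as a list of rows."""
    return [f(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]


def softmax(z):
    """Turn scores into class probabilities (max-shifted for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


features = flatten([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
scores = dense([1, 2], [[1, 0], [0, 1]], [0.5, -0.5], lambda z: z)
probs = softmax(scores)
print(features, scores, probs)
```

For a two-class detector, the two softmax outputs can be read as the estimated probabilities of normal and attack traffic.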

Figure 3: Construction of the multi scale GMF-CNN Model
As shown in Fig. 3, we optimized the convolution kernel size by replacing three 3*3 convolution kernels with multi-scale 4*4, 8*8 and 16*16 kernels to construct the multi-scale GMF-CNN model. Because the source IP, destination IP, source port number and destination port number are 32-bit binary data, the 4*4 and 8*8 convolution kernels adapt to the data format better. In addition, the 16*16 convolution kernel can reduce the dimensionality of the data, further adapt to the data form, and improve the detection capability of the model.
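One possible reading of why these kernel sizes suit the data format is that they divide the field widths evenly. The short calculation below checks this and also reports the feature map size after a single convolution; the stride of 1 and the absence of padding are assumptions, since the text does not state them, and the helper names are hypothetical.

```python
# Field bit widths as stated in the feature extraction rules.
FIELD_WIDTHS = {"source IP": 32, "destination IP": 32, "source port": 32,
                "destination port": 32, "packet size": 12,
                "TCP flags": 16, "IP flags": 16}
KERNEL_SIZES = (4, 8, 16)


def aligned_kernels(width, kernel_sizes=KERNEL_SIZES):
    """Return the kernel sizes that divide a field width evenly."""
    return [k for k in kernel_sizes if width % k == 0]


def valid_output_size(height, width, k, stride=1):
    """Feature map size after a k x k convolution without padding."""
    return ((height - k) // stride + 1, (width - k) // stride + 1)


for name, width in FIELD_WIDTHS.items():
    print(name, width, aligned_kernels(width))

# Output size for the 0.01 s input (H=300, W=172) under each kernel size.
for k in KERNEL_SIZES:
    print(k, valid_output_size(300, 172, k))
```

Under this reading, the 32-bit and 16-bit fields align with all three kernel sizes, while the 12-bit packet size aligns only with the 4*4 kernel.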

Dataset and evaluation criteria
The experiments were run on a machine with 8 GB of memory and an i5 processor, under 64-bit Windows 10 with Python 3.6.2 and Anaconda 4.2.0 (64-bit). The experiments use the CAIDA "DDoS Attack 2007" dataset, which contains approximately one hour of anonymized traffic from a distributed denial-of-service (DDoS) attack on August 4, 2007. This type of attack attempts to block access to the target server by consuming the computing resources on the server and all of the bandwidth of the network connecting the server to the Internet. The total size of the dataset is 2 GB. The attack started at about 21:14 and caused the network load to grow rapidly, from about 200 kilobits per second to 80 megabits per second. The one-hour trace is divided into 5-minute files stored in PCAP format.

To judge the effectiveness of the proposed detection method, we use three evaluation indicators: detection rate (DR), false alarm rate (FR) and error rate (ER). Let TP be the number of normal samples that are correctly marked, TN the number of attack samples that are correctly marked, FN the number of attack samples that are incorrectly marked, and FP the number of normal samples that are incorrectly marked. The detection rate is the probability that an actual attack is detected. The false alarm rate is the proportion of normal samples that are judged to be attacks. The error rate is the probability that a sample is wrongly judged.
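The three indicators can be computed directly from the four counts. The expressions below are derived from the verbal definitions above, using this paper's convention that TP and FP count normal samples while TN and FN count attack samples; they are a reading of those definitions rather than formulas quoted from the source, and the function name is hypothetical.

```python
def detection_metrics(tp, tn, fn, fp):
    """Return (DR, FR, ER) under the paper's labelling convention.

    tp / fp: normal samples marked correctly / incorrectly.
    tn / fn: attack samples marked correctly / incorrectly.
    """
    dr = tn / (tn + fn)                    # share of attacks detected
    fr = fp / (tp + fp)                    # share of normal samples flagged
    er = (fn + fp) / (tp + tn + fn + fp)   # share of all samples misjudged
    return dr, fr, er


# Hypothetical counts: 100 normal samples and 100 attack samples.
dr, fr, er = detection_metrics(tp=90, tn=95, fn=5, fp=10)
print(f"DR={dr:.3f} FR={fr:.3f} ER={er:.3f}")
```

Note that this convention is the reverse of the more common one in which "positive" denotes the attack class, so the counts must be mapped carefully when comparing with other work.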

Comparison of experimental results
To verify the detection ability of the proposed GMF combined with the multi-scale convolutional neural network, the following feature and algorithm comparison experiments were carried out.

Comparison of features
(1) Detection rate comparison. We extract comparison features according to the methods described in Cheng et al. [Cheng, Zhang, Tang et al. (2018)]. According to the feature extraction rules in this paper, we use the FFV statistical feature values, a one-dimensional feature computed from the quintuple, for experimental comparison.

As shown in Fig. 4, when the sampling time is 0.01 s, the proposed GMF converges faster than the FFV statistical feature as the number of epochs increases, and it reaches a higher detection rate of about 94%, whereas the FFV feature reaches only 85%. Thus, when the sampling time is 0.01 s, the number of packets per unit time is smaller, and the multi-scale convolution model can extract more fine-grained features.

As shown in Fig. 6, when the sampling time is 0.1 s, the proposed GMF again converges faster than the FFV statistical feature as the number of epochs increases and reaches a higher detection rate of about 93%, while the FFV statistical feature oscillates noticeably as the number of iterations increases during training. The GMF therefore has a higher detection rate and better model adaptability.

As shown in Tab. 1, the proposed GMF has a higher detection rate, a lower false alarm rate and a lower total error rate under the multi-scale convolution model. Statistics-based features lose some information in the statistical steps, so the extracted features cannot fully reflect the characteristics of the network flow. The GMF matrix, which is obtained by binary preprocessing of the original data, arranges the binary data in the form of a high-dimensional matrix. It characterizes the network flow more clearly and better reflects the distribution of packet attributes.

Comparison of multi-scale model and CNN model
(1) Comparison of multi-scale model and CNN model when the sampling time is 0.01 s.

As shown in Fig. 9, when the sampling time is 0.1 s, the multi-scale optimized model converges faster than the conventional CNN model. The loss function of the multi-scale optimized model converges in the third epoch, while the loss function of the non-optimized model converges in the seventh epoch. We conclude that the multi-scale convolutional neural network model trains faster and converges more stably, so the multi-scale model adapts better to the GMF. Comparing the models at the three sampling times, the proposed multi-scale convolutional neural network model converges faster and is more stable during training.

Tab. 2 compares the performance of the GMF under the multi-scale CNN model and the CNN model. The proposed method has a higher detection rate than the other methods, reaching 94.87%, which shows that the multi-scale model has better detection performance for the proposed features. Compared with the CNN method, the proposed method has a lower false alarm rate and a lower total error rate. The CNN method performs well in extracting the features of multi-dimensional matrices, and by optimizing the parameters of the model we build a multi-scale kernel model that adapts better to the GMF matrix proposed in this paper.

Conclusion
To address the high false alarm rate and missing alarm rate of DDoS attack detection methods in the big data environment, we propose a DDoS attack detection method based on a convolutional neural network. Based on the GMF, convolution layers at different spatial scales are used to improve the segmentation accuracy, and the global and local features of the network flow are extracted to resist over-fitting and improve computational efficiency. The network flow isomorphism output of the fully connected layer is sent to the softmax classifier, which takes advantage of the contextual relationship of the features to improve classification accuracy. The classifier is trained with normal samples and DDoS attack samples to obtain optimal network parameters, and a DDoS attack classifier based on the multi-scale convolutional neural network is constructed. Experiments show that this method has higher accuracy than similar detection methods, reduces the false alarm rate and the missing alarm rate, and can effectively detect DDoS attacks in the big data environment.