A Network Traffic Classification Method Based on Graph Convolution and LSTM

In identifying normal and abnormal traffic flows, the Convolutional Neural Network (CNN) is currently the common choice for extracting spatial features of network traffic. Its limitation is that one-dimensional traffic flow data must be converted into a two-dimensional form, and the potential spatial correlation between traffic flows is not considered. To exploit this correlation, this paper proposes a classification method based on graph convolution and Long Short-Term Memory (LSTM). The traffic flow data is first preprocessed; a graph convolutional network then extracts the spatial features of the flow topology, and an LSTM extracts the temporal features. Finally, the performance of the algorithm is evaluated on the sampled UNSW-NB15 data set. Experimental results show that the proposed method can effectively extract the potential features of network traffic data and that it achieves better classification performance than other methods such as feature selection, bidirectional LSTM (BiDLSTM) and CNN-LSTM.

not considered. In addition to the relationship between the features within a traffic flow, there is also a correlation between traffic flows, such as the temporal correlation between current and past traffic, and the spatial correlation between flows that share the same source IP or the same destination IP. Designing a deep learning model with better feature extraction ability is therefore a worthwhile research problem. Based on this analysis, this paper proposes a network traffic classification method based on graph convolution and Long Short-Term Memory (LSTM). The method uses the topology extraction ability of the graph convolution model to extract the spatial features of network traffic data, and uses the LSTM model to extract its temporal features.
The main contributions of this paper are as follows:
(1) We propose a network traffic classification method based on graph convolution and LSTM, which can improve the accuracy of traffic classification, increase the detection rate of abnormal traffic, and reduce the false alarm rate of normal traffic.
(2) To evaluate the performance of the proposed network traffic classification model, we not only evaluate the overall metrics of the model, but also calculate the metrics for each class and compare them with feature selection methods and other deep learning models (such as CNN-LSTM, CNN, etc.). The evaluation uses the sampled UNSW-NB15 as the benchmark data set.
The rest of the paper is organized as follows. Section II reviews the related work in the field. Section III introduces the detailed construction process of the proposed traffic classification model. The experimental comparison and performance evaluation are presented in Section IV. Section V concludes the work and gives an outlook.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. RELATED WORK
In the field of deep learning, researchers have applied deep learning algorithms to network traffic classification, for example restricted Boltzmann machines for classifying DoS traffic [6], an Artificial Neural Network (ANN) for detecting malicious traffic [7], and deep belief networks for network traffic classification [8]. Network traffic data also has potential temporal and spatial features: the temporal feature is reflected in the relationship between current and past traffic flows, and the spatial feature is reflected in the topological correlation between traffic flows. These features therefore influence the recognition of normal and abnormal traffic. Researchers have applied CNN to the spatial feature extraction of network traffic with some success [9], [10]. Riyaz and Ganapathy [5] proposed a feature selection method based on conditional random fields and linear correlation coefficients to select the most contributing features, and then used a CNN model for further feature extraction to improve the performance of network traffic recognition. Xu et al. [11] proposed the LSTMs-AE model, which combines LSTM with the AutoEncoder (AE). The model uses LSTM's time series feature extraction ability and AE's feature representation learning ability to improve performance. Azizjon et al. [12] used a 1D-CNN model for supervised learning of network traffic temporal features and verified experimentally that it outperforms traditional machine learning models such as random forest and SVM. After preprocessing the original traffic data, Xu [13] used image processing technology to convert traffic data into grayscale images, and then applied a CNN to these images to extract the correlation between features.
Ling [14] processed the spatial features of the data with multiple CNNs using convolution kernels of different scales, and combined them with LSTM to extract temporal features. Imrana et al. [15] proposed the bidirectional LSTM (BiDLSTM) model for the classification of abnormal traffic, and verified that it performs better than LSTM and other models.
LSTM can effectively extract the time series features between traffic flows. Applying CNN to the extraction of traffic spatial features also improves performance to some extent, but CNN is better suited to Euclidean structured data such as images. Network traffic data is usually one-dimensional, and the spatial relationship between traffic flows is closer to a topological structure. The graph convolution model [16] has good feature extraction capability for topological structures and has been widely applied in several fields. Zhao et al. [17] combined a graph convolutional network with a Gated Recurrent Unit (GRU) to extract the temporal and spatial features of road networks and predict road traffic flow more accurately. Their results show better performance than traditional time series regression models such as ARIMA and SVR. Yao et al. [18] constructed a single text graph for a corpus based on word co-occurrence and document-word relations, and then trained a text graph convolutional network on that graph. This model outperformed the methods it was compared with.
The works above show that the application of the graph convolution model is still in an exploratory phase. In the field of network security, applying the graph convolution model to network traffic feature extraction therefore has important research significance.
In summary, many feature extraction methods have been proposed in recent years, but most of them still have some limitations, such as:
• The CNN model used for network traffic spatial feature extraction mainly considers the relationship between the features of network traffic data, and does not consider the spatial relationship between traffic flows.
• The experimental results of some methods are mainly evaluated on overall metrics, and need to be further verified on multiple metrics.
Based on the above studies and findings, we propose a network traffic classification method based on graph convolution and LSTM, and evaluate the performance of each class on multiple metrics. By using graph convolution and LSTM to extract temporal and spatial features of network traffic data, we find that the proposed method improves the recognition of normal and abnormal traffic to a certain degree compared with other methods.

A. SGC MODEL
Graph Convolutional Network (GCN) is widely used in learning graph representations [19]-[21] and can extract spatial features of topological structures. SGC (Simple Graph Convolution) [22] optimizes GCN by removing its nonlinear transformations, and greatly reduces the computational time complexity of the model through pre-calculation. This section introduces how SGC is applied to the classification problem. Let an undirected graph be denoted as $G = (V, A)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the node set of the graph and $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix of G, which is symmetric. The element $a_{ij}$ of A indicates whether there is an edge between the nodes $v_i$ and $v_j$; if the edge exists, $a_{ij} = 1$, otherwise $a_{ij} = 0$.
$D = \mathrm{diag}(d_1, d_2, \ldots, d_n) \in \mathbb{R}^{n \times n}$, with $d_i = \sum_{j} a_{ij}$, denotes the degree matrix of the nodes, which is a diagonal matrix.
Each node v i in the graph has a corresponding d-dimensional feature vector x i ∈ R d , then the feature matrix X ∈ R n×d contains the feature vector of n nodes, and each node in the graph belongs to a specific class. According to the adjacency matrix and the nodes of known class, then the class to which the node of the unknown class belongs can be predicted.
For the k-th graph convolutional layer, let $H^{(k-1)}$ denote the input of the layer and $H^{(k)}$ its output node representation, so that $H^{(0)} = X$, where X is the input to the first graph convolutional layer. In the feature propagation step of the k-th layer of SGC, the hidden feature representation $\bar{h}_i^{(k)}$ of node $v_i$ is a weighted average over the node itself and its local neighbors. The update rule is as follows:

$$\bar{h}_i^{(k)} \leftarrow \frac{1}{d_i + 1} h_i^{(k-1)} + \sum_{j=1}^{n} \frac{a_{ij}}{\sqrt{(d_i + 1)(d_j + 1)}} h_j^{(k-1)} \tag{1}$$

If there is an edge between node $v_i$ and $v_j$, then $a_{ij}$ is 1, and the feature representation of node $v_j$ affects node $v_i$. The coefficients in equation (1) can be expressed in matrix form as follows:

$$S = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \tag{2}$$

In equation (2), $\tilde{A} = A + I$, where $I \in \mathbb{R}^{n \times n}$ is the identity matrix, $\tilde{D}$ is the degree matrix of $\tilde{A}$, and S is the normalized adjacency matrix after adding self-loops. The operation of equation (1) can then be expressed for all nodes at once:

$$\bar{H}^{(k)} \leftarrow S H^{(k-1)} \tag{3}$$

where $H^{(k-1)}$ is the matrix formed by the feature representations of all nodes at the (k − 1)-th layer. After local smoothing, each layer has a learnable weight matrix $\Theta^{(k)}$. The original GCN applies the following nonlinear transformation to $\bar{H}^{(k)}$:

$$H^{(k)} \leftarrow \sigma\left(\bar{H}^{(k)} \Theta^{(k)}\right) \tag{4}$$

where σ is the activation function and $\Theta^{(k)}$ is the weight matrix of the k-th layer. In SGC, the nonlinear transformation is removed to speed up the calculation, leaving a linear transformation:

$$H^{(k)} \leftarrow \bar{H}^{(k)} \Theta^{(k)} = S H^{(k-1)} \Theta^{(k)} \tag{5}$$

Unrolling equation (5) over the k layers and collecting the weights into a single matrix $\Theta = \Theta^{(1)} \Theta^{(2)} \cdots \Theta^{(k)}$ gives:

$$H^{(k)} = S^{k} X \Theta \tag{6}$$

In this equation, S is determined by $\tilde{A}$ and $\tilde{D}$. $S^{k}$ can therefore be pre-calculated. This pre-calculation only involves the multiplication of sparse matrices and greatly reduces the time complexity of model training.
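The pre-calculation described above can be illustrated with a minimal sketch. It uses dense Python lists on a three-node toy graph purely for readability; the function names (`normalized_adjacency`, `sgc_features`) and the toy data are assumptions made for this illustration and are not taken from the paper, whose experiments rely on sparse matrices.

```python
import math

def normalized_adjacency(adj):
    """Return S = D~^(-1/2) (A + I) D~^(-1/2) as a dense list of lists."""
    n = len(adj)
    # A~ = A + I: every node gets a self-loop.
    a_tilde = [[1.0 if i == j else float(adj[i][j]) for j in range(n)]
               for i in range(n)]
    # d~_i = d_i + 1 is the row sum of A~.
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain dense matrix product, sufficient for a toy example."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def sgc_features(adj, x, k):
    """Compute S^k X by applying S to X k times (no weights, no nonlinearity)."""
    s = normalized_adjacency(adj)
    h = x
    for _ in range(k):
        h = matmul(s, h)
    return h

# Toy path graph 0 - 1 - 2 with a single feature per node (illustrative data).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
x = [[1.0], [0.0], [0.0]]
smoothed = sgc_features(adj, x, k=1)
```

Applying S to X repeatedly yields the same result as forming $S^{k}$ first and then multiplying by X; which order is cheaper depends on the graph size and the feature dimension.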

B. LSTM MODEL
As a special RNN, LSTM is mainly used to solve the long-term dependency problem of RNN [15], [23]. LSTM avoids the vanishing gradient problem by using a more elaborate hidden layer unit. The basic unit of LSTM is shown in Fig.1. What distinguishes the LSTM model is its three gate structures: the forget gate, the input gate and the output gate shown in Fig.1. The functions of these three gate structures are as follows:
(1) Forget gate: it controls whether the unit state is forgotten, that is, the state of the previous hidden unit is forgotten in the current LSTM unit with a certain probability, corresponding to the following equation:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{7}$$

where σ represents the sigmoid activation function, $W_f$ and $b_f$ are the corresponding weight and bias respectively, $h_{t-1}$ is the output of the hidden layer at time t − 1, and $x_t$ is the input at time t. After processing by the sigmoid function, the value of $f_t$ falls into the interval (0, 1), which represents the probability of forgetting the state of the previous hidden unit.
(2) Input gate: responsible for processing the input of the current sequence, corresponding to the following equations:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{8}$$

$$\tilde{C}_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{9}$$

where $W_i$ and $W_c$ are weight matrices, and $b_i$ and $b_c$ are biases. The input gate is divided into two parts, which use the sigmoid and tanh activation functions respectively, and the results of these two parts are multiplied to update the unit state.
(3) Output gate: responsible for outputting the hidden state $h_t$ at time t, as shown in the following equations:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{10}$$

$$h_t = o_t \odot \tanh(C_t) \tag{11}$$

where $W_o$ and $b_o$ are the corresponding weight and bias. It can be seen that $o_t$ is calculated from the previous hidden state $h_{t-1}$ and the input $x_t$ of this layer through the sigmoid function. $C_t$ is the unit state at time t, which is calculated by the following equation:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{12}$$

When classifying normal and abnormal traffic flows, there are spatial features (the topological correlation between traffic flows) and temporal features (the correlation between current and past traffic flows). To extract the potential features of traffic flows, this section proposes a network traffic classification model based on graph convolution and LSTM (SGC-LSTM) to improve the recognition of normal and abnormal traffic. Fig.2 shows the structure of the proposed SGC-LSTM model, which includes the SGC graph convolutional layer, the LSTM layer, a fully connected layer and an output layer. First, the original data is preprocessed, and the topological graph is constructed according to the correlations between traffic flows. The preprocessed data is then input into the SGC model to extract its spatial feature representation, and the output of SGC is input into the LSTM layer to extract the temporal feature representation. After the LSTM layer, a fully connected layer and an output layer are added for model training.
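As an illustration of the gate structure described in Section III-B, the following sketch runs a single LSTM time step with plain Python lists. It assumes the standard LSTM formulation in which each gate acts on the concatenation of $h_{t-1}$ and $x_t$; the helper names, the `params` dictionary layout and the toy values are hypothetical and chosen only to keep the example self-contained.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(w, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi
            for row, bi in zip(w, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: forget gate, input gate, output gate, state update."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(v) for v in affine(params["W_f"], params["b_f"], z)]
    i_t = [sigmoid(v) for v in affine(params["W_i"], params["b_i"], z)]
    c_hat = [math.tanh(v) for v in affine(params["W_c"], params["b_c"], z)]
    o_t = [sigmoid(v) for v in affine(params["W_o"], params["b_o"], z)]
    # New unit state: keep part of the old state, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    # Hidden state: output gate applied to the squashed unit state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy setup (illustrative): hidden size 1, input size 1, all weights zero,
# so every gate evaluates to 0.5 and the candidate state to 0.
params = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] for name in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step([1.0], [0.0], [2.0], params)
```

With all weights at zero the forget gate halves the previous unit state, which makes the effect of each gate easy to check by hand.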

1) INPUT DATA PROCESSING
For numerical features, different features are measured on different scales, so the data must be standardized to avoid the impact of scale. Assuming that there are m records in the data set, $X_{ij}$ represents the value of the j-th feature of the i-th record, where 1 ≤ i ≤ m. The features are standardized as follows:

$$X'_{ij} = \frac{X_{ij} - \mathrm{MEAN}_j}{\mathrm{STD}_j} \tag{13}$$

where $\mathrm{MEAN}_j$ represents the average value of the j-th feature in the data set, given by equation (14), and $\mathrm{STD}_j$ represents the standard deviation of the j-th feature, as shown in equation (15):

$$\mathrm{MEAN}_j = \frac{1}{m} \sum_{i=1}^{m} X_{ij} \tag{14}$$

$$\mathrm{STD}_j = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(X_{ij} - \mathrm{MEAN}_j\right)^2} \tag{15}$$

The input matrix can then be constructed. Assuming that there are n records in the training set, and the feature dimension of each record is d, the training feature matrix X that is input to the model can be expressed as follows:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix} \tag{16}$$

After constructing the feature matrix, the next step is to construct the adjacency matrix of the graph, that is, to establish the edge relationship between traffic flows. The features of network traffic flows include four fields, 'srcip', 'dstip', 'sport', and 'dsport', which indicate the source IP address, destination IP address, source port, and destination port respectively. A connection can be established between traffic flows based on these fields, and the experiments in this paper are based on the following four hypothetical rules for establishing a connection between traffic flows. For traffic flow A and flow B, if they meet one of the following rules, then an undirected edge is established between them.
The elements of matrix A satisfy a ij = a ji , 1 ≤ i, j ≤ n, where a ij and a ji represent whether there is a connection between the i-th and the j-th traffic flow. If it exists, then a ij = a ji = 1, otherwise, a ij = a ji = 0.
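The two preprocessing steps of this subsection can be sketched as follows. Two points are assumptions of this sketch rather than statements about the paper: the standard deviation is computed as a population standard deviation (dividing by m), and the connection rule used here (two flows are linked when they share the same 'srcip' or the same 'dstip') is a hypothetical stand-in, since the paper's four rules are not reproduced in this section. The function names and sample flows are likewise illustrative.

```python
import math
from collections import defaultdict

def standardize(rows):
    """Column-wise z-score (x - mean) / std, using the population std."""
    m = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / m for c in cols]
    stds = [math.sqrt(sum((v - mu) ** 2 for v in c) / m)
            for c, mu in zip(cols, means)]
    # Constant columns (std == 0) are mapped to 0 to avoid division by zero.
    return [[(v - mu) / sd if sd > 0 else 0.0
             for v, mu, sd in zip(r, means, stds)]
            for r in rows]

def build_edges(flows, keys=("srcip", "dstip")):
    """Hypothetical rule: link two flows that share a value on any key field."""
    edges = set()
    for key in keys:
        groups = defaultdict(list)
        for idx, flow in enumerate(flows):
            groups[flow[key]].append(idx)
        for members in groups.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    # Indices are stored as (smaller, larger): undirected edge.
                    edges.add((members[a], members[b]))
    return edges

# Illustrative flows; the addresses are made up for this example.
flows = [
    {"srcip": "10.0.0.1", "dstip": "10.0.0.9"},
    {"srcip": "10.0.0.1", "dstip": "10.0.0.8"},
    {"srcip": "10.0.0.2", "dstip": "10.0.0.8"},
    {"srcip": "10.0.0.3", "dstip": "10.0.0.7"},
]
edges = build_edges(flows)
scaled = standardize([[1.0, 5.0], [3.0, 5.0]])
```

Each pair (i, j) in `edges` corresponds to $a_{ij} = a_{ji} = 1$ in the adjacency matrix. Grouping flows by field value avoids comparing every pair of flows directly, although a large group still produces a number of edges that grows quadratically with its size.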

2) SGC-LSTM FEATURE EXTRACTION LAYER
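The SGC layer mainly performs local smoothing over nodes and their neighbors. First, the matrix S is constructed. It can be seen from equation (2) that S depends on the matrices $\tilde{A}$ and $\tilde{D}$, where $\tilde{A}$ is A plus an identity matrix and has the following form:

$$\tilde{A} = \begin{bmatrix} 1 & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & 1 & a_{23} & \cdots & a_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & a_{n3} & \cdots & 1 \end{bmatrix} \tag{18}$$

The diagonal elements of $\tilde{A}$ are always 1, which means that each traffic flow node is connected to itself. The matrix D has the following form:

$$D = \begin{bmatrix} d_1 & & & \\ & d_2 & & \\ & & \ddots & \\ & & & d_n \end{bmatrix} \tag{19}$$

$\tilde{D}$ is calculated as $\tilde{D} = D + I$, as shown in equation (20):

$$\tilde{D} = \begin{bmatrix} \tilde{d}_1 & & & \\ & \tilde{d}_2 & & \\ & & \ddots & \\ & & & \tilde{d}_n \end{bmatrix} \tag{20}$$

where $\tilde{d}_j = d_j + 1$, 1 ≤ j ≤ n. The form of S is then obtained by carrying out the multiplication, as shown in equation (21):

$$S = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} = \begin{bmatrix} \frac{1}{\tilde{d}_1} & \frac{a_{12}}{\sqrt{\tilde{d}_1 \tilde{d}_2}} & \cdots & \frac{a_{1n}}{\sqrt{\tilde{d}_1 \tilde{d}_n}} \\ \frac{a_{21}}{\sqrt{\tilde{d}_2 \tilde{d}_1}} & \frac{1}{\tilde{d}_2} & \cdots & \frac{a_{2n}}{\sqrt{\tilde{d}_2 \tilde{d}_n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{a_{n1}}{\sqrt{\tilde{d}_n \tilde{d}_1}} & \frac{a_{n2}}{\sqrt{\tilde{d}_n \tilde{d}_2}} & \cdots & \frac{1}{\tilde{d}_n} \end{bmatrix} \tag{21}$$

The output $\bar{X}$ of the k-th layer of SGC is used as the input of LSTM, and the output of the LSTM layer is obtained through the series of operations in the LSTM unit.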

3) FULLY CONNECTED LAYER AND TRAINING PROCESS
The input of the fully connected layer is the output of the LSTM layer, and the number of nodes in the fully connected layer is set to 32. This paper mainly investigates whether there is a spatial correlation between abnormal and normal traffic, so the experiments address the binary classification problem and the output layer uses the sigmoid activation function. The complete traffic classification algorithm based on the SGC-LSTM model is summarized in Algorithm 1.
The SGC-LSTM model is trained by minimizing the cross-entropy loss function. Assuming there are n samples, let $z_i$ represent the score of the i-th sample as a positive example, $y_i$ the true class of the i-th sample, and σ the sigmoid activation function. The cross-entropy loss function then has the following form:

$$L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \sigma(z_i) + (1 - y_i) \log\left(1 - \sigma(z_i)\right) \right] \tag{22}$$

In the training process, the RMSProp optimizer is used to optimize the parameters of the model. For the hyperparameters of the SGC-LSTM model, we conducted 100 iterations of training on the data set and selected the hyperparameters with the highest accuracy as the optimal parameters. The hyperparameter settings of the model are as follows: the number of SGC graph convolutional layers and the number of LSTM layers are both set to 3, the number of nodes in each LSTM layer is 32, the value of 'epoches' is set to 500, the learning rate is set to 0.01, and 'batch_size' is set to 64. Because over-fitting may occur in the training stage, a dropout of 0.1 is used for the LSTM layer and the fully connected layer.
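The binary cross-entropy loss on sigmoid outputs can be sketched in a numerically stable form as below. Averaging over the n samples is an assumption of this sketch, and the function name `bce_with_logits` is chosen here for illustration; the paper does not name its implementation.

```python
import math

def bce_with_logits(scores, labels):
    """Mean binary cross-entropy for raw scores z_i and labels y_i in {0, 1}."""
    total = 0.0
    for z, y in zip(scores, labels):
        # Algebraically equal to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))],
        # rewritten so that exp() never receives a large positive argument.
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(scores)

# A score of 0 means sigmoid(z) = 0.5, so each sample contributes log(2).
loss = bce_with_logits([0.0, 0.0], [1, 0])
```

The rewritten form matters in practice because evaluating `log(sigmoid(z))` directly can overflow or return `log(0)` for scores of large magnitude.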

D. MODEL EVALUATION
The KDD99 and NSL-KDD datasets have been used as benchmark datasets in the field of network security, and have made important contributions to its development [24]-[26]. However, many recent studies have shown that these data cannot fully reflect current network traffic and modern low-occupancy attacks. The UNSW-NB15 data set was created by the Australian Centre for Cyber Security (ACCS) and better reflects the current network environment. For this reason, the experiments in this paper use the UNSW-NB15 data set [27]. Because of the large number of graph convolution nodes and the limitation of machine memory, the experiments are carried out on a 20% stratified sample of the UNSW-NB15 training set and test set, which retains the proportion of sample classes in both sets. After sampling 20% of the data set, the number of samples in some attack classes is small. Therefore, the experiments are mainly verified on the binary classification problem. The distribution of normal and abnormal traffic flows in the training set and test set after sampling is shown in Table 1.

Algorithm 1 (steps 9-19)
 9: establish matrices A and D based on the connection rules
10: calculate S based on Ã and D̃; initialize H = S
11: for i from 2 to k do
12:     H = HS
13: end
14: get the output X̄ = HX
15: input X̄ into the LSTM layer
16: add a fully connected layer with 32 nodes
17: add a dropout with rate 0.1
18: compute the cross-entropy loss from z_i and y_i
19: update the parameters by RMSProp using the loss
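The stratified sampling step can be illustrated with a short sketch. It is one straightforward way to keep class proportions when drawing a fraction of the records, not the authors' sampling code; the function name, the seed and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(labels, fraction, seed=0):
    """Return sorted indices of a sample that keeps each class's share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for members in by_class.values():
        # Draw the same fraction from every class, without replacement.
        take = round(len(members) * fraction)
        chosen.extend(rng.sample(members, take))
    return sorted(chosen)

# Toy labels (illustrative): 50 normal (0) and 100 abnormal (1) records.
labels = [0] * 50 + [1] * 100
picked = stratified_sample(labels, 0.2)
```

Because each class is sampled separately, a class with very few records may contribute only a handful of samples, which is consistent with the paper's choice to evaluate the binary problem on the sampled data.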

1) CONFUSION MATRIX AND METRICS
The confusion matrix describes the number of samples in the data set that are correctly and incorrectly classified by the classifier, and is often used in classification problems [28]. Take the abnormal class as Positive and the normal class as Negative. Then the form of confusion matrix is shown in Table 2.
Accuracy refers to the ratio of samples whose predicted class is consistent with the actual class, which is expressed as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{23}$$

The precision of the abnormal class refers to the ratio of the true abnormal samples to all traffic records identified as abnormal, as shown in equation (24):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{24}$$

The recall of the abnormal class refers to the ratio of the number of abnormal records correctly classified as abnormal to all abnormal samples. It is also called the Detection Rate (DR), as shown in the following equation:

$$\mathrm{Recall} = \mathrm{DR} = \frac{TP}{TP + FN} \tag{25}$$

The F1-score is a comprehensive metric of precision and recall, expressed as follows:

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{26}$$

The false alarm rate (FAR) refers to the percentage of normal traffic classified as abnormal traffic, as shown in the following equation:

$$\mathrm{FAR} = \frac{FP}{FP + TN} \tag{27}$$
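The metrics defined above follow directly from the four confusion matrix counts, as the sketch below shows. The counts in the usage example are made up for illustration and are not results from the paper; the function name is likewise an assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """Metrics for the abnormal (positive) class from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called the detection rate (DR)
    f1 = 2 * precision * recall / (precision + recall)
    far = fp / (fp + tn)     # false alarm rate
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "far": far,
    }

# Illustrative counts only: 100 abnormal and 100 normal records.
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

The sketch does not guard against empty denominators, so it assumes that at least one sample is predicted positive and that both classes are present.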

2) ROC CURVE AND PR CURVE
The receiver operating characteristic (ROC) curve is a common tool for machine learning classification problems and can be used as a measure of classifier performance. This paper uses AUCROC to denote the area enclosed by the ROC curve and the x-axis. The Precision-Recall (PR) curve is also a common evaluation tool for classification problems. The ROC curve is not sensitive to class distribution, but the PR curve can capture the impact of class distribution on model performance [29], [30]. In this paper, $\mathrm{AUCPRC}_i$ denotes the area under the PR curve of the model on class i. The larger the value of $\mathrm{AUCPRC}_i$, the better the model performs on class i.
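One way to make the AUCROC value concrete is its rank interpretation: the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way; the function name and the toy scores are illustrative assumptions, and the pairwise loop is quadratic, so it is meant for small examples only.

```python
def auc_roc(scores, labels):
    """AUCROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Toy scores (illustrative): three of the four positive/negative pairs
# are ranked correctly.
auc = auc_roc([0.9, 0.8, 0.35, 0.3], [1, 0, 1, 0])
```

For data sets of the size used in the paper, a sorting-based computation or an established library routine would be the practical choice.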

IV. EXPERIMENTAL RESULTS AND ANALYSIS
First, the adjacency matrix of traffic flows is established according to the rules in Section III. The topological graph contains 51,534 nodes, which is the total number of training and test set samples. After calculation, 51,534 nodes create about 36.5 million undirected edges, and the number of elements with the value of 1 in the adjacency matrix is about 73 million, which are stored in the form of sparse matrix. During the training process, feature extraction of SGC layer and LSTM layer was carried out on the data of 35,068 nodes in the training set.
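The figures in this paragraph can be checked with a few lines of arithmetic. The sketch assumes that the edge count excludes self-loops and that a dense matrix would store 4-byte floats; both are assumptions made for illustration.

```python
nodes = 51_534
undirected_edges = 36_500_000  # "about 36.5 million" in the text

# Each undirected edge (i, j) sets both a_ij and a_ji to 1.
nonzeros = 2 * undirected_edges

# Share of adjacency matrix entries that are non-zero.
density = nonzeros / nodes ** 2

# Size of the same matrix stored densely with 4-byte floats, in GiB.
dense_gib = nodes ** 2 * 4 / 2 ** 30

print(f"non-zeros: {nonzeros:,}")
print(f"density: {density:.2%}")
print(f"dense float32 storage: {dense_gib:.1f} GiB")
```

Fewer than 3% of the entries are non-zero, which is why the adjacency matrix is kept in sparse form.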

A. EXPERIMENTAL RESULTS
The loss and accuracy curves of the SGC-LSTM model on the training set over 500 iterations are shown in Fig.3 (a) and Fig.3 (b) respectively. It can be seen from Fig.3 (a) Fig.4, and the metrics are shown in Table 3. As can be seen from Table 3, after feature extraction by the SGC-LSTM model, the F1-scores of the normal class and the abnormal class on the test set are both around 0.9, with little difference between them. The model has higher recall on the normal class and higher precision on the abnormal class. Fig.5 (a) and Fig.5 (b) show the PR curve of each class and the ROC curve of the XGBoost model on the new test set extracted by SGC-LSTM.
In Fig.5 (a), the AUCPRC values of the model on all classes are above 0.95, and the model performs better on class 1 (the abnormal class). The following sections present comparison experiments. First, the features extracted by the SGC-LSTM model are compared with the features selected by a feature selection method to verify the effectiveness of the proposed method. The proposed method is then compared with other deep learning methods.

B. COMPARISON TO FEATURE SELECTION METHOD
In order to verify the effectiveness of the SGC-LSTM model in the feature extraction of network traffic data, the experiments in this part mainly compare the performance of SGC-LSTM with feature selection method. The feature selection method in the experiment uses the Sigmoid_PIO algorithm proposed by Hadeel et al. [31]. The feature subset selected by this method contains a total of 13 features, and the feature information is shown in Table 4.
The original training set and test set are input into the trained SGC-LSTM model, and the hidden nodes of the SGC layer and LSTM layer are extracted as new feature representations of the training and test data. The t-SNE method [32] is used to map the new training set and test set features from the high-dimensional space to a two-dimensional plane, and the result is compared with that of the feature selection method. Fig.6 (a) and Fig.6 (b) show the visualization of the new training and test set features extracted by SGC-LSTM respectively. Fig.7 (a) and Fig.7 (b) show the visualization of the feature subsets of the original training set and test set under the feature selection method respectively. It can be seen from Fig.6 (a) that the normal class (green points) and the abnormal class (red points) overlap less on the two-dimensional plane, while there is a large amount of overlap between the two classes in Fig.7 (a). Similarly, there is significantly less overlap in Fig.6 (b) than in Fig.7 (b). Based on the four figures, the new training and test set data processed by SGC-LSTM are more clearly separated in feature space.
To compare the performance of SGC-LSTM and the feature selection method on metrics, the comparison experiment in this section inputs the new features extracted by SGC-LSTM and the feature subset constructed by the feature selection method into an XGBoost model for evaluation. Table 5 shows the comparison between the proposed SGC-LSTM and the feature selection method on the AUCPRC and AUCROC metrics. When XGBoost is used as the base classifier, the features extracted by SGC-LSTM are superior to those of the feature selection method on all metrics, with an improvement of about 1.5% on each. The comparison of accuracy, DR and FAR between SGC-LSTM and the feature selection method is shown in Table 6. As can be seen from the table, the accuracy of the SGC-LSTM model is about 5% higher than that of the feature selection method. There is little difference between the two methods in DR, but the FAR of the SGC-LSTM method is 53% lower than that of the feature selection method.

C. COMPARISON TO OTHER DEEP LEARNING MODELS
This section compares the proposed method with other commonly used deep learning network traffic feature extraction models such as CNN, BiDLSTM and CNN-LSTM to verify the performance of the proposed method.
The CNN-LSTM model is currently widely used in the classification of normal and abnormal traffic. CNN is used to extract spatial features of network traffic flows, and LSTM is used to extract temporal features. The model structure is shown in Fig.8.
The CNN part of the CNN-LSTM model [33] includes a convolutional layer 1, a pooling layer 2, a convolutional layer 3, a pooling layer 4, and a final fully connected layer. The LSTM part that follows includes two layers. The standalone CNN that is also compared in the experiment has the same structure as the CNN part of the CNN-LSTM model. In addition, this paper compares the proposed method with BiDLSTM [15], whose structure contains one embedding layer, two bidirectional LSTM layers, and two fully connected layers.
After extracting the new features of the original training set and test set from each model, the visualization results of the t-SNE method are shown in Fig.9. The performance of the four feature extraction models cannot be seen intuitively from the figure, and it needs to be compared with various classification metrics.
The base classifier is again XGBoost. The metrics of the four feature extraction models on the normal and abnormal classes are shown in Fig.10 and Fig.11 respectively.
As can be seen from the above figures, the recall of the abnormal class and the precision of the normal class of SGC-LSTM are lower than those of the CNN-LSTM model, but SGC-LSTM is better than the other three models on the remaining metrics. The CNN-LSTM method is better than the CNN and BiDLSTM methods on all metrics. The overall performance of the BiDLSTM model is slightly better than that of the CNN model. The comparison of the AUCPRC and AUCROC results of each model is shown in Table 7. The CNN model mainly relies on multiple convolution kernels to extract the local spatial features of traffic flows, and the BiDLSTM model mainly relies on the memory unit to extract the temporal features. These two models perform similarly on the three metrics in the table. The CNN-LSTM model has both CNN's spatial feature extraction capability and LSTM's temporal feature extraction capability. Compared with CNN and BiDLSTM, CNN-LSTM improves by about 0.6% on all metrics. However, the CNN model has certain limitations in extracting features from non-image structures. The SGC-LSTM model performs best on the three metrics, and compared with the CNN-LSTM model, its metrics are improved by about 0.4%. Table 8 shows the comparison of accuracy, DR and FAR of these four feature extraction models on the test set. Compared with CNN and BiDLSTM, the CNN-LSTM method has better performance. In terms of DR, the CNN-LSTM method is better than the other three models, reaching 90.96%, which is about 1.2% higher than the SGC-LSTM method. However, in terms of accuracy and FAR, the SGC-LSTM method performs best, and its FAR is about 43% lower than that of the CNN-LSTM method.

V. CONCLUSION
This paper studies the classification of network traffic, proposes rules for building the topology graph of network traffic, and proposes a network traffic classification method based on graph convolution and LSTM. The method first processes the data with the graph convolution layer to extract its spatial features, and then uses the LSTM model to extract its potential temporal features. On the sampled UNSW-NB15 data set, it is compared with feature selection and other commonly used deep learning methods (such as CNN, BiDLSTM and CNN-LSTM) to verify the performance and effectiveness of the proposed method. The experiments also have some shortcomings and leave room for optimization. When building a topological graph for network traffic data, the more nodes there are, the more undirected edges are established and the more matrix operations are involved, which places heavy demands on the memory size and computing power of the machine. This paper offers one approach to using the graph convolution model in a network traffic setting to explore the relationship between normal and abnormal traffic flows. The correlations between traffic flows can be explored further in future work.