Network traffic classification model based on attention mechanism and spatiotemporal features

Traffic classification is widely used in network security and network management. Early studies mainly focused on mapping network traffic to different unencrypted applications; little research has addressed the classification of encrypted application traffic, especially the underlying traffic of encrypted applications. To address these issues, this paper proposes a network encrypted traffic classification model that combines attention mechanisms and spatiotemporal features. The model first uses the long short-term memory (LSTM) method to analyze continuous network flows and find the temporal correlation features between them. Second, the convolutional neural network (CNN) method is used to extract the high-order spatial features of the network flow, and the squeeze-and-excitation (SE) module is then used to weight and redistribute these high-order spatial features to obtain the key spatial features of the network flow. Finally, through the above three stages of training and learning, fast classification of network flows is achieved. The main advantages of this model are as follows: (1) the mapping between network flows and labels is constructed automatically by the model, without manual feature engineering or human decision-making; (2) it has strong generalization ability and can quickly adapt to different network traffic datasets; and (3) it can handle encrypted applications and their underlying traffic with high accuracy. The experimental results show that the model can classify the network traffic of encrypted and unencrypted applications at the same time, and in particular improves the classification accuracy for the underlying traffic of encrypted applications. In most cases, the accuracy exceeds 90%.

analysis, signature matching, and deep learning methods have also been applied and studied in network traffic classification [5]. Machine learning algorithms enable the classification of network traffic by training models on known categories. These models can automatically identify and classify new traffic. Common algorithms used for this purpose include decision trees, support vector machines (SVM), and Naive Bayes classifiers. They use packet features such as source and destination IP addresses, packet length, and timestamps for classification. A network traffic classification method based on Naive Bayes classification was proposed in [6]. The authors first preprocessed the dataset by removing irrelevant features and normalizing values. They then applied a feature selection algorithm based on information gain to identify the most important features in the classification task. The Naive Bayes classifier was trained on the selected features and evaluated on the test set using several metrics, including accuracy, precision, recall, and F1 scores. The study showed that Naive Bayes classification is an effective method for classifying network traffic, and that feature selection is crucial for improving the performance of the algorithm.
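The pipeline described above (feature selection by information gain, then a Naive Bayes classifier evaluated on a held-out test set) can be sketched as follows. This is an illustrative reconstruction, not the code of [6]: the flow features are synthetic, and scikit-learn's `mutual_info_classif` is used as a stand-in for an information-gain score.

```python
# Hypothetical sketch of the pipeline in [6]: select features by an
# information-gain-style score, then train and evaluate Gaussian
# Naive Bayes. Data below is synthetic for illustration only.
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 400
# Three informative "flow features" (e.g., packet length, duration,
# byte count) plus five irrelevant noise columns.
informative = rng.normal(size=(n, 3))
labels = (informative.sum(axis=1) > 0).astype(int)  # two traffic classes
noise = rng.normal(size=(n, 5))
X = np.hstack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

# Feature selection: keep the 3 highest-scoring features.
score_fn = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_fn, k=3).fit(X_tr, y_tr)

# Train Naive Bayes on the selected features and evaluate.
clf = GaussianNB().fit(selector.transform(X_tr), y_tr)
acc = accuracy_score(y_te, clf.predict(selector.transform(X_te)))
print(f"test accuracy: {acc:.2f}")
```

On this synthetic data the selector reliably discards the noise columns, which mirrors the paper's finding that feature selection is crucial for classifier performance.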
The traffic behavior analysis method classifies traffic by analyzing the behavioral patterns of traffic. It can detect abnormal traffic behavior and identify network attacks and unusual activities. For example, by monitoring frequent connection attempts or unusual data transfer volumes from specific IP addresses, it is possible to determine whether traffic is a malicious activity. Signature matching methods use predefined rules or patterns to match traffic. By matching it against the signature of a known network attack or malicious behavior, it can determine whether the traffic belongs to a specific type of attack. This method is commonly used in intrusion detection systems (IDS) and intrusion prevention systems (IPS).
Deep learning traffic classification is a method that uses deep neural network models to automatically classify network traffic. A deep learning-based method for packet-based network traffic classification was proposed in [7]. The authors used a convolutional neural network (CNN) to extract features from the packets and a multilayer perceptron (MLP) to classify the traffic. The method was evaluated on a real-world dataset and was shown to be more effective than other machine learning algorithms.
From the research in references [6] and [7], it can be seen that the former approach can identify certain unknown or encrypted application traffic but relies on prior knowledge. The latter method does not require manual extraction of traffic features and can automatically construct the mapping relationship between network traffic and corresponding labels, thus eliminating the dependence on prior knowledge [8].
In recent years, the use of encryption technology in network communications has increased significantly to protect the privacy of user data. However, this has also led to a significant increase in encrypted traffic, which challenges traditional rule-based methods to effectively identify and classify it. Encryption transforms communication content into random ciphertext that is difficult to decipher in a short period of time. As a result, payload-based methods cannot perform accurate matching and detection.
Encrypted traffic ensures secure communication, but it also creates opportunities for malicious activities. This can increase false positives or false negatives in existing inspection methods, which poses a significant challenge for network regulators in accurately identifying and managing encrypted traffic. For example, consider an organization that relies heavily on VoIP communication for its day-to-day operations. Employees are also allowed to access video streaming platforms during their break times. Because network resources are limited, the organization needs to prioritize VoIP traffic to ensure high call quality. However, VoIP and video streaming traffic often use similar network ports and protocols, making it challenging to differentiate them based solely on packet headers. Existing traffic classification methods are not accurate enough to distinguish VoIP from video streaming traffic, which leads to suboptimal QoS and potential disruptions in VoIP calls.
In addition, because encryption converts traffic content into random ciphertext, the content cannot be directly inspected and analyzed. Traditional content-based malicious traffic detection methods may therefore fail to correctly identify some normal encrypted traffic, leading to false positives. For example, Dropbox is a file synchronization and sharing service that lets users upload and download files. Dropbox uses encryption to keep user data private and secure. However, its traffic behavior pattern may resemble some malicious activities (such as massive file transfers or abnormal file types), which may cause it to be misreported as malicious traffic.
In addition, Dropbox provides its services using shared IP addresses or domain names that may be associated with other malicious activity. When cybersecurity systems detect malicious traffic based on IP addresses or domain names, they may misclassify Dropbox traffic as malicious.
Finally, Dropbox users can upload and download large amounts of data, possibly at high frequency. Such large data transfers and frequent traffic can be misinterpreted as malicious behavior, especially if traffic- or frequency-based rules are used in network security systems.
From the problems faced by the two network applications above, it can be seen that there are still some challenging problems to be solved in encrypted traffic and application traffic classification, mainly as follows [9][10][11]: (1) Inaccurate traffic classification: Due to the complexity and diversity of application software, traffic can vary across devices and versions of the same application. In addition, some applications may use encryption to hide their traffic, making it difficult to classify. (2) Misclassification: Application traffic classification is usually based on a few simple rules, such as port numbers, IP addresses, and domain names, so misclassification may occur. For example, the traffic of some applications may be very similar to that of other applications even though their functions are completely different, which can easily lead to misjudgment. (3) Complex encryption algorithms: Encryption algorithms are usually very complex, which makes it difficult to classify traffic. Some encryption algorithms may also use random numbers, hash functions, and other techniques, which increases the difficulty of classification. (4) Traffic noise: Encrypted traffic may contain a lot of noise, which may affect the accuracy of traffic classification. (5) Attacks and spoofing: Some malicious attackers may use encryption to hide their attack traffic, which poses a challenge to the classification of encrypted traffic. At the same time, an attacker may spoof other types of traffic to fool the traffic classification system. (6) Traffic mixing: Users may use multiple applications at the same time, and the traffic of these applications will be mixed together, making it difficult to accurately classify and record the traffic of each application.
To address the aforementioned critical issues and improve the inspection accuracy of the original methods, the following three aspects can be considered: (1) Update malicious traffic detection rules: Network security systems can be updated with rules based on the traffic behavior patterns and features of legitimate business traffic. This can help to avoid false positives, where legitimate traffic is mistakenly identified as malicious.
(2) Perform comprehensive feature-based classification: Malicious traffic can be identified by considering multiple features, such as traffic behavior, data transmission patterns, IP addresses, and domain names. By considering multiple features together, it becomes possible to more accurately differentiate between normal business traffic and malicious traffic. (3) Use machine learning or deep learning methods: Machine learning and deep learning methods can be used to classify and identify normal business traffic. These methods learn the features and patterns of normal traffic to distinguish between normal traffic and malicious traffic.
This paper proposes an end-to-end representation learning network classification model. End-to-end representation learning is a deep learning-based approach that can directly map input data to output labels, thus avoiding manual feature extraction from the input data. For encrypted and application software traffic classification, the solution of this paper is as follows: (1) Encrypted traffic classification problem: Traditional traffic classification methods may need to decrypt encrypted traffic, which involves key management and privacy issues. The end-to-end representation learning method can directly classify encrypted traffic without decryption, thus avoiding these problems. (2) Application software traffic classification problem: The complexity and diversity of application software make traffic classification inaccurate. The end-to-end representation learning method can automatically extract useful features by learning from a large amount of traffic data, achieving more accurate traffic classification. (3) Accuracy problem: The end-to-end representation learning method can improve the accuracy and precision of classification through a multilevel neural network structure and a large amount of training data, making it more efficient than traditional methods. (4) Universality problem: The end-to-end representation learning method can deal with various types of traffic data without manual feature extraction and rule design, so it has better universality and flexibility. (5) Adaptivity problem: Because application software and encryption algorithms change constantly, traditional traffic classification methods need continuous updating, while end-to-end representation learning methods can update the model adaptively through incremental learning to adapt to new application software and encryption algorithms.
The brief steps of the end-to-end representation learning model proposed in this paper are as follows: (1) Sequential analysis with LSTM: LSTM is effective in capturing temporal dependencies in sequential data. It analyzes the order and timing of network packets, enabling the model to learn long-term dependencies and detect subtle patterns. This helps differentiate between normal traffic and potential attacks, even when similar ports or encryption techniques are used. (2) Spatial feature extraction with CNN: the CNN extracts the high-order spatial features of the network flow. (3) Feature re-weighting with SE: the squeeze-and-excitation module weights and redistributes the high-order spatial features to highlight the key spatial features of the flow. This paper selects three datasets to evaluate the performance of the model and compares it with several recent methods. The results show that the model is more accurate and performs better than other methods in most classification experiments. The main contributions of this paper are as follows: (1) This paper proposes an end-to-end representation learning model that can automatically classify application software and encrypted network traffic. This method effectively solves the problem of coarse- and fine-grained classification of application software traffic and the difficulty of accurately classifying encrypted traffic. (2) This paper applies the attention mechanism and representation learning to network traffic classification to overcome the information-processing bottleneck and improve model capability. (3) This paper uses three datasets to verify the effectiveness of the model. The experimental results show that the method has higher detection accuracy and stronger generalization ability than other methods. (4) The model can identify and classify network traffic generated by different applications, even when the traffic is encrypted, by capturing underlying patterns and features within the encrypted traffic, which improves classification accuracy.
In the context of enterprise network environments, this model can be used by network administrators to identify and differentiate traffic generated by instant messaging tools, BitTorrent, and other applications. This helps to reduce false positives, improve network security, and ensure compliance with enterprise network policies.
The rest of this paper is organized as follows. The second section reviews related work and introduces the motivation and preparation for the experiments. The third section describes the proposed method. The fourth section presents the experimental results and analysis. The fifth section concludes the paper and outlines future research directions.

Traditional network traffic classification methods
Traditional network traffic classification refers to classifying and identifying traffic by analyzing the packet features of network traffic. There has been a lot of research in this field, mainly in the following areas: (1) Protocol-based classification: This is one of the earliest network traffic classification methods; it classifies traffic by identifying the protocol identifier in the packet header. This method is simple and fast but susceptible to deception and attacks. (2) Port-based classification: This method classifies traffic based on port numbers, identifying the application type from the source and destination port numbers of the packet. This method is also simple and fast but vulnerable to port deception attacks. In early network traffic classification research, port-based methods were widely used in practical network services. However, these methods have lower recognition accuracy for applications with dynamic ports [12,13]. Lim et al. showed that port-based methods can effectively identify applications that follow the port registration rules [14]. Most researchers now use hybrid methods that combine port-based methods with other methods to improve detection accuracy. Lu et al. proposed a hybrid method that first classifies flows into corresponding applications by packet size distribution and then groups flows into sessions by port locality [15].
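The port-based method amounts to a table lookup against registered port numbers. A minimal sketch (port assignments follow IANA registrations; the table here is deliberately tiny) also shows its main weakness: any application on a dynamic or unregistered port falls through to "UNKNOWN".

```python
# Minimal port-based classification: map well-known ports to
# application protocols. Applications using dynamic ports defeat
# this lookup, which is the limitation discussed in the text.
WELL_KNOWN_PORTS = {
    20: "FTP-DATA", 21: "FTP", 22: "SSH", 25: "SMTP",
    53: "DNS", 80: "HTTP", 443: "HTTPS",
}

def classify_by_port(src_port, dst_port):
    # Check the destination port first (servers usually listen on
    # the well-known port), then the source port as a fallback.
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "UNKNOWN"

print(classify_by_port(51514, 443))   # a flow to port 443 -> HTTPS
print(classify_by_port(51514, 6881))  # dynamic/unregistered -> UNKNOWN
```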
Another direction for improvement is deep packet inspection (DPI), a traffic detection and control technology based on the application layer. When IP, TCP, or UDP data flows pass through a DPI system, the system extracts a feature library by in-depth analysis of the packet payload. In the traffic identification process, DPI matches the network flow load to the rules in the feature library. If the match is successful, the protocol corresponding to the rule is identified. Bujlow et al. conducted a comprehensive comparison of 6 commonly used DPI tools, including 2 commercial products (PACE and NBAR) and 4 open-source tools (OpenDPI, L7-filter, nDPI, and libprotoident). The test comparison results show that the PACE commercial tool has the best detection performance among the six tools, but some open-source tools, such as nDPI and libprotoident, can also achieve very high accuracy [16].
With the rapid development of network applications, some new applications no longer follow simple port registration rules, and some are encrypted. Therefore, neither port-based nor DPI-based methods can directly analyze such network traffic [17].
In recent years, machine learning methods such as SVM, Bayesian classifiers, KNN, and neural networks have been widely applied to network traffic classification with some success. However, existing flow correlation methods based on passive flow analysis suffer from high storage and computational overhead. Hu et al. [18] proposed a novel flow correlation method based on compressed sensing and neural networks. This method takes dimensionality-reduced traffic features as the input of a convolutional neural network, extracts correlation features through the network, and then uses a one-class SVM classifier to judge the correlation.
As can be seen from the above, machine learning methods have been applied to traffic classification, but their performance depends on the correct selection of traffic features and manual decision-making. This means that they cannot automatically adapt to new changes in the network. In contrast, deep learning methods generally have strong self-learning capabilities and do not require human intervention during model training. As a result, more and more researchers have begun to pay attention to and use deep learning methods to classify network flows. At present, the main popular deep learning models are convolutional neural networks (CNNs), recurrent neural networks (RNNs), and so on. These models have made great achievements in network security [19], computer vision [20], natural language processing [21,22], speech recognition [23,24], and other fields.
Wang et al. [9] proposed a malware traffic classification method using CNN. This method does not require manual feature engineering but directly uses raw network traffic as input to the classifier. It is the first attempt to apply representation learning to malware traffic classification, and the results of related experiments are promising. However, the authors also point out two potential limitations of the method: (1) The generalization ability of the method needs further verification. (2) The method only considers the spatial features of network traffic and ignores temporal features.
Li et al. [25] introduced RNN to network traffic classification. In this method, network datagrams are divided into several byte segments, which are then fed to the RNN for training and learning. Finally, the softmax function is applied to output the traffic type. The authors believe that this method has several advantages over traditional machine learning methods: (1) No prior knowledge of the target application is required. (2) Different protocols and multi-class operations can be handled.
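The preprocessing step described above, splitting a raw datagram into fixed-length byte segments for a sequence model, can be sketched in a few lines. The segment length of 4 bytes is an assumed hyperparameter, not the value used in [25].

```python
# Toy sketch of the preprocessing in [25]: split a datagram's raw
# bytes into fixed-length segments that an RNN would consume one
# step at a time. The last segment is zero-padded to full length.
def byte_segments(datagram: bytes, seg_len: int = 4):
    n_segs = -(-len(datagram) // seg_len)          # ceiling division
    padded = datagram.ljust(n_segs * seg_len, b"\x00")
    return [padded[i:i + seg_len] for i in range(0, len(padded), seg_len)]

# First 6 bytes of a (truncated) IPv4 header as example input.
segs = byte_segments(b"\x45\x00\x00\x3c\x1c\x46", 4)
print(segs)  # two 4-byte segments, the second zero-padded
```

In the full method each segment would be embedded and fed to the RNN, with a softmax layer producing the traffic type.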
To overcome the limitations of single deep learning models, researchers have begun to explore the use of multiple deep learning models in combination. For example, CNN and LSTM can be combined to analyze and refine network characteristics from multiple angles. Currently, researchers are exploring how to combine and optimize different classification methods to achieve more accurate and efficient traffic classification and identification. Additionally, researchers are constantly exploring new methods and technologies to address the challenges of classifying new applications and encrypted traffic.

Encryption and application traffic classification methods
With the widespread use of encryption and application software, network traffic classification faces new challenges. In response to these challenges, researchers have conducted a series of related work.
(1) Encryption traffic classification: Traditional network traffic classification methods cannot accurately identify the type of encrypted traffic due to the difficulty of decrypting and analyzing encrypted traffic. Researchers have proposed encryption traffic classification methods based on traffic statistical features and machine learning algorithms, such as Hidden Markov models (HMM) and collaborative decomposition algorithms. (2) Application software traffic classification: The classification of application software traffic is subjective and complex. Researchers have proposed a number of methods to overcome these problems, such as host behavior, user behavior, and deep learning. (3) End-to-end representation learning: End-to-end representation learning is a new traffic classification method that learns the end-to-end representation of network traffic to achieve traffic classification and identification. This method can overcome the problem of manual feature selection and extraction in traditional methods and has higher accuracy and flexibility. (4) Network traffic classification platforms: Researchers have developed a number of network traffic classification platforms, such as OpenDPI, L7-filter, and DPI-LIB, to facilitate and accelerate traffic classification research. These platforms provide convenient traffic classification tools and datasets, which can help researchers to conduct traffic classification research more quickly.
Network encrypted traffic classification is a technique for identifying and classifying encrypted traffic. Many researchers have attempted to solve this problem using different methods. Some related work is introduced here: (1) Deep learning-based encrypted traffic classification: This method employs deep learning models to automatically extract features from encrypted traffic and subsequently classify it. By utilizing models such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), the transmission characteristics of encrypted traffic can be fed as inputs to identify and categorize the traffic. Shapira et al. [26] introduced a novel method for classifying encrypted Internet traffic. Their approach converts elementary stream data into images and then applies a CNN to identify traffic categories (e.g., browsing, chatting, video). Lotfollahi et al. [11] attempt to distinguish between encrypted and non-encrypted traffic by combining a stacked autoencoder with a CNN. While this approach performs well on ISCX data streams, it struggles to identify subcategories of network streams when tested on Tor and YouTube data streams. (2) Statistical analysis-based encrypted traffic classification: This method employs statistical analysis techniques to examine the properties of encrypted traffic and extract valuable information for classifying it. For instance, Bayesian classifiers or support vector machines can be utilized to categorize encrypted traffic. Almomani proposed a system for analyzing and classifying VPN and non-VPN traffic using a new machine learning classifier called stacking ensemble learning. This approach was applied for the first time to the VPN and non-VPN classification problem. By combining predictions from multiple learners (random forest, neural network, and support vector machine), ensemble learning was employed to enhance prediction accuracy [27].
(3) Feature extraction-based encrypted traffic classification: This method classifies encrypted traffic by extracting and analyzing its features. For example, wavelet transform can be used to extract time-domain and frequency-domain features of encrypted traffic, and then a classifier can be employed to categorize the traffic. Okada et al. [28] investigated the impact of encryption on traffic features. They created a training dataset containing HTTP, FTP, SSH, and SMTP application protocols encrypted using PPTP and IPsec tunnels. The authors evaluated 49 traffic features and analyzed which ones had strong correlations in normal and encrypted traffic. They then used the correlated features to infer the function that transforms features between normal and encrypted traffic. Consequently, standard classifiers can be used to classify transformed traffic. The authors employed the Naive Bayesian classifier and made several modifications to validate their approach. (4) Rule-based encrypted traffic classification: This method uses predefined rules to determine the type and purpose of encrypted traffic. For example, the type of encrypted traffic can be ascertained by analyzing the features of the TLS handshake protocol. TLS is an encryption protocol that provides privacy for applications and is typically used to encapsulate common application layer protocols, such as HTTP and SMTP protocols. For instance, for SSL/TLS traffic, the following rules can be used to classify it: • If the handshake message contains an RSA key exchange algorithm, the traffic type is RSA. • If the handshake message contains an ECDHE key exchange algorithm, the traffic type is ECDHE. • If the handshake message contains an AES encryption algorithm, the traffic type is AES. • If the SHA digest algorithm is included in the handshake message, the traffic type is SHA.
These rules can be defined through an understanding of the TLS protocol and can be implemented through software to identify the type of encrypted traffic. Rule-based encryption traffic classification methods can quickly identify encryption traffic types, but their disadvantage is that they cannot adapt to new encryption protocols or algorithms, requiring constant updating and maintenance of rule tables.
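The rule table described above can be implemented as a simple string-matching pass over the negotiated cipher-suite name from the handshake. This sketch mirrors the four example rules in the text; the token list is illustrative and far from a complete TLS cipher-suite registry.

```python
# Hedged sketch of rule-based TLS classification: tag a flow by
# which key-exchange / cipher / digest tokens appear in the
# negotiated cipher-suite string. Rule table is illustrative only.
RULES = [
    ("ECDHE", "key-exchange: ECDHE"),
    ("RSA",   "key-exchange: RSA"),
    ("AES",   "cipher: AES"),
    ("SHA",   "digest: SHA"),
]

def classify_handshake(cipher_suite):
    """Return every rule label matched by the cipher-suite name."""
    return [label for token, label in RULES if token in cipher_suite]

tags = classify_handshake("TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256")
print(tags)
```

Note the maintenance problem the text points out: a suite using a newer algorithm (e.g., ChaCha20) matches no rule until the table is updated by hand.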
For example, Wei et al. [29] proposed the HNNIM (hybrid neural network identification model) model to identify malicious TLS traffic. The goal is to address the issue that classical machine learning methods are heavily influenced by expert experience, resulting in suboptimal identification and classification outcomes. The HNNIM model combines plaintext information from the TLS protocol's handshake phase and the TCP protocol's header field information, reducing reliance on expert experience and effectively improving the identification and classification of malicious TLS traffic. Korczyński and Duda [30] proposed a Markov-based method for detecting anomalous encrypted communication by extracting fingerprints from the payload of data packets in TLS/SSL sessions to identify encrypted application traffic and then modeling the TLS/SSL message type sequence using a first-order Markov chain to detect anomalous encrypted communication. This method is applicable to unidirectional communication from the server to the client for a given application but requires upgrading and updating of the application and periodic updating of the fingerprints.
As new applications and services emerge, network traffic becomes more complex and diverse, making application-level traffic classification increasingly important and popular. However, existing methods for traffic classification often have limitations in achieving acceptable real-time performance.
Shim et al. [31] proposed a new method for application-level traffic classification that utilizes a sequence of payload sizes to generate unique signatures for each application. By analyzing the packet order, direction, and payload size of the first N packets in a flow, this method identifies application traffic with high accuracy and completeness rates, over 95% and 93%, respectively.
Jae-Hyun et al. [32] also used payload size sequence (PSS) signatures to classify application-level traffic. PSS signatures represent unique flow patterns for each application, which can be used to differentiate between applications. PSS signatures are generated for each application using statistical information of flows obtained from application traffic traces. This method can easily and quickly classify application traffic in real-time networks by matching the PSS signatures of new flows to those of each application.
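The PSS idea in [31, 32] reduces to: store, per application, the (direction, payload size) tuples of the first N packets of representative flows, and classify a new flow by exact lookup. A minimal sketch follows; the signature values are made up for the example, and real systems match with tolerances rather than exact sizes.

```python
# Illustrative payload-size-sequence (PSS) matching. A flow's
# signature is the (direction, payload_size) of its first N packets;
# direction '+' = client-to-server, '-' = server-to-client.
# Signature values below are invented for the example.
N = 4

def pss_signature(packets, n=N):
    return tuple(packets[:n])

SIGNATURES = {
    pss_signature([("+", 517), ("-", 1460), ("-", 1200), ("+", 93)]): "HTTPS-browsing",
    pss_signature([("+", 68),  ("-", 68),   ("+", 1448), ("-", 0)]):  "BitTorrent",
}

def classify_flow(packets):
    # Exact lookup; unmatched flows stay unlabeled.
    return SIGNATURES.get(pss_signature(packets), "unknown")

flow = [("+", 517), ("-", 1460), ("-", 1200), ("+", 93), ("+", 300)]
print(classify_flow(flow))  # matches the first stored signature
```

Only the first N packets matter, which is what makes the method fast enough for real-time classification.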
Although some achievements have been made in application traffic classification, the main problem facing encrypted application traffic classification is encryption itself. First, since the packet payload is encrypted, traditional feature-based traffic classification methods cannot be applied directly to encrypted traffic. Second, the characteristics of encrypted traffic differ greatly from those of ordinary traffic, so special techniques and algorithms are needed to distinguish them. In addition, as encryption technology continues to evolve, traffic classification methods need to be constantly updated and improved to remain effective. Finally, the number and variety of encrypted applications are increasing, so the accuracy and scalability of classification methods need continuous improvement.
End-to-end representation learning refers to the process of learning high-level features or representations from raw data using neural networks. In the field of traffic classification, end-to-end representation learning can automatically learn the characteristics of traffic without relying on manually defined characteristics.
DeepPacket and DeepFlow are traffic classification methods that utilize deep representation learning. In [33], the authors employed a convolutional neural network (CNN) and a stacked autoencoder (SAE) to classify encrypted traffic data. They improved model performance using data augmentation and SAE learning techniques and demonstrated that DeepPacket outperformed traditional rule- and feature-based methods. Similarly, in [34], the authors used CNNs and recurrent neural networks to classify data packets. They enhanced performance with data augmentation, transfer learning, and deep supervision and showed that DeepFlow outperformed traditional methods. In [35], the authors summarized various deep learning methods used in network traffic classification, including CNNs, recurrent neural networks, and autoencoders. Experimental results indicated that deep learning methods outperformed traditional rule- and feature-based methods and performed well in different types of traffic classification tasks.
In summary, traditional machine learning methods typically require feature extraction from network flows and classification using machine learning and rule definitions. The effectiveness of network traffic classification depends on feature extraction and rule definition and requires trustworthy public network traffic datasets for comparison. With the increasing popularity of encryption technology, analyzing encrypted network traffic effectively has become an urgent problem to solve. The complexity of network services also requires continuous exploration of how to classify different types of network flows. Although representation learning has made great progress in network traffic classification, there are still some outstanding issues to be addressed, including the following: (1) Classification of unknown traffic: Representation learning algorithms typically require a large amount of training data to generate effective feature representations. However, in practical networks, there may be some unknown traffic types that cannot be properly classified. Therefore, how to classify unknown traffic is still a challenge.

Proposed solution
Deep learning has been proven to be very effective in network traffic classification tasks. However, the increasing diversity of network traffic and encrypted traffic demands continuous improvement and optimization of these models. In order to further improve the classification accuracy of encrypted and network application traffic, this paper proposes the use of multiple deep learning models to enhance network traffic classification. The paper also considers aspects such as dataset, feature selection, model optimization, and model fusion. The paper emphasizes the importance of dataset quality and diversity and the need to collect valid data. For different types of traffic, appropriate end-to-end representation learning methods should be used. Using various model optimization techniques, such as adaptive learning rate, dropout, and batch normalization, can improve model performance. In addition, using multiple model fusion methods, such as voting, weighted averaging, and stacking, can further improve model performance.
The comprehensive use of these technologies and methods can effectively improve the accuracy and generalization ability of network traffic classification, especially in the area of encrypted and network application traffic classification.
In this section, a deep learning-based spatiotemporal correlation network flow classification model is proposed. The model combines the advantages of convolutional neural network (CNN) and long short-term memory (LSTM). The model framework is shown in Fig. 1.
To automatically and effectively extract and represent the spatiotemporal features of network flows, the proposed model performs the following steps: (1) The model utilizes LSTM to extract the temporal features of the network flow.
(2) The model applies the squeeze-and-excitation (SE) mechanism to optimize the CNN network structure and improve its training performance. LSTM is a type of neural network with a memory function and is well-suited for processing temporal data. Given that network flow is typical time series data, LSTM is a suitable choice for training the model. CNN, on the other hand, has a strong ability in image feature extraction. It usually comprises three main components: convolution layers, pooling layers, and fully connected layers. A convolution layer convolves local regions of the input data with convolution kernels, while a pooling layer reduces the dimension of the training features. The fully connected layer is a traditional multilayer perceptron, often used as the output layer.
The CNN-SE-net is achieved by integrating the SE module with the traditional CNN network. The SE module is an attention mechanism that can be embedded in other classification or detection models. Its core idea is to learn feature weights based on the loss function through the network, so that the effective feature map has a larger weight, while the ineffective feature map has a smaller weight.
The detailed implementation of the model is shown in Sect. 3.3.

Datasets
A good dataset is a necessary condition for verifying the correctness of a method. Currently, researchers mainly rely on some well-known network attack datasets (such as KDD-CUP99 [36], NSL-KDD [37], and UNSW_NB15 [38]) to test network traffic classification methods. For example, Gao et al. [39] used the KDD-CUP99 dataset to validate their method of combining multiple Boltzmann machines and back-propagation algorithms to classify network flows. Shone [40] and Zhang et al. [19] tested the intrusion detection systems they developed on the NSL-KDD and UNSW_NB15 datasets, respectively.
Although these datasets contain some classic examples of network attacks, some of them are outdated and cannot adapt to new network scenarios. As time goes on, network traffic and attack types and patterns are constantly changing. Therefore, evaluating intrusion detection methods using outdated datasets may lead to evaluation results that do not match the actual situation.
In summary, as network attacks and traffic continue to evolve, evaluating the relevance of intrusion detection methods becomes increasingly difficult. To address this issue, researchers need to take innovative approaches: creating new datasets, updating existing datasets, developing more generic evaluation frameworks, and using mixed datasets.
In order to verify the comprehensive ability of the proposed method in this paper, multiple different datasets were used, including three commonly used datasets: ISCX VPN-nonVPN [41], USTC-TFC2016 [42], and the YouTube dataset [43]. These three datasets contain a large amount of encrypted, unencrypted, abnormal, and normal traffic, and the proposed method was thoroughly tested from both binary and multi-class perspectives.
The ISCX VPN-nonVPN dataset consists of both VPN and non-VPN network traffic, providing a means to evaluate the performance of intrusion detection systems and network security algorithms. VPNs employ encryption and authentication to safeguard data communication, offering enhanced network security. The dataset was captured in real-world network environments and covers various common network protocols such as HTTP, SSH, FTP, SMTP, and DNS. Additionally, it includes instances of common network attacks like DDoS attacks, port scanning, and malware propagation.
Widely utilized in academic research and practical applications, the ISCX VPN-nonVPN dataset serves as a valuable resource for studying network attack detection, intrusion detection algorithms and tools, and testing defense performance. The Canadian Institute for Cybersecurity stores this dataset in PCAP format, with network traffic categorized into 12 types based on protocol type, including chat, email, file transfer, streaming, torrent, VoIP, and more (Table 1).
The USTC-TFC2016 dataset, released by the Security Laboratory of the University of Science and Technology of China, comprises network traffic collected from real-world scenarios using packet capture technology. As Table 1 illustrates, the dataset encompasses 20 classes of network flows, consisting of 10 normal (benign application) flows and 10 abnormal (malware) flows. The USTC-TFC2016 dataset is widely recognized as a significant resource for research on traffic classification, malicious traffic detection, and network application performance evaluation.
The YouTube dataset comprises 100 encrypted video streams from Chrome, with each video viewed 100 times. The video titles are derived from current popular topics, including news, sports, nature, and more. For example, the directory http://www.cse.bgu.ac.il/title_fingerprinting/dataset_chrome_100/Hollyweezy/ contains 100 PCAPs, with "Hollyweezy" representing the video title. The dataset also includes some data packets with delays and packet loss for testing purposes.
The purpose of this dataset is to develop a model capable of identifying video titles associated with encrypted video traffic. In order to validate the proposed method's ability to accurately identify encrypted network flows, this publicly available dataset is employed. Due to the large number of source data samples and the space they occupy, only 10 samples are selected for testing and evaluation in this experiment, with a total PCAP stream size of 3.24 GB.

Preprocessing
Since the original network packets in PCAP format cannot be directly used as the input of this model, this paper refers to the literature [9] to preprocess the dataset for the PCAP packets. The specific steps include the following: traffic filtering, image generation, and IDX format conversion.

Step 1 (traffic filtering)
Since a session is a bidirectional network flow, it contains more abundant information than a unidirectional flow. Therefore, this paper adopts a traffic classification method based on the session mode. This step splits the PCAP-formatted raw packets into individual session-level packets and then cleans the traffic to remove empty and duplicate files that affect model training.
In this step, this paper uses the SplitCap [44] tool to split the network traffic from flow level to session level. SplitCap is a free and open-source PCAP file splitter. It splits a large PCAP file into multiple files based on TCP and UDP sessions, one PCAP file per session. The TCP and UDP session concepts in SplitCap are defined as bidirectional streams, i.e., all frames/packets with the same 5-tuple (source host, destination host, source port, destination port, transport protocol).
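The bidirectional 5-tuple grouping that SplitCap performs can be sketched in Python. This is a minimal illustration on hypothetical packet tuples; SplitCap itself operates on raw PCAP files:

```python
from collections import defaultdict

def session_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Return a direction-independent key so that both directions of a
    flow (A->B and B->A) map to the same bidirectional session."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

def split_into_sessions(packets):
    """Group packets (src_ip, src_port, dst_ip, dst_port, proto, payload)
    into session-level lists, mimicking SplitCap's session mode."""
    sessions = defaultdict(list)
    for pkt in packets:
        sessions[session_key(*pkt[:5])].append(pkt)
    return sessions

# Two packets of the same TCP session, one in each direction:
pkts = [
    ("10.0.0.1", 1234, "10.0.0.2", 80, "TCP", b"GET /"),
    ("10.0.0.2", 80, "10.0.0.1", 1234, "TCP", b"200 OK"),
]
sessions = split_into_sessions(pkts)
```

Because the key is direction-independent, both packets land in a single session, which is what makes session-level statistics (forward/backward byte counts, etc.) recoverable.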
From the protocol-level analysis, the traffic characteristics are mainly reflected in the application layer. For example, the SMTP protocol represents mail traffic and HTTP represents browser traffic. If the relevant network features are only obtained from a single PCAP packet, the entire process of the network session cannot be accurately reflected. By using session-based traffic analysis, new implicit statistical features can be obtained, such as congestion window size, out-of-order segments, and forward and backward flow byte sizes.

Step 2 (image generation)
To facilitate the processing of the CNN training model, the following steps are taken. Firstly, the cleaned files are standardized to a length of 784 bytes. If a file is longer than 784 bytes, it is truncated; if it is shorter, it is padded with 0x00 bytes. Each byte in the original packet represents a pixel in the image, resulting in a conversion of the 784 bytes of stream data into a 28 × 28 image matrix. Each image has three parameters: height (H), width (W), and channel (C).
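The truncate/pad step can be sketched in plain Python; this is a minimal illustration of the 784-byte normalization (the actual pipeline additionally converts the result to IDX format):

```python
def to_image_matrix(data: bytes, size: int = 784, width: int = 28):
    """Truncate or zero-pad a session's byte stream to `size` bytes,
    then reshape it into a width x width grayscale pixel matrix."""
    fixed = data[:size].ljust(size, b"\x00")   # truncate, or pad with 0x00
    pixels = list(fixed)                       # each byte -> one pixel (0-255)
    return [pixels[i * width:(i + 1) * width] for i in range(size // width)]

img = to_image_matrix(b"\x01\x02" * 500)       # 1000 bytes -> truncated to 784
short = to_image_matrix(b"\xff")               # 1 byte -> padded with zeros
```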
To convert PCAPs/flows into a 784-byte image, the following steps are involved: (1) Obtain PCAPs/flows data: PCAPs are packet capture file formats, while flows are a data representation method based on network traffic statistics. Tools such as Wireshark and Tcpdump can be used to collect this data. (2) Preprocess the PCAPs/flows data: Since PCAPs/flows data is typically large, preprocessing is necessary to reduce the data volume and extract relevant information. This involves filtering out unnecessary packets, extracting packets within a specific time period, and removing irrelevant data based on specific requirements. (3) Feature extraction: Useful features such as packet length, source IP address, destination IP address, source port number, destination port number, and transport layer protocol type need to be extracted to convert PCAPs/flows data into an image. The selection of features depends on the specific task at hand, such as detecting malicious traffic by extracting traffic direction and duration. (4) Feature encoding: The extracted features need to be encoded, which can be achieved through techniques like one-hot encoding or embedding encoding. The choice of encoding method should consider its impact: one-hot encoding results in high-dimensional vectors with feature independence, while embedding encoding maps features to a lower-dimensional space but may lead to information loss. (5) Optional post-processing steps can be performed on the generated image, such as image enhancement or denoising, to improve the image quality.

Step 3 (IDX format conversion)
The processed data is converted into the IDX file format, which serves as the input for the LSTM and CNN networks [45].
To ensure clarity in the preprocessing process, let us use an example from the YouTube dataset, specifically the flow titled "Maroon 5_Sugar." Fig. 2 showcases a selection of original flows downloaded from the network in this study. The IDX format is commonly used for storing large multidimensional arrays or tensors efficiently. By converting the preprocessed data into IDX files, it becomes compatible with the LSTM and CNN networks, enabling further analysis and classification tasks. It is worth noting that IDX format conversion provides a structured and standardized representation of the data, facilitating seamless integration with the chosen network models for subsequent processing and analysis.
The first PCAP network flow consists of a total of 63,855 TCP records. Visual representation of these records can be achieved using tools like Wireshark, as depicted in Fig. 3. While Wireshark allows for basic information retrieval, such as source IP address and source port, it does not provide statistical insights into network sessions.
After completing step 1, subflows in session-level PCAP format are obtained, as illustrated in Fig. 4.  Furthermore, comprehensive network communication information can be obtained by opening a session PCAP packet in Wireshark, as depicted in Fig. 5. This includes details of TCP flow establishment, data transmission, and release, along with various relevant statistical features that can be calculated. After the processing of step 2, the network session flow is converted into a 784-byte image, which serves as the input for the CNN network. The processed images of the three datasets are displayed in Figs. 6, 7, and 8, respectively. It is evident that most of the images exhibit distinguishable texture features, while only a few share similarities, such as FTP and SMB. The traffic visualization results indicate clear distinctions between images representing different types of traffic, demonstrating the feasibility of using session flow-generated images for traffic classification.

LSTM
Network traffic is the language of communication between computers, which is transmitted in the form of sequence in the network and contains rich time-related information. In order to comprehensively analyze the network traffic and extract its temporal correlation features, this paper adopts the LSTM method.
LSTM networks are a special type of RNN that can learn long-term dependencies. LSTM was proposed by Hochreiter and Schmidhuber in 1997 [46]. LSTMs have achieved considerable success on many problems and are widely used. The structure of the LSTM model used in this paper is shown in Fig. 9. The long short-term memory network consists of several memory units, each composed of linear units and a self-connection with a constant weight of 1.0. This allows a value (forward pass) or gradient (backward pass) to flow into this self-looping unit and be saved and retrieved after the desired time step.
The 784-byte data obtained from the preprocessed network stream will be converted into a matrix with a value between 0 and 255, and then, this matrix will be generated into a 28 × 28 × 1 single-channel image. These images will be sent to the LSTM model to extract the time series features contained in the network flow. The internal structure of LSTM is more complex, and the core is the unit state flow shown in Fig. 9. The unit state flow is controlled and adjusted by three gate mechanisms: forgetting gate ( F t ), input gate ( I t ), and output gate ( O t ).
The output of the previous cell, $h_{t-1}$, and the input of the current cell, $x_t$, enter the forgetting gate together to obtain the retention degree $F_t$ of the previous hidden layer:

$$F_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1}$$

where $\sigma$ refers to the sigmoid function, $W_f$ is the weight matrix of the forgetting gate, and $b_f$ is its bias. The input gate calculates $I_t$ and $C'_t$ to determine the new data and the extent to which it needs to be retained:

$$I_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \qquad C'_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$

where $W_i$ and $W_c$ are the weight matrices of the input gate, and $b_i$ and $b_c$ are the corresponding biases. The cell state is then updated as $C_t = F_t \odot C_{t-1} + I_t \odot C'_t$. Finally, according to the calculation results of the input gate and the forgetting gate, the output gate obtains the next output results $O_t$ and $h_t$:

$$O_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \qquad h_t = O_t \odot \tanh(C_t)$$

where $W_o$ is the weight matrix of the output gate and $b_o$ is its bias.
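The gate equations can be sketched as one cell update in numpy. Toy dimensions and random weights are used purely for illustration; a trained model learns these parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step implementing the gate equations above.
    W and b hold parameters for the four transforms (f, i, c, o),
    each applied to the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate F_t
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate I_t
    c_hat = np.tanh(W["c"] @ z + b["c"])     # candidate state C'_t
    c_t = f_t * c_prev + i_t * c_hat         # new cell state C_t
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate O_t
    h_t = o_t * np.tanh(c_t)                 # new hidden state h_t
    return h_t, c_t

rng = np.random.default_rng(0)
d, h = 4, 3                                  # toy input / hidden sizes
W = {k: rng.normal(size=(h, h + d)) for k in "fico"}
b = {k: np.zeros(h) for k in "fico"}
h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h), W, b)
```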

CNN-SE-net
CNN is an excellent deep learning model for image processing, capable of automatically extracting features from high-dimensional data using shared convolutional kernels, without encountering significant computational challenges. However, researchers have identified certain limitations when employing a single CNN network. Firstly, as the network depth increases, modifying parameters through backpropagation can lead to slower changes in parameters near the input layer. Secondly, using gradient descent algorithms may cause the training process to converge to local minima instead of the global minimum. Lastly, the pooling layer may result in the loss of valuable information and disregard the correlation between individual parts and the overall context. Consequently, it is not feasible to directly employ a single CNN model for network flow training. To enhance the accuracy and efficiency of CNN models in network traffic classification, this study proposes optimizations from two perspectives.
Firstly, during CNN model training, the large convolutional kernel's receptive field can be simulated by stacking multiple 3 × 3 convolutional kernels. For instance, three 3 × 3 convolutional kernels can be stacked to achieve a receptive field equivalent to a 7 × 7 convolutional kernel. This approach increases the network's depth and search space, reduces the number of model parameters, and enhances overall performance [47]. Additionally, selecting a smaller stride value during model training prevents the loss of detailed information associated with larger strides. Therefore, this study sets the model's stride parameter to 1.
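The receptive-field equivalence and the parameter saving can be checked with a short calculation (stride 1 throughout, as in the model):

```python
def stacked_receptive_field(n_layers, kernel=3, stride=1):
    """Effective receptive field of n stacked conv layers: with stride 1,
    each extra kernel x kernel layer adds (kernel - 1) to the field."""
    rf = 1
    for _ in range(n_layers):
        rf += (kernel - 1) * stride
    return rf

# Per input/output channel pair, three 3x3 kernels cost fewer weights
# than the single 7x7 kernel they emulate:
params_stacked = 3 * (3 * 3)   # 27 weights
params_single = 7 * 7          # 49 weights
```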
Secondly, the self-attention mechanism of the SE module [48] is employed to extract spatial and channel information while recalibrating the interdependence among feature map channels. The SE module generates modulation weights based on the global information of the feature map, enhancing or suppressing different channels based on specific classification tasks.
The attention mechanism, also referred to as "neural network attention," comprises three steps: information input, calculation of the attention distribution, and processing of the input information based on the calculated attention distribution. By incorporating the self-attention mechanism into the CNN model, this study effectively captures spatial and channel dependencies, leading to improved performance in network traffic classification tasks.
Let $x_i \in \mathbb{R}^d$ be an input vector, $X = [x_1, x_2, \cdots, x_N]$ be $N$ input samples, $q \in \mathbb{R}^k$ be the query (feature) vector, and $z \in [1, N]$ be the attention variable, which indicates the position of the selected information; for example, $z = i$ means the $i$-th input vector is selected.

The general attention mechanism is divided into soft attention and hard attention. The soft attention mechanism is formulated as:

$$\alpha_i = \mathrm{softmax}(s(x_i, q)) = \frac{\exp(s(x_i, q))}{\sum_{j=1}^{N} \exp(s(x_j, q))}, \qquad \mathrm{att}(X, q) = \sum_{i=1}^{N} \alpha_i x_i$$

where $\alpha_i$ is called the attention distribution and $s(x_i, q)$ is the attention scoring function. The attention distribution $\alpha_i$ can be interpreted as the degree of attention paid to the $i$-th input vector for a given query $q$; the soft attention selection mechanism aggregates the inputs weighted by this distribution.

Hard attention instead selects information by maximum sampling or random sampling. Selecting the input with the highest probability is formulated as:

$$\mathrm{att}(X, q) = x_j, \quad \text{where } j = \arg\max_i \alpha_i \tag{9}$$

The specific implementation of embedding SE in CNN is shown in Fig. 10. After preprocessing, the original network flow is input into the LSTM module for time series analysis, and the output $X = [x_1, x_2, \ldots, x_C]$ is obtained. After the convolution operation, the output $U$ is obtained, as shown in formula 10:

$$u_c = v_c * X \tag{10}$$
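The soft and hard attention selections can be sketched in numpy. Dot-product scoring is assumed here for the scoring function $s(x_i, q)$; other scoring functions are possible:

```python
import numpy as np

def soft_attention(X, q, score=np.dot):
    """Soft attention: softmax over the scores s(x_i, q), then return the
    attention distribution and the weighted sum of the inputs."""
    s = np.array([score(x, q) for x in X])
    a = np.exp(s - s.max())                 # numerically stable softmax
    a = a / a.sum()                         # attention distribution alpha_i
    return a, (a[:, None] * X).sum(axis=0)

def hard_attention(X, a):
    """Hard attention: select the single input with the highest alpha_i."""
    return X[np.argmax(a)]

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # N = 3 inputs, d = 2
q = np.array([2.0, 0.0])                             # query vector
a, ctx = soft_attention(X, q)
```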
where $*$ denotes convolution and $V = [v_1, v_2, \ldots, v_C]$ is the set of $C$ convolution kernels of size 3 × 3. The first three convolution layers of the network use 32, 64, and 64 convolution kernels, respectively, so the channel number of the feature map is converted as 1-32-64-64.
The fourth network layer is a 2 × 2 max pooling layer with a stride of 2, and the SE module is embedded in the fifth layer of the model. The SE module is a channel-based attention model with simple deployment and a small amount of calculation [48]. In this paper, the SE module is embedded in the CNN to enhance or suppress feature channels through learned weights, so that the model focuses on the spatial features of greater importance.
The specific operation is further subdivided into three main steps: global average pooling $F_{sq}(U)$, weight generation $F_{ex}(w^*, W)$, and weight redistribution $F_{scale}(u_c, \tilde{w}_c)$.
(1) Global average pooling: The feature map output by the max pooling layer has three dimensions, namely width, height, and number of channels. The global average pooling operation first compresses the width and height dimensions to 1 × 1, while the number of channels remains unchanged. The resulting feature map is $w^* = [w^*_1, w^*_2, \ldots, w^*_C]$, $w^* \in \mathbb{R}^{1\times 1\times C}$, which has a global receptive field. The specific calculation of $w^*$ is shown in formula 11:

$$w^*_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j) \tag{11}$$

(2) Weight generation: Two fully connected layers first reduce and then restore the dimension of channel $C$, increasing the nonlinear interaction between channels. This operation produces a weight vector representing the importance of each channel based on global information, as described in formula 12:

$$\tilde{w} = F_{ex}(w^*, W) = \sigma(W_2\, \delta(W_1 w^*)) \tag{12}$$

where $W_1 \in \mathbb{R}^{\frac{C}{r}\times C}$, $W_2 \in \mathbb{R}^{C\times \frac{C}{r}}$, $\delta$ is the ReLU function, and $r$ is the reduction ratio of the dimensionality-reduction layer.

(3) Weight redistribution: Finally, the output $U \in \mathbb{R}^{H\times W\times C}$ of the max pooling layer is multiplied channel-wise by the feature weights $\tilde{w} = [\tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_C]$ to obtain the output of CNN-SE-net:

$$\tilde{u}_c = F_{scale}(u_c, \tilde{w}_c) = \tilde{w}_c \cdot u_c$$

It can be seen that the importance of each channel changes after weight redistribution.
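The three SE steps (squeeze, excite, scale) can be sketched in numpy. Toy shapes and random weights are used for illustration; in the model, $W_1$ and $W_2$ are learned during training:

```python
import numpy as np

def se_module(U, W1, W2):
    """Squeeze-and-excitation on a feature map U of shape (H, W, C):
    squeeze (global average pool), excite (FC -> ReLU -> FC -> sigmoid),
    then rescale each channel by its learned weight."""
    w_star = U.mean(axis=(0, 1))                     # squeeze F_sq, shape (C,)
    hidden = np.maximum(0, W1 @ w_star)              # FC + ReLU, reduce to C/r
    w_tilde = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))   # FC + sigmoid, shape (C,)
    return U * w_tilde, w_tilde                      # scale F_scale (broadcast)

rng = np.random.default_rng(1)
H, W_, C, r = 4, 4, 8, 2
U = rng.normal(size=(H, W_, C))
W1 = rng.normal(size=(C // r, C))     # dimensionality-reduction layer
W2 = rng.normal(size=(C, C // r))     # dimensionality-restoration layer
U_out, w_tilde = se_module(U, W1, W2)
```

Because the sigmoid keeps each weight in (0, 1), channels are softly enhanced or suppressed rather than hard-selected.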
The output from the SE module undergoes another round of processing through three convolution layers, each utilizing small 3 × 3 convolutional kernels. These convolution layers are configured with 32, 32, and 16 convolutional kernels, respectively, resulting in a channel conversion of 32-32-16. Subsequently, the model is trained with a maximum pooling layer of 2 × 2 dimensions and a stride of 2. This is followed by two fully connected layers, one with a dimension of 1024 and another with a dimension matching the number of traffic classes. Finally, the softmax layer is employed to produce the ultimate classification results.

Basic performance test
In this paper, three international public network traffic datasets are chosen for several experiments: ISCX VPN-nonVPN, USTC-TFC2016, and YouTube. These datasets consist of raw PCAP flow files, which are described in Sect. 3.1. After the preprocessing detailed in Sect. 3.2, each PCAP flow is transformed into a 28 × 28 × 1 image. The experimental parameters are outlined in Table 2.
For evaluating the proposed detection scheme, the Python, Scikit-learn, NumPy, Pandas, TensorFlow, and Keras machine learning libraries are utilized in this study. Initially, 90% of the samples are randomly selected as the training set, while the remaining 10% serve as the test set. To address potential overfitting issues, a dropout layer is introduced after the first fully connected layer. Cross-entropy and Adam are employed as the loss function and optimizer, respectively. During the training phase, the ReLU and softmax activation functions are predominantly used.
Dropout is a regularization technique commonly employed in convolutional neural networks (CNNs) to mitigate overfitting. By randomly deactivating neurons during training, dropout reduces the complexity of the neural network and fosters independence among neurons. This, in turn, enhances the network's ability to generalize and make more robust predictions. In these experiments, a dropout value of 0.7 indicates that 70% of the neurons in each layer are randomly ignored or "dropped out" during each iteration of training. This technique effectively reduces interdependence among neurons and helps prevent overfitting, allowing the network to learn more generalized features. During testing, all neurons are utilized, activating the entire network for making predictions to ensure optimal performance.

Table 2 The parameters of the model (learning rates tested: $10^{-2}$, $10^{-3}$, $10^{-4}$)
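The train/test behavior of dropout described above can be sketched in plain Python. Inverted dropout is assumed here (the scheme used by modern frameworks such as Keras): survivors are scaled by 1/(1−rate) during training so that expected activations stay unchanged, and the layer becomes the identity at test time:

```python
import random

def dropout(activations, rate=0.7, training=True, seed=None):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and scale survivors by 1/(1-rate); at test time all neurons
    are active and the layer is the identity."""
    if not training:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [0.5] * 1000
train_out = dropout(acts, rate=0.7, seed=42)          # ~70% of units zeroed
test_out = dropout(acts, rate=0.7, training=False)    # identity at test time
```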
To assess the impact of this method and provide an objective comparison with other approaches, the experimental results in this paper are evaluated using metrics such as F1 score, recall, accuracy, and precision. From the experimental results in these figures, it is evident that as the training epochs increase, the model's detection accuracy improves, gradually approaching 100%. Simultaneously, the loss rate decreases, progressively approaching 0%.
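The per-class evaluation metrics can be computed directly from confusion counts; a minimal sketch, equivalent in spirit to Scikit-learn's binary precision/recall/F1 for a chosen positive class:

```python
def prf1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1 from label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels for illustration only:
y_true = ["vpn", "vpn", "chat", "vpn", "chat"]
y_pred = ["vpn", "chat", "chat", "vpn", "chat"]
p, r, f1 = prf1(y_true, y_pred, positive="vpn")
```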
Furthermore, the figures demonstrate that in the selected experiments, the model generally achieves optimal results after 200 training iterations. The model exhibits a rapid convergence speed, effectively reducing the consumption of computing resources through fewer iterations of training.
To further assess the overall performance of the model and its ability to accurately classify different network traffic types, the paper conducted detection on the network applications within the three datasets. The detection results are presented in Tables 3, 4, and 5.
The results indicate that the precision, F1 score, and recall values all exhibit high performance. In Table 3, the average precision is 97.54%, the average F1 score is 97.61%, and the average recall is 97.72%. Moreover, in Tables 4 and 5, the average values for these three indicators also exceed 96%. These high values provide strong evidence that the model proposed in this paper is capable of effectively classifying network traffic across different datasets and application types.

Comparative experiment
To further validate the performance of the proposed model, the paper conducted comparative experiments with several existing methods on the three datasets. Draper-Gil et al. [10] employed the C4.5 machine learning technique to classify traffic in the ISCX VPN-nonVPN dataset. Wang et al. [1] proposed the use of multichannel LeNet-5, a variant of the LeNet-5 CNN architecture, for network traffic classification. This method demonstrated better performance in classifying VPN traffic compared to the C4.5 method. Lotfollahi et al. [11] utilized a combination of SAE and 1D CNN for network flow classification. Dubin et al. [49] employed the KNN method for flow classification.
Using the same training and test datasets, the proposed method was compared with the aforementioned methods; the results are presented in Tables 6, 7, and 8. From the results in Table 6, the C4.5 method did not perform well in detecting both VPN and non-VPN flows, with a precision value lower than 85%. The other two CNN methods showed improvements in detecting VPN flows, achieving precision values exceeding 92%, but the improvement in non-VPN flow detection was not significant. The proposed method demonstrated good performance in detecting both VPN and non-VPN flows by extracting and optimizing the spatiotemporal features of network flows, with precision, recall, and F1 score values exceeding 95%.
Additionally, since the original literature for the C4.5 and 1D CNN methods did not include the F1-score indicator, the corresponding values are indicated with "-" in Table 6.
Regarding the results in Table 7, the multichannel LeNet5 method showed strong performance in the USTC-TFC2016 dataset, achieving a detection rate of 97% for benign flows and over 98% for malware flows. Similarly, the proposed method in this paper yielded excellent results on this dataset. Although the overall indicators were slightly weaker than the former method, all three indicators for detecting the two types of flows surpassed 98%.
In the experiments conducted on the YouTube dataset (Table 8), the literature [49] primarily focused on precision values. Therefore, this paper only compared the precision value with the KNN method. The experimental results showed that the proposed method outperformed the KNN method in identifying YouTube video flows. It was capable of distinguishing between encrypted and non-encrypted flows, with a detection precision value exceeding 96%.
To evaluate the model's performance during training, tests were conducted on the ISCX VPN-nonVPN dataset. The training involved 12 different network streams, using epoch = 5000 and learning rate (LR) values ranging from 0.0001 to 0.01. The objective was to observe changes in model accuracy with varying numbers of training rounds and LR values, aiming to identify the optimal LR value. Figure 14 illustrates the accuracy variations of the model with different LR values. It can be observed that an LR of 0.0001 led to slow convergence, as the model's accuracy failed to stabilize even after 300 epochs of training. This is because the learning rate is a hyperparameter that adjusts the weights of the network based on the loss function, and a small LR value can result in longer convergence times. In contrast, setting the LR to 0.01 caused significant oscillations in the learning rate curve, indicating that the LR was too high. This caused the gradient to swing in a region near the minimum value during training, making it difficult for the model to converge.

The experiments revealed that the model achieved the fastest convergence with a learning rate of 0.001. This LR value produced a small range of oscillation in the accuracy curve and resulted in the highest accuracy rate after convergence. Therefore, a learning rate of 0.001 was selected for this study. In summary, the experiments demonstrated that the optimal LR value depends on the dataset and model architecture. Careful tuning of the LR is crucial for achieving good performance during training.
To further evaluate and compare the time-consuming performance of different models in model training, the study assesses the model's training time and convergence time. Training time refers to the duration spent by the model to train on the training set, depending on factors such as the number of iterations, model size, and experimental hardware environment. Convergence time, on the other hand, relates to the time taken for the model to reach a relatively stable performance output and is influenced by the model size and other parameters. Table 9 presents the training time required by each model (1D CNN [1], SAE + 1D CNN [11], and the model proposed in this study) to complete 5000 epochs and the time taken to reach convergence based on the same test set using the ISCX dataset.
As indicated in Table 9, the model's training time for completing 5000 epochs on the ISCX dataset is 1967.76 s, which corresponds to only 93.4% of the training time of the 1D CNN model and 88.8% of the training time of the SAE + 1D CNN model. Moreover, the convergence time on the ISCX dataset is 589.17 s, representing only 81.7% of the convergence time of the 1D CNN model and 85.9% of the convergence time of the SAE + 1D CNN model. Tables 6 and 9 demonstrate that the proposed model achieves lower time usage compared to the 1D CNN model with the simplest structure while maintaining an accuracy rate that is over 5% higher. Compared to the SAE + 1D CNN model, the proposed model significantly reduces training time, reaches convergence earlier, and increases the average training accuracy by more than 3%.
In summary, the proposed model demonstrates excellent performance and requires less training time than other models. However, it is important to note that actual training time and convergence time may vary due to various factors, including dataset and model parameters.
Therefore, careful consideration of these factors is crucial when selecting and optimizing models for specific applications.

Conclusion
Encrypted traffic and network application traffic classification are important issues in the field of network security and are of great significance for real-time monitoring and defense against network attacks. To address the difficulty of effectively identifying malicious traffic among the many types of unencrypted and encrypted traffic, this paper proposes a deep learning traffic classification model that combines LSTM, CNN, and SE methods. First, the method removes the heavy dependence of older classical machine learning approaches on accurate manual extraction of network traffic features. Second, by using the LSTM method to automatically capture temporal features and the CNN method to capture the spatial features of network flows, it resolves the problems of temporal feature correlation and incomplete spatial feature coverage. In addition, embedding the SE mechanism in the CNN allows the correlation of channels between different network layers to be analyzed, improving the accuracy of the model's feature selection. The results of the different experiments show that the proposed method is feasible and can handle the classification of diverse network traffic.
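The channel-reweighting step performed by the SE mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact configuration: the feature-map shape, weight shapes, and reduction ratio r are assumptions chosen for clarity.

```python
# Minimal NumPy sketch of a squeeze-and-excitation (SE) block for a 1-D
# feature map of shape (C, L): squeeze by global average pooling, excite
# with two fully connected layers (ReLU then sigmoid), and rescale each
# channel by its learned weight.
import numpy as np

def se_block(x, w1, w2):
    """x: feature map (C, L); w1: (C//r, C) squeeze FC; w2: (C, C//r) excitation FC."""
    z = x.mean(axis=1)                     # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)            # excitation: FC + ReLU -> (C//r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))    # FC + sigmoid -> per-channel weight in (0, 1)
    return x * s[:, None]                  # reweight each channel of the feature map

rng = np.random.default_rng(0)
C, L, r = 8, 16, 4
x = rng.standard_normal((C, L))
y = se_block(x, rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)))
print(y.shape)  # (8, 16)
```

Because the sigmoid keeps every channel weight in (0, 1), the block attenuates less informative channels rather than discarding them, which is the sense in which it "weights and redistributes" the CNN's spatial features.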
The model proposed in this paper can effectively enhance enterprises' capabilities in detecting and defending against network attacks. By integrating it with IDS, IPS, and other network security systems, enterprises can achieve accurate identification and classification of network traffic. The model's higher classification accuracy reduces false positives and false negatives, improving the detection accuracy and reliability of IDS. Its stronger feature extraction capability yields a more comprehensive description of traffic characteristics, supplying IDS with more accurate information for identifying and classifying network traffic. The SE mechanism further enhances the model's robustness by analyzing channel correlations across different layers of the network, enabling IDS to respond effectively to variations and threats in network attacks.
With the rapid development of the Internet and the gradual improvement of security awareness, network traffic types and encrypted traffic forms continue to multiply, placing higher demands on the detection rate and accuracy of the model. The future work of this