ODTC: An online darknet traffic classification model based on multimodal self-attention chaotic mapping features

Abstract: Darknet traffic classification is important to network management and security. To achieve fast and accurate classification, this paper proposes an online classification model based on multimodal self-attention chaotic mapping features. On the one hand, the payload content of the packet is input into a network integrating CNN and BiGRU to extract local space-time features. On the other hand, flow-level abstract features processed by an MLP are introduced. To compensate for the insufficient learning of indistinct features, a feature amplification module that uses logistic chaotic mapping to amplify fuzzy features is introduced. In addition, a multi-head attention mechanism is used to uncover the hidden relationships among different features. Furthermore, to better support new traffic classes, a class incremental learning model is developed with a weighted loss function to achieve continuous learning with reduced network parameters. The experimental results on the public CICDarknet2020 dataset show that the accuracy of the proposed model is improved in multiple categories, while the time and memory consumption is reduced by about 50%. Compared with existing state-of-the-art traffic classification models, the proposed model has better classification performance.


Introduction
The darknet can meet the needs of Internet users for identity concealment, and users can only access it through specific anonymity tools. Nowadays, the commonly used anonymous tools include Tor, I2P, VPN, JAP, and others [1]. These anonymous tools can be used to hide the user's identity information and counter traffic analysis technology. While protecting user privacy, the darknet also brings challenges to the network order. Accurately and effectively classifying darknet traffic has positive and important significance in the field of network security [2].
Currently, common methods for traffic classification include port number-based, machine learning-based and deep learning-based methods [3]. The port number-based identification method depends on the correspondence between the port number and the application. With the gradual maturity of port obfuscation technology, it has become difficult to identify this correspondence, which leads to a serious decline in the method's effectiveness. The identification method based on deep packet inspection (DPI) [4] implements traffic classification by defining regular expressions for different categories. The DPI method is effective for plaintext traffic but useless for ciphertext traffic. With the increasing proportion of encrypted traffic, this method gradually fails [5].
At present, many researchers adopt traditional machine learning algorithms based on hand-crafted features to solve the darknet traffic classification problem. For instance, several conventional machine learning methods, such as the Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), and others (Hu et al. [6], 2020; Montieri et al. [1], 2019), have already been applied to it. However, these methods either rely heavily on hand-crafted features or resort to a time-consuming feature selection process [7]. Thus, they may lead to unstable performance when dealing with different network environments.
As a further improvement, deep learning methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have also been introduced to automatically extract high-level feature representations from network traffic (Niloofar et al. [8], 2021; Hu et al. [9], 2021; Habibi et al. [11], 2020; Lin et al. [12], 2021; Kwihoon et al. [13], 2022). These methods significantly reduce the reliance on expertise and improve generalizability across different network environments. Their main purpose is to automatically extract new, more expressive, meaningful and distinctive features from the original packets.
Although the above methods can achieve good classification performance, several deficiencies remain. They often ignore the relationships among different features and perform poorly when darknet user behaviors are similar [17]. The features used are relatively simple, and the models cannot accurately fit the nonlinear relationship between the input and output. The models need to be retrained when faced with new traffic types [18]. Some user behaviors in the darknet are too similar, making it necessary to extract detailed features to enhance their representation.
To solve the above problems, we propose an online identification method for encrypted traffic based on multimodal self-attention chaotic features. This method combines the statistical features and payload features of anonymous traffic to reconstruct the feature modalities of the traffic and obtain richer features. It also introduces incremental learning, which can continuously learn new traffic categories without retraining the model. The proposed method accurately classifies user behavior in the darknet and meets the online parameter update requirement for new traffic classes. The main contributions of this paper can be summarized as follows.
(1) We propose an end-to-end traffic classification model based on CNN-BiGRU and MLP. The model exploits the automatic feature extraction of deep learning and combines multimodal inputs of traffic images and text. On the one hand, the traffic payload is encoded and input into CNN-BiGRU to obtain spatio-temporal features. On the other hand, an MLP is used to process traffic text features. This method combines different modal features to enhance the model's representation ability and improve the classification accuracy.
(2) We introduce a local feature amplification module that is composed of a multi-head self-attention mechanism and a Logistic chaotic map. The multi-head self-attention mechanism can look for the correlation between local and global features, while the Logistic chaotic map amplifies fuzzy features.
(3) We use an incremental learning model and a weighted loss function to continuously learn new traffic categories without retraining the model. The model is trained with partial samples of old and new classes to overcome the parameter forgetting problem of deep learning models. At the same time, we introduce a deviation correction layer to deal with the imbalance between old and new sample classes. This reduces network parameters and saves training time and resource consumption, significantly enhancing the model's online update capability. The experimental results on the public datasets CICDarknet2020 and Darknet2020 demonstrate that the proposed method outperforms existing methods in the classification of darknet encrypted traffic.
The rest of the paper is organized as follows: Section 2 summarizes the current research achievements in the analysis of darknet encrypted traffic and identifies the shortcomings of existing studies. Section 3 provides a detailed description of the proposed method for analyzing darknet encrypted traffic. Section 4 presents the dataset used in this study, along with the specific experimental parameters and analysis of the experimental results. Finally, we conclude this paper in section 5.

Related works
In recent years, researchers have carried out a lot of fruitful work in darknet encrypted traffic analysis. Shahbar et al. [19] proposed using the C4.5 decision tree based on the statistical characteristics of network flows to analyze user behavior traffic on the I2P network. Rao et al. [20] proposed an unsupervised method based on gravitational clustering to identify Tor anonymous traffic from normal network traffic. Iliadis et al. [21] conducted a comparative study on feature selection and classification models, in which the important flow statistical features are selected to accurately classify the CICDarknet2020 darknet traffic dataset.
Researchers tend to use deep learning methods for darknet traffic classification because of their powerful nonlinear learning capabilities, which can automatically extract deep features. Sarwar et al. [22] adopted the improved convolutional recurrent neural networks CNN-LSTM and CNN-GRU for darknet traffic classification to complete the application identification task. In addition, they compared the performance of traditional machine learning and deep learning methods. The results show that deep learning is more accurate than traditional machine learning methods in classifying darknet traffic. Shapira et al. [14] intercepted the network flow over a period of time, normalized it into a two-dimensional histogram of arrival time and packet size, and used a 2D CNN to classify it. Habibi Lashkari et al. [11] proposed the DeepImage method, which used a tree classifier to select important statistical features for one-hot encoding and converted them to grayscale images; a 2D CNN was then used to classify darknet traffic.
To improve the representation ability of darknet traffic features, spatiotemporal feature fusion methods and attention mechanisms are gradually applied to classification models. Yao et al. [23] proposed an LSTM network with a hierarchical attention mechanism. Xie et al. [24] proposed the HSTF-Model to extract spatial and temporal features at the packet level and flow level. Hassan et al. [25] proposed a deep learning model based on CNN and an improved LSTM network, which used CNN to extract spatial features from network traffic data, and used LSTM to extract temporal features from the above features extracted by CNN. Kanna et al. [26] proposed a hybrid model of improved CNN and multi-scale LSTM networks for extracting spatiotemporal features from input network streams. Liu et al. [27] proposed a Bi-GRU network with an attention mechanism as a lightweight model to classify traffic. Angelo et al. [28] adopted the method of spatiotemporal feature fusion to enhance the representativeness of traffic features, and at the same time introduced an attention mechanism to increase the weight of the features that contribute the most to traffic classification. However, these methods do not extract the intrinsic connection between local and global features at different locations, but only use the attention mechanism to assign weights to feature vectors. The traffic generated by some different user behaviors is very similar, and it is difficult to extract fuzzy features that affect the classification performance. Fu et al. [29] proposed a novel MSIF method in response to the counterintuitive results that may arise from highly conflicting evidence in multisource information fusion. This method is based on a newly defined generalized evidential divergence measure among multiple sources of evidence and provides a new approach for extracting fuzzy features.
The types of darknet applications are complex, constantly changing and growing, so the ability to update online is crucial to the model. However, most existing models can only update parameters offline. Therefore, researchers began to reduce the parameter update time of the model to give it a certain adaptability. Lopez et al. [30] proposed extracting the port number, packet payload length, packet interval time, window size and other attributes of the first 20 packets of the data stream to form a 20×6 matrix, which is input into the CNN-LSTM model. Vu et al. [31] selected only 22 features to characterize the flow and used a generative adversarial network CGAN to address the class imbalance problem. Wang et al. [32] truncated the first 784 bytes of the flow and constructed an end-to-end model based on CNN for online traffic classification. Liu et al. [33] characterized the network flow with the long sequence of the first 128 packets in the flow and input it into the LSTM to identify the flow online. Although the methods above improve classification efficiency, it is difficult for them to update the model parameters online. After the traffic classes change, the model needs to learn from scratch, which consumes a lot of time and memory. In addition, due to memory constraints, it is not possible to retain a large number of samples from the old classes, resulting in an imbalance between the new and old classes. During the fine-tuning process, the model expands its knowledge by learning from samples of the new classes while trying to preserve its understanding of the old classes as much as possible. However, limited memory capacity means that only a limited number of samples from the old classes can be retained. The model may then have difficulty handling the old classes, as it does not have enough samples to accurately represent and distinguish them. This imbalance directly affects the performance of the model.
To solve the above problems, the traffic is input into the model in a multimodal way [34]. Based on the CNN-Bi-GRU network and the MLP network, spatiotemporal features and high-level abstract features are extracted. We integrate the multi-head self-attention mechanism [35] into the feature extraction process, so as to make full use of the internal relationships of spatiotemporal features extracted at different locations. Logistic chaotic mapping amplifies ambiguous features, which helps to obtain highly discriminative features for fine-grained darknet traffic classification. In this paper, convolution kernels of different sizes are used in the network, and features of multiple granularities are fused to classify darknet traffic more accurately. Furthermore, to facilitate the online update of the model, we adopt the incremental learning [37] method. First, inheritance learning is performed using a weighted loss function. Then, a deviation correction layer [36] is introduced to deal with the imbalance between old and new class samples. This approach significantly improves the online update efficiency of the proposed model.

The proposed method
The overall framework of the proposed model is shown in Fig.1, which mainly includes three modules: a multimodal reconstruction module, a feature learning module, and an online update module. The detailed descriptions of each module are shown as follows.

Model overview
The multimodal reconstruction process primarily consists of three steps.
1) The packets are divided into flows based on the quintuple information.
2) The obtained flows are reconstructed into two feature modalities: one containing the original payload content and the other containing statistical information.
3) Both feature modalities are jointly inputted into the flow feature learning module.
The feature learning module consists of a Multilayer Perceptron (MLP), a CNN-BiGRU network, a multi-head self-attention mechanism, and Logistic chaos mapping. It mainly involves four steps.
1) The statistical features are fed into the MLP network to extract abstract high-level features.
2) The payload content is input into the CNN-BiGRU network to extract spatial feature vectors from the convolutional layer, which are then fused with the output of the BiGRU layer.
3) The multi-head self-attention mechanism is employed to capture the intrinsic relationships between temporal and spatial features.
4) Logistic chaos mapping is used to clarify and disentangle the fuzzy features, resulting in chaotic feature vectors.
The online classification and update module consists of three steps.
1) Utilize the parameters of the feature learning module to predict unknown samples.
2) Employ a weighted loss function to handle new classes.
3) Introduce a deviation correction layer to address the imbalance between new and old samples.
When a new class emerges, the model performs small-scale incremental training to update its parameters.

Notation:
$Q$: Query matrix of the self-attention mechanism
$V$: Value matrix of the self-attention mechanism
$F_c$: Output of the content feature learning module
$F_t$: Output of the text feature learning module
$F_m$: Output of the multi-head self-attention mechanism
$Y$: True (ground truth) label, presented as a one-hot encoded vector
$Y_p$: Predicted probability vector of being a specific application type

Traffic multimodal reconstruction module
The traffic multimodal reconstruction starts with traffic pre-processing, which includes three steps: traffic filtering, traffic segmentation, and traffic cleaning. The purpose of traffic filtering is to discard irrelevant packets such as damaged packets, repeated packets, and protocol packets that are not related to classification. Traffic segmentation combines packets with the same five-tuple information (source and destination IP and port can be interchangeable) into flow samples. Traffic cleaning process randomizes the MAC and IP addresses of the samples to avoid overfitting. The anonymous network flows processed by the above steps are used as traffic samples for the experiments in this paper.
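The segmentation step can be illustrated with a minimal sketch. It assumes packets have already been parsed into dictionaries with hypothetical field names (`src_ip`, `src_port`, `dst_ip`, `dst_port`, `proto`); the paper does not specify its data structures, so this only shows one way to make the five-tuple key independent of direction.

```python
from collections import defaultdict

def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-independent five-tuple key: the two endpoints are sorted,
    so A->B and B->A packets map to the same flow."""
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (a, b, proto)

def segment_flows(packets):
    """Group parsed packets (dicts) into bidirectional flows, keeping arrival order."""
    flows = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt["src_ip"], pkt["src_port"],
                       pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return dict(flows)

# Illustrative packets (made-up addresses); the first two belong to one connection.
packets = [
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.0.0.2", "dst_port": 443, "proto": "TCP"},
    {"src_ip": "10.0.0.2", "src_port": 443, "dst_ip": "10.0.0.1", "dst_port": 5000, "proto": "TCP"},
    {"src_ip": "10.0.0.3", "src_port": 6000, "dst_ip": "10.0.0.2", "dst_port": 443, "proto": "TCP"},
]
flows = segment_flows(packets)
```

Sorting the endpoints is a simple way to honor the note that source and destination are interchangeable; filtering and address randomization would be separate passes over the same structure.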
Deep learning models can handle multiple data modalities, each containing different information attributes. As shown in Fig. 2, multimodal inputs can be processed to obtain more comprehensive features, resulting in a stronger traffic representation. The multimodal input module includes three types of traffic input forms: traffic statistical features, intra-flow packet length sequences, and raw bytes of data packets.
(1) Traffic statistical characteristics
Traffic statistical features are extracted from bidirectional network traffic samples. When a communication is established, the transmission direction of the first data packet is considered upstream, and the opposite direction is considered downstream. As shown in Table 2, some statistical features, such as Flow IAT (inter-arrival time between packets in a flow) and Idle time (time interval between flows), are better able to characterize the pre-filtered traffic. For different types of traffic, such as email, chat, and VoIP, the interaction frequency varies, resulting in different Flow IAT values. The idle time also varies for different types of traffic. For example, in browsing traffic the browser typically requests multiple resources from servers and needs to send and receive packets frequently, resulting in shorter transmission intervals. For P2P traffic, in contrast, data transmission between nodes usually waits for responses from other nodes, resulting in longer transmission intervals. These statistical features reflect the characteristics of different types of traffic and can be used as the basis for traffic classification.
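As an illustration of how a Flow IAT feature can be derived, the following sketch computes simple inter-arrival statistics from the packet timestamps of one flow. The feature names (`iat_mean`, `iat_std`, `iat_min`, `iat_max`) are hypothetical and are not taken from Table 2.

```python
from statistics import mean, pstdev

def flow_iat_features(timestamps):
    """Inter-arrival time (IAT) statistics for the packets of a single flow."""
    ts = sorted(timestamps)
    iats = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not iats:  # a single-packet flow has no inter-arrival times
        return {"iat_mean": 0.0, "iat_std": 0.0, "iat_min": 0.0, "iat_max": 0.0}
    return {
        "iat_mean": mean(iats),
        "iat_std": pstdev(iats),  # population standard deviation
        "iat_min": min(iats),
        "iat_max": max(iats),
    }

# Illustrative timestamps in seconds (made-up values).
features = flow_iat_features([0.0, 1.0, 3.0, 6.0])
```

A chat-like flow with frequent small exchanges would yield a small `iat_mean`, while a flow that waits on peers would yield larger values.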
(2) Characteristics of packet length sequence within a flow
Python and its Dpkt library are used to parse the samples and obtain the packet length sequence of each network flow. To explore the impact of different packet sequence lengths on the classification results, packet sequences of different lengths within flows are selected for processing by the feature learning module. The packet length sequences of different types of traffic have certain regularities, which often reflect the characteristics of information interaction during communication. P2P traffic typically involves large file transfers that need to be split into smaller data blocks, and its packet length sequence typically follows a long-tailed distribution. Browsing and chat traffic is relatively small, and its packet length sequence is usually smooth and continuous. The packet length sequence of file transfer traffic follows a bimodal distribution, with a relatively large number of both large and small packets. Although some traffic features are hidden after obfuscation and encryption, this processing usually only adjusts the data packets to some extent and does not completely change the packet length sequence features. Therefore, packet length sequences can still be used for traffic classification. In a real network environment, the length of each flow is different, whereas the CNN-BiGRU model requires input data of a consistent size. We therefore intercept a fixed number of bytes. The maximum length of a packet is 1,500 bytes, as defined by the maximum transmission unit of the network, so we choose to intercept the first 1,024 bytes. The first 1,024 bytes are not only convenient for model input but also reflect the traffic characteristics more accurately. The earlier bytes, which are used to establish connections and exchange state information between the two communicating parties, contain more valuable information than the later bytes.
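The fixed-size input described above can be sketched as follows. The text only states that the first 1,024 bytes are intercepted; zero padding for shorter inputs and scaling bytes to [0, 1] are assumptions made here for illustration.

```python
def to_fixed_length(payload: bytes, length: int = 1024):
    """Keep the first `length` bytes, zero-pad shorter inputs (assumption),
    and scale each byte to [0, 1] (assumption) for the CNN-BiGRU input."""
    clipped = payload[:length]
    padded = clipped + b"\x00" * (length - len(clipped))
    return [byte / 255.0 for byte in padded]

# A short made-up payload is padded; a long one is truncated.
vec = to_fixed_length(b"\xff\x00\x80")
long_vec = to_fixed_length(bytes(2000))
```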
Network traffic has a hierarchical structure, including packet-level and traffic-level, which contains rich information. The traffic statistics, length sequences of packets in the flows, and raw bytes of packets represent different information properties, which are the three traffic modes. Fully mining and integrating the information in these three modes for classification tasks can make the features more comprehensive and representative. Moreover, single traffic information may be affected by network environment fluctuations, and multimodal traffic information can enhance the robustness of the model.

Feature engineering
The XGBoost algorithm is used to evaluate the importance of flow features in order to reduce the feature dimension [39]. The algorithm splits the dataset on different features of the input vector $X = \{X_1, X_2, X_3, \cdots, X_n\}$ and their values, and its structure is similar to a tree model. When looking for a feature and its corresponding value to divide the dataset, the goal is to maximize the gain of the objective function. The features corresponding to the nodes with the largest gain are sorted to obtain the important feature vector $X' = \{X_1, X_2, X_3, \cdots, X_m\}$, $m < n$, after dimensionality reduction. The gain function can be formalized as

$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma \tag{3.1}$$

where $G_L^2/(H_L+\lambda)$ and $G_R^2/(H_R+\lambda)$ are the gain scores for the left and right subtrees, and $(G_L+G_R)^2/(H_L+H_R+\lambda)$ is the gain score of the undivided node. The symbol $\gamma$ represents the complexity cost. The gain function is obtained from the difference between the objective functions before and after the split, which can be denoted as

$$Gain = Object_{before} - Object_{after} \tag{3.2}$$

The objective function can be denoted as

$$Object = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(k)}\right) + \sum_{s=1}^{k} \Omega(f_s) \tag{3.3}$$

In Eq.(3.3), $\sum_{i=1}^{n} l(y_i, \hat{y}_i^{(k)})$ is the loss function, which indicates the difference between the actual value and the predicted one. $\sum_{s=1}^{k} \Omega(f_s)$ is the regularization function that controls the complexity as a penalty term. $y_i$ is the true value, and $\hat{y}_i^{(k)}$ is the predicted value after the k-th tree. Let the objective function for training the k-th tree be $Object_k$. The parameters of the first $k-1$ trees are regarded as known constants, collected in the constant term $C$, and the objective function can be formalized as

$$Object_k = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(k-1)} + f_k(x_i)\right) + \Omega(f_k) + C \tag{3.4}$$

where $f_k(x_i)$ is the prediction result of the i-th sample in the k-th tree. The Taylor series is used to approximately expand $Object_k$, and the calculation process can be represented as

$$Object_k \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(k-1)}\right) + g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \right] + \Omega(f_k) + C \tag{3.5}$$

In Eq.(3.5), $g_i$ is the first derivative of $l(y_i, \hat{y}_i^{(k-1)})$ with respect to $\hat{y}_i^{(k-1)}$, and $h_i$ is the second derivative of $l(y_i, \hat{y}_i^{(k-1)})$ with respect to $\hat{y}_i^{(k-1)}$. The prediction for the i-th sample in the k-th tree is $f_k(x_i) = \omega_{q(x_i)}$, where $\omega$ is a vector that stores the values of the leaf nodes and $q(x_i)$ is the index of the leaf node to which the sample $x_i$ belongs. The set $I_j$ is defined as the samples that fall on node $j$, and the penalty term can be formalized as

$$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2 \tag{3.6}$$

In Eq.(3.6), $T$ is the number of leaf nodes, and $\gamma$ and $\lambda$ are the control weights. After dropping the constant terms, the objective function can be expressed as

$$Object_k = \sum_{j=1}^{T} \left[ G_j \omega_j + \frac{1}{2} (H_j + \lambda) \omega_j^2 \right] + \gamma T \tag{3.7}$$

where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ are known constants, so the objective function is essentially a one-dimensional quadratic function of each $\omega_j$. When $\omega_j = -G_j/(H_j + \lambda)$, the objective function takes its minimum value, which can be formalized as

$$Object_k^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{3.8}$$

As shown in Table 3, some of the highly correlated features are listed, and the importance of the traffic statistical features is ranked according to the node gain value.
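The split gain used for feature ranking can be checked numerically with a small sketch. It assumes the standard XGBoost form of the gain, half the sum of the two child scores minus the undivided score, minus γ, together with the optimal leaf weight ω_j = −G_j/(H_j + λ) stated above. The numeric inputs are made up for illustration.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight: omega = -G / (H + lambda)."""
    return -G / (H + lam)

def leaf_score(G, H, lam):
    """Score term G^2 / (H + lambda) of a single node."""
    return G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting one node into left/right children (standard XGBoost form)."""
    children = leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
    parent = leaf_score(G_L + G_R, H_L + H_R, lam)
    return 0.5 * (children - parent) - gamma

# Made-up gradient statistics for one candidate split.
gain = split_gain(G_L=-4.0, H_L=3.0, G_R=6.0, H_R=5.0, lam=1.0, gamma=0.5)
w_left = leaf_weight(-4.0, 3.0, 1.0)
```

Here the child scores are 4 and 6 and the parent score is 4/9, so the split is worthwhile only because the improvement exceeds the complexity cost γ.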

Statistical Features and packet length Sequence learning
MLP [40] is a typical machine learning model with strong nonlinear representation ability, which is widely used in classification tasks. Each layer in an MLP is a linear mapping whose output combines the product of the input and the weight with a bias (offset) term, followed by a nonlinear activation. Multiple such layers are combined to solve nonlinear problems. This is equivalent to combining

$$f_i = g(W_i f_{i-1} + b_i) \tag{3.9}$$

where $W_i$, $b_i$, and $f_i$ are respectively the weight matrix, bias vector and output vector of the i-th layer, and $g(\cdot)$ is the nonlinear activation function. Fig.3 shows the packet length sequence and statistical feature learning process.

Figure 3. Packet length sequence and statistical feature learning process.

The spatiotemporal feature learning
The spatiotemporal feature learning part consists of a CNN, a self-attention mechanism and a Bi-GRU [41]. It uses CONV layers to extract the spatial relationship features of the original bytes $X_c$. Each CONV layer consists of a convolutional layer, a pooling layer and a ReLU activation function, and different CONV layers use different convolution kernel sizes. In the calculation of a CONV layer, $f_{conv_i}$ is the convolutional layer with kernel size $i$, $g_{pool}$ is the pooling layer, and $g_r$ is the nonlinear activation function. A self-attention mechanism layer is added after the CONV layer to obtain the intrinsic relationships of the features extracted by the CONV layer and improve the feature richness. After the self-attention mechanism layer, in order to obtain the time series features of the network flow, the BiGRU network is used to extract the features of $X_c$. In the calculation of the BiGRU layer, $h_{t-1}$ is the memory state at the previous time $t-1$, $X_c$ is the feature vector at the current time $t$, and $X_{ci}$ is the input vector at the current time $t$. Each element of the output vectors of the reset gate neuron $r_t$ and the input neuron $z_t$ is between 0 and 1 and is used to control the amount of information. The activation function of the memory gate neuron is tanh, so the elements of its output vector are all between -1 and 1. The update computation of the reset gate neuron $r_t$, the input neuron $z_t$, and the memory gate neuron $\tilde{h}_t$ can be formalized as

$$r_t = \sigma(W_r \cdot [h_{t-1}, X_{ci}] + b_r) \tag{3.12}$$

$$z_t = \sigma(W_z \cdot [h_{t-1}, X_{ci}] + b_z) \tag{3.13}$$

$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, X_{ci}] + b_h) \tag{3.14}$$

where $\sigma$ represents the nonlinear activation function sigmoid, $[\cdot,\cdot]$ denotes concatenation, $\odot$ denotes the element-wise product, and $W_r$, $b_r$, $W_z$, $b_z$, $W_h$, $b_h$ are the weight matrix and bias vector parameters of the reset gate, input gate and memory gate neurons, respectively.
After the state value of each gate is obtained, the output value $h_t$ of this unit is computed by combining the previous state $h_{t-1}$ and the memory gate state $\tilde{h}_t$, weighted by the input neuron $z_t$. As shown in Fig.4, the vector $X_c$ is input into this module. Spatial features are extracted by multiple convolution layers with different convolution kernel sizes, time series features are extracted by the BiGRU layers, and finally the spatiotemporal feature $F_c$ is obtained.
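To make the gate mechanics concrete, the sketch below runs a scalar GRU forwards and backwards over a short sequence and pairs the two states per time step. It is a toy version under stated assumptions: one scalar weight per term instead of matrices, a zero initial state, and the convention h_t = (1 − z_t)·h_{t−1} + z_t·h̃_t, which is one of the two common GRU conventions and is not confirmed by the text. All parameter names in the dictionary are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, p):
    """One scalar GRU step: reset gate, input (update) gate, tanh candidate state."""
    r_t = sigmoid(p["w_rh"] * h_prev + p["w_rx"] * x_t + p["b_r"])
    z_t = sigmoid(p["w_zh"] * h_prev + p["w_zx"] * x_t + p["b_z"])
    h_cand = math.tanh(p["w_hh"] * (r_t * h_prev) + p["w_hx"] * x_t + p["b_h"])
    # Assumed convention: z_t weights the new candidate state.
    return (1.0 - z_t) * h_prev + z_t * h_cand

def bigru(sequence, p_fwd, p_bwd):
    """Run the sequence in both directions and pair the states per time step."""
    h, forward = 0.0, []
    for x in sequence:
        h = gru_step(x, h, p_fwd)
        forward.append(h)
    h, backward = 0.0, []
    for x in reversed(sequence):
        h = gru_step(x, h, p_bwd)
        backward.append(h)
    backward.reverse()
    return list(zip(forward, backward))

# Made-up parameters and inputs, for illustration only.
params = {"w_rh": 0.5, "w_rx": 1.0, "b_r": 0.0,
          "w_zh": 0.5, "w_zx": 1.0, "b_z": 0.0,
          "w_hh": 1.0, "w_hx": 1.0, "b_h": 0.0}
states = bigru([0.1, 0.9, 0.4], params, params)
```

Because each new state is a convex combination of the previous state and a tanh output, every state stays strictly inside (-1, 1), which matches the value ranges described for the gates.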

The Multi-head self-attention mechanism
The self-attention mechanism calculates the similarity between feature vectors and learns the relationships among features. The multi-head self-attention mechanism [35] is an extension of the self-attention mechanism. The difference between them is that self-attention performs one linear transformation on Q, K, and V, while multi-head self-attention performs multiple linear transformations on Q, K, and V. The attention value is calculated separately for each linear transformation, and the results are concatenated. This can effectively improve the fitting ability of the model. As shown in Fig.5, the $Attention_i$ values between features are computed in parallel, and the final attention matrix $F_m(Q, K, V)$ is obtained for detection and classification. $F_m(Q, K, V)$ can be denoted as

$$F_m(Q, K, V) = \{Attention_1, Attention_2, \cdots, Attention_H\} \tag{3.16}$$

where Q, K, and V represent the query matrix, key matrix, and value matrix, respectively. $Attention_i$ is the attention value calculated by the i-th linear transformation of Q, K, and V. The calculation process of $Attention_i$ can be expressed as

$$Attention_i = \mathrm{softmax}\left(\frac{(Q W_{qi})(K W_{ki})^{T}}{\sqrt{d_k}}\right) V W_{vi} \tag{3.17}$$

where $W_{qi}$, $W_{ki}$, $W_{vi}$ are the weight matrices of the i-th linear transformation and $d_k$ is the dimension of the key vectors. In our model, the number of heads $H$ is 7, and the output of the multi-head attention mechanism is denoted as $F_m$. It will be used in the subsequent feature fusion module.
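The per-head computation and the concatenation across heads can be sketched in plain Python. This assumes the standard scaled dot-product attention, including the division by the square root of the key dimension, with Q, K and V all derived from the same input X (self-attention). The tiny identity and zero matrices are illustrative and have nothing to do with the trained weights.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(X, heads):
    """Concatenate the outputs of all heads along the feature dimension."""
    outputs = [attention_head(X, *h) for h in heads]
    return [sum((out[i] for out in outputs), []) for i in range(len(X))]

I2 = [[1.0, 0.0], [0.0, 1.0]]   # identity projection
Z2 = [[0.0, 0.0], [0.0, 0.0]]   # zero projection -> uniform attention weights
X = [[1.0, 0.0], [0.0, 1.0]]    # two feature vectors
F_m = multi_head(X, [(I2, I2, I2), (Z2, I2, I2)])
```

With the zero query projection every score is equal, so the second head simply averages the value vectors; the first head weights each position more strongly towards itself.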

The Logistic Chaos Map
Chaos arises from nonlinear dynamical systems. A dynamical system describes a time-varying process, and a chaotic one has a very sensitive dependence on the initial value. The process is deterministic, quasi-random, aperiodic, and non-convergent. The mathematical expression of a one-dimensional logistic chaotic map is a difference equation that was first used to describe changes in quantity. Later, as its characteristics met the requirements of a serial cipher, it became widely used in the field of secure communication. The calculation process can be formalized as

$$x_{n+1} = \mu x_n (1 - x_n) \tag{3.18}$$

where $x_n \in (0, 1)$ is the value after $n$ iterations and $\mu \in (0, 4]$ is the mapping parameter. As shown in Fig.6, when the mapping parameter $\mu$ is set to 4, the value of the feature vector after chaotic mapping is locally amplified [42]. The output values tend towards a certain class, depending on the difference in the initial values. The value of the multidimensional feature vector indicates the degree of difference. This makes the mapped feature vectors easier to distinguish when the values of feature vectors with multiple dimensions differ only slightly. We only focus on the features with blurred edges and do not consider deterministic features after chaotic mapping. The fuzzy features are magnified to different degrees according to their initial values. The chaotic map feature vector and the original multi-dimensional feature vector are combined as the classification basis.

Figure 6. Logistic chaotic map feature map.
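The amplification effect can be illustrated by iterating the one-dimensional logistic map x ← μ·x·(1 − x) with μ = 4 on two nearby values. The number of iterations and the assumption that features are already scaled into (0, 1) are choices made for this sketch; the text does not state how many iterations the model applies.

```python
def logistic_map(x, mu=4.0, steps=1):
    """Iterate the logistic map x <- mu * x * (1 - x) for `steps` steps."""
    for _ in range(steps):
        x = mu * x * (1.0 - x)
    return x

def amplify(features, mu=4.0, steps=1):
    """Apply the map element-wise to a feature vector with values in (0, 1)."""
    return [logistic_map(x, mu, steps) for x in features]

# Two made-up feature values that differ by only 0.001.
a, b = 0.300, 0.301
gap_before = abs(a - b)
gap_after = abs(logistic_map(a, steps=5) - logistic_map(b, steps=5))
```

After five iterations the two values are separated by considerably more than their initial difference, which is the sensitivity to initial values that the module exploits for features that are hard to tell apart.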

Feature fusion and classification
The feature fusion process is divided into two parts. The first part is self-attention feature fusion. To learn the relationships between features at different locations of the network flow more comprehensively, a multi-head self-attention mechanism is used in the overall model. The query matrix Q and key matrix K are used to generate the distribution of attention weights, and the value matrix V is used to obtain the selected information. In the attention feature fusion, $W_o$ and $b_o$ are the weight and bias values of the feature fusion part, and $F_m$ is the output feature vector of the multi-head self-attention mechanism. The second part is multimodal feature fusion. We concatenate the high-level statistical features extracted by the MLP and the features fused by self-attention [43]. The fused features serve as the final basis for classification and are fed into the Softmax layer, which outputs the predicted probability vector. The difference between the true class label and the predicted result is taken as the loss value during backpropagation. To handle the class imbalance problem in darknet traffic, we adopt a focal loss [44] function.
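The effect of the focal loss on imbalanced classes can be sketched for a single sample. This uses the commonly cited form FL = −α(1 − p_t)^γ log(p_t), where p_t is the Softmax probability of the true class. The values γ = 2 and α = 1 are assumptions for illustration, since the text does not report the hyperparameters used.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample; gamma and alpha are assumed values."""
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# Made-up logits: a confidently correct sample and a misclassified one.
easy = focal_loss([4.0, 0.0, 0.0], target=0)
hard = focal_loss([0.0, 4.0, 0.0], target=0)
```

Well-classified samples are down-weighted by the factor (1 − p_t)^γ, so rare or difficult classes contribute relatively more to the gradient; with γ = 0 the expression reduces to the ordinary cross-entropy.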

Online classification and update module
The module continuously optimizes the parameters through the true label $Y$ and the predicted label $Y_p$. The online update module handles two cases. For samples of old classes, we use the parameters fitted by the feature learning module to make predictions. After samples of a new category appear, small-scale inheritance training is performed on the original model. When learning new types of data incrementally, most machine learning techniques encounter the catastrophic forgetting problem, which leads to a sharp decline in performance on old classes. Therefore, a weighted loss function $L$ is used to retain the training memory. $L$ is a weighted combination of two loss functions, $L_d$ and $L_c$, which measure the degree of difference from the original model and from the true label, respectively. In the calculation of $L_d$, $H$ represents the ability to learn parameters from the old class model. We set an appropriate $H$ value to preserve the model parameters learned for the old classes as much as possible. $X_n \cup X_m$ represents the collection of old and new class samples, $\hat{o}_k(x)$ is the prediction output vector, and $o_k(x)$ is the output vector of the original model. $L_c$ measures the difference between the predicted vector and the true label and is used for all categories. In its calculation, $p_k(x)$ is the true label of the sample, and $\hat{p}_k(x)$ is the predicted output for the k-th class through the Softmax layer. A model trained with the inheritance loss function is still inclined to classify samples into the new categories. The bias correction layer makes the data more evenly distributed in the network, thereby reducing the dependence on the traffic characteristics of the new categories and improving the generalization ability of the model.
The bias correction layer is defined in Eq.(3.24), where w and v are the trainable parameters of the bias correction layer, and o_k is the output feature vector of the k-th class. The output feature vector of the old classes is retained, and the feature vector of the new classes is corrected by the bias correction layer to obtain the final classification result.
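The behaviour described for the bias correction layer can be sketched as below. Because the equation itself is not reproduced in this text, the sketch assumes a linear correction w * o_k + v applied only to new-class outputs, which is a common form for such layers; the function name `bias_correct` and the example values are hypothetical.

```python
def bias_correct(logits, n_old, w=1.0, v=0.0):
    """Keep old-class outputs and rescale new-class outputs (assumed linear form).

    logits: per-class outputs o_k, with old classes first and new classes last.
    n_old:  number of old classes.
    w, v:   parameters of the bias correction layer.
    """
    return [o if k < n_old else w * o + v for k, o in enumerate(logits)]


# Two old classes and one new class whose raw score is inflated.
raw = [1.0, 2.0, 3.5]
corrected = bias_correct(raw, n_old=2, w=0.5, v=0.0)  # [1.0, 2.0, 1.75]
```

In this example the uncorrected output would select the new class, while the corrected output selects old class 1, which illustrates how the layer counteracts the bias toward new categories.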
The new model is initialized with the trained parameters of the old model, and the new class data set and a small part of the old data set are used for inheritance training [45]. The last pre-trained layer is truncated and replaced with the classification layer with the new class output. In order to ensure the weight parameters learned by the old model on the previous categories are not easily forgotten, a learning rate that is ten times smaller than the original is used to update the parameters during the inheritance training process.

Experimental Results and Analysis
The public CICDarknet2020 [46] and Darknet2020 [6] datasets are used in this paper to evaluate the model performance. The CICDarknet2020 dataset consists of VPN and Tor network traffic, where the VPN traffic comes from the ISCXVPN2016 [47] dataset and the Tor traffic comes from the ISCXTor2016 [50] dataset. The network flow samples include normal and darknet network traffic, with 10248 normal network flows and 21041 darknet network flows. The user behavior traffic in CICDarknet2020 can be divided into 8 categories in total (called Dataset 1 in this paper). As shown in Table 4, the dataset includes Audio, Browsing, Chat, Email, P2P, File Transfer, Video and VoIP.

Experimental description
As shown in Table 5, the experiments were run in the following environment: Windows 10 (64-bit operating system, x64-based processor), Intel(R) Core(TM) i5-9300HF CPU @ 2.40GHz quad-core processor, 16GB RAM, and an Nvidia GeForce GTX1650 graphics card. The PyTorch deep learning library, as well as Wireshark, Anaconda, and PyCharm, are used in the experiments.

Evaluation indicators
In this paper, Accuracy, Precision, Recall, F1-Score, and the confusion matrix are used to evaluate the classification models [51]. Accuracy describes the overall performance of the classifier, and precision evaluates the classification effect on each category. The F1-Score, the harmonic mean of precision and recall, summarizes the per-class performance of the classifier in a single value. The confusion matrix can be used to observe the classification of each category in detail. TP means that a positive sample is correctly identified as a positive sample. TN means that a negative sample is correctly identified as a negative sample. FP means that a negative sample is misidentified as a positive sample. FN means that a positive sample is misidentified as a negative sample.
We used feature engineering to rank the importance of the temporal statistical features of network flows and added the features to the input one at a time in order of importance. The goal was to achieve the best accuracy with the smallest number of features. As shown in Fig.7, the model achieved high accuracy and tended to stabilize with the top 15 most important features. Therefore, we chose N=15 for the experiments.
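The evaluation indicators defined in terms of TP, TN, FP and FN can be computed directly from a confusion matrix. The following is a small sketch using the standard definitions; the function names and the example matrix are illustrative and not taken from the paper.

```python
def accuracy(confusion):
    """Overall accuracy: correctly classified samples over all samples."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


def per_class_metrics(confusion, k):
    """Precision, recall and F1 of class k.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                      # class k predicted as other
    fp = sum(row[k] for row in confusion) - tp       # other predicted as class k
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative two-class confusion matrix (rows: true class, columns: predicted).
example = [[8, 2],
           [1, 9]]
acc = accuracy(example)
precision0, recall0, f1_0 = per_class_metrics(example, 0)
```

For multi-class problems such as the eight user behavior categories, the per-class values are computed one class at a time by treating that class as positive and all others as negative.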

Packet sequence length
To make full use of the sequence characteristics of the packets in a flow, the length of the packet length sequence is gradually increased. As shown in Fig.7, the accuracy improves and tends to stabilize when the sequence length M reaches 80. Therefore, we chose M=80 for the follow-up experiments.
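Fixing the sequence length at M means that each flow's packet length sequence has to be brought to exactly M entries. A minimal sketch of one way to do this is given below; it assumes truncation of long sequences and zero padding of short ones, which the source does not specify, and the helper name `fix_length` is hypothetical.

```python
def fix_length(seq, length, pad_value=0):
    """Truncate seq to `length` items, or right-pad it with pad_value."""
    head = list(seq[:length])
    return head + [pad_value] * (length - len(head))


M = 80  # packet length sequence length chosen in the experiments

long_flow = fix_length(range(100), M)        # truncated to the first 80 values
short_flow = fix_length([60, 1500, 52], M)   # padded with zeros up to 80 values
```

The same helper can be applied to any fixed-length model input by changing `length`.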

Standard bytes
To explore the impact of standard bytes on the classification results, we truncated the flows to different numbers of bytes for the experiments. Fig.7 gives the classification accuracy under different standard byte lengths. As the number of network flow bytes increased from 100 to 700, the accuracy increased significantly, since additional standard bytes provide more characteristic information. The curve shows that the provided characteristic information tends to saturate once the standard bytes reach a certain length. Therefore, we chose L=784 for subsequent experiments, which provides enough information while keeping data processing efficient.
(1) Ablation experiments
The proposed method contains multiple parts that each contribute to classification performance. To evaluate the contribution of each part to the final classification performance of the model, an ablation experiment is performed.
The accuracy and F1-score are 86.46% and 86.51%, respectively, when the MLP is used to process the statistical features and the in-flow packet length sequence features. Because the packet length sequence and statistical features carry a limited amount of information, their representation ability is insufficient, and a method that uses only these two features is not accurate enough in classifying user behavior traffic.
When the original bytes of the packet are input into the CNN-BiGRU cascade network to extract the spatio-temporal fusion features, the model accuracy and F1-score are 88.27% and 87.98%, respectively, which are 1.81 and 1.47 percentage points higher than those obtained with the statistical features and in-flow packet length sequence features. This shows that, for user behavior traffic identification, the original bytes of the packet carry richer information, and the extracted spatial and temporal features are more representative of the traffic category.
After introducing the self-attention feature fusion module, the accuracy and F1-score of the model increase by 2.11 and 1.98 percentage points, reaching 90.38% and 89.96%, respectively. This shows that the module can analyze the relationships between features from multiple perspectives, extract feature correlations and assign more weight to important features. Highly correlated features provide more valuable information and thus improve the traffic classification ability of the model.
After adding statistical features and packet length sequence features, the classification accuracy and F1-score of the model reach 91.41% and 91.08%, respectively. It shows that multi-modal data input can provide feature information of different nature, enhance the characterization ability of features for traffic, and contribute to the fine-grained classification of anonymous network traffic.
(2) Comparison with the state-of-the-art methods
To reflect the advantages of the proposed method in terms of classification performance, this paper compares it with state-of-the-art methods. The comparison results for each category of applications are shown in Table 8. The experimental results are obtained by conducting traffic classification experiments on the CICDarknet2020 dataset. Fig.8 shows the accuracy of different classification models on different applications of the darknet traffic dataset. The proposed model achieves the best performance in most categories and is slightly lower than the DarknetSec [17] method in the Browsing and Video-Stream categories. Overall, our method achieves the best classification performance.
It can be observed that the proposed method achieves high accuracy, exceeding 95%, in classifying categories such as Audio, Chat, and File-Transfer. However, the classification performance is relatively lower, only slightly above 80%, for categories such as Browsing and Email. This is attributed to the larger number of samples available for categories like Audio compared with the smaller sample size for categories like Browsing. As a result, the model cannot learn the latter sufficiently. Addressing the classification of small-sample classes is therefore left as future work; it would help improve the overall performance and generalization capability of the model and deserves further attention and exploration.
In addition, the proposed method has an online update function. When the number of traffic categories increases, the combined loss function is used for small-scale inheritance training and the bias correction layer is used to correct the parameters. The Browsing and Email categories were taken as new traffic categories, and the proposed method was compared with the retraining method, the fine-tuning method and the combined loss function method. The retraining method retrains the model on all the old and new data samples to adapt the model parameters to all the old and new traffic categories. The fine-tuning method freezes the upper-layer parameters of the model and updates only the fully connected layer to adapt to the new categories. The combined loss function method uses a weighted sum of two loss functions: one measures the difference between the predicted label and the true label, and the other measures the difference between the predictions of the new model and those of the old model.
To better compare the classification performance of the different methods, all indicators are normalized. Figure 9 shows that although the retraining method performs best on the old and new categories and in overall accuracy, it consumes a lot of time and memory, which is not conducive to the timely update of model parameters. The fine-tuning method tends to identify predicted samples as new categories, resulting in severe parameter forgetting on old categories and poor overall classification performance. The combined loss function method has good classification performance on the new class samples, but its accuracy is low on the old class samples. The proposed model uses inheritance training with the combined loss function and corrects the parameters through the bias correction layer. This method consumes less time and memory and makes it easy to update the model parameters in time. It achieves good classification results on both new and old categories. Although its accuracy on the old categories is slightly lower than that of the retraining method, the parameter update efficiency is significantly improved, and the overall classification efficiency is greatly improved. Therefore, this method can add new category traffic to the classification model faster while maintaining accuracy.
We also conduct comparative experiments on the public dataset Darknet2020 to classify the user behavior traffic of three anonymous network tools (I2P, Freenet [52], and ZeroNet [53]). As shown in Table 9, the Darknet2020 dataset consists of traffic from four common darknets, Tor, I2P, Freenet, and ZeroNet, and contains 25 darknet user behaviors (called Dataset 2 in this paper). The comparison experimental results are shown in Fig.10. It can be observed that for all types of anonymous network traffic, the proposed method achieves higher accuracy and F1-score than the other anonymous traffic classification methods.
It is easy to see that the proposed method has the best performance.

Conclusions and Future Works
This paper proposed a general multimodal multitask DL architecture, ODTC, for multipurpose classification. It aims to provide an effective design basis for sophisticated darknet traffic management.
The MLP and CNN-BiGRU are adopted to process the features of the two modalities, respectively. The multimodal information, including flow statistical features and payloads, is used to reconstruct network traffic. Flows in different modalities have different characteristics, which greatly enriches the diversity of the features. Afterward, we perform chaotic mapping on the features to amplify the ambiguous ones, which enhances their representation ability. The multi-head attention mechanism extracts the intrinsic relationships between features and thereby improves classification performance. In addition, an incremental learning model is adopted, which has two advantages. On the one hand, inheritance learning uses a weighted loss function to improve training efficiency. On the other hand, a bias correction layer is introduced to address the imbalance between old and new class samples.

Category
Browsing         1281    1921    7972    4990
Chat              841     442    1531    1123
Email             553    1084     352    2980
File Transfer    1077    1791    2157    4897
P2P              1018    2910    1394       -
Audio            1567       -     820       -
Video            1703       -    1251    2397
VoIP              592       -       -       -
Total            8632    8148   15477   16387

The comparative experiments verify that the proposed method outperforms existing methods. In addition, the effectiveness of the proposed model on various other types of darknet traffic datasets is also verified. In the future, we will investigate the impact of class incremental learning on model performance.
Category incremental learning can help us adapt to ever-changing environments and data, enabling the model to quickly learn and adapt to new categories without retraining the entire model. With the development of technology and the accumulation of data, we need to be able to handle large-scale and constantly changing datasets. Understanding category incremental learning can help us improve existing incremental learning algorithms and develop more efficient and robust models.