CDBC: A novel data enhancement method based on improved between-class learning for darknet detection

Abstract: With the development of the Internet, people have paid more attention to privacy protection, and privacy protection technology is widely used. However, these technologies have also fostered the darknet, which has become a tool that criminals can exploit, especially in the fields of economic crime and military intelligence. Darknet detection is therefore becoming increasingly important; however, darknet traffic is seriously unbalanced, which makes detection difficult, and the accuracy of existing detection methods needs to be improved. To overcome these problems, we first propose a novel learning method, Chebyshev distance based Between-class learning (CDBC), which learns the spatial distribution of the darknet dataset and generates "gap data". The gap data can be adopted to optimize the distribution boundaries of the dataset. Second, a novel darknet traffic detection method is proposed. We test the proposed method on the ISCXTor 2016 dataset and the CIC-Darknet 2020 dataset, and the results show that CDBC can help more than 10 existing methods improve their accuracy, in some cases up to 99.99%. Compared with other sampling methods, CDBC can also help the classifiers achieve higher recall.


Introduction
With the development of the network, users' awareness of privacy protection has continuously improved, and many users choose anonymous communication tools to access the Internet to prevent their privacy from being compromised while surfing [1−3]. Anonymity services such as the second-generation onion router (Tor) [4−6], the invisible internet project (I2P) [7], Freenet [8,9] and ZeroNet [10] can provide a high degree of anonymity and have become important means of protecting privacy on the Internet. However, these tools can also shield illegal users, which brings difficulties to network supervision. For example, many illegal users rely on anonymous communication tools to conduct illegal transactions on the darknet. A darknet [11] is generally defined as a restricted-access network; access typically requires special settings, specific software, authorization, or non-standard protocols and ports. Nowadays, there are many types of darknet, and they have gradually become platforms for terrorism and crime [12]. From the perspective of network management, to monitor and even prevent possible illegal activities on the darknet, it is essential to detect the activities of users and necessary to improve the detection capability. However, in the existing darknet traffic datasets, the amount and variety of darknet traffic are scarce, and the detection accuracy is not high enough. To detect a small amount of darknet traffic and identify its type, we propose CDBC, and based on it, we propose a novel darknet traffic detection method.
The contributions of this paper are summarized as follows.
(1) To address the scarcity of darknet traffic, we treat darknet traffic as small-sample data and propose CDBC, which learns the spatial distribution of the darknet datasets and generates gap data around the small samples to reduce the impact of data imbalance.
(2) To the best of our knowledge, this is the first time that Between-class learning is adopted to solve multi-classification problems, and good results are achieved.
(3) The proposed method enhances the capability of darknet detection by combining CDBC with more than 10 classifiers. Experimental results show that the detection method based on CDBC and random forest achieves an accuracy of 99.99%.
The structure of the paper is arranged as follows. Section II introduces darknet detection and Between-class learning. Section III describes the proposed method in detail. Section IV presents and analyzes the experimental results. Finally, conclusions and prospects for the proposed method are given.

Darknet detection
Darknet detection can be regarded as a special encrypted traffic detection problem. This section introduces some research work related to darknet traffic detection.

The methods based on machine learning
In 2016, Draper-Gil et al. [13] proposed an encrypted traffic detection method based on time series analysis. The method adopts decision tree (DT) and K-nearest neighbor (KNN) classifiers to detect VPN traffic according to different traffic types, and the detection accuracy is 80%. In 2018, Montieri et al. [1] used machine learning methods such as naive Bayes (NB) and random forest (RF) to classify the Anon17 darknet dataset according to different anonymity tools (Tor, I2P and JonDonym), and the accuracy reached more than 75%. In 2020, Hu et al. [14] collected a real darknet dataset, including Tor, I2P, ZeroNet and Freenet, and conducted experiments based on feature selection and multiple classifiers. The detection accuracy for the types of darknet traffic is 96.9%, and the average detection accuracy for the application types is 91.6%. In 2021, Rawat et al. [15] applied the term frequency-inverse document frequency (TF-IDF) algorithm from the field of text data mining to the darknet traffic detection task, and then detected darknet traffic with the LightGBM algorithm, reaching an accuracy of more than 98%. In 2022, Abu et al. [16] proposed a machine learning based method for detecting darknet traffic and performed experiments on the CIC-Darknet 2020 dataset [17]. The authors merged the VPN and Tor classes, and the results showed an accuracy of 99.50%. However, the above detection accuracies still need to be improved, and these works did not pay attention to the influence of the dataset distribution.

The methods based on deep learning
Compared with traditional machine learning methods, methods based on deep learning can automatically learn features of the traffic, and detection methods based on deep learning have recently made progress. In 2019, Liu et al. [18] applied recurrent neural networks (RNN) to encrypted traffic detection and proposed FS-Net, an end-to-end classification method. By learning effective features and reconstructing the network, the method mines sequence features and enhances the feature learning ability. In 2020, Habibi et al. [19] proposed a method named DeepImage, which first selects features and generates two-dimensional grayscale images, and then uses two-dimensional convolutional neural networks (CNN) to detect darknet traffic; the experimental results showed an accuracy of 86%. In the same year, Lotfollahi et al. [19] proposed a method called Deep Packet, an automated framework for network traffic feature extraction based on one-dimensional CNN and stacked autoencoders (SAE). Its detection accuracy for darknet traffic reaches 98%, and its accuracy for darknet application types reaches 93%. In 2020, Wang et al. [20] proposed an end-to-end method named App-Net, which learns the joint features of traffic and applications by combining RNN and CNN, so that annotations for flow sequences and specific applications are implemented simultaneously. In 2021, Sarwar et al. [21] proposed a novel darknet detection method based on improved CNN-LSTM and CNN-GRU models; the results showed an accuracy of 96%. Clearly, the accuracy of the methods based on deep learning is not high enough, and these works did not consider the spatial distribution of small samples in the dataset, which affects the detection.

Between-class learning
The idea of this learning method mainly comes from image classification and recognition, sound recognition, etc. [22−24]. Initially, Between-class learning was adopted in sound recognition: it mixes data of two different classes in random proportions to generate new data, which is then treated as training data in the experiment. Tokozume et al. [25] proposed a new deep sound recognition network (EnvNet-v2) based on Between-class learning. In their experiment, the authors mixed two different sounds to create new sounds, used the synthetic dataset to train the model and had it output the mixing ratio. Gao et al. [26] improved Between-class learning and proposed a novel anomaly detection method named EBC learning. This method calculates the Euclidean distance before mixing, then mixes data with similar distances, and finally uses RF for detection. However, this method can only solve binary classification problems.

Proposed methodology
In this section, we introduce the proposed darknet detection method in detail. It consists of three parts: data preprocessing, CDBC and detection. The detection framework is shown in Figure 1.

Data preprocessing
In data preprocessing, vectorization, normalization and One-hot encoding are adopted to process the original dataset. Simultaneously, dimensionality reduction is performed on high-dimensional data, which removes redundant features and retains the most relevant ones; this improves detection accuracy and training efficiency. Since the dataset has non-numeric features, it needs to be vectorized, and we remove some features that cannot be processed. Additionally, IP addresses cannot be processed and calculated as numerical values, so we perform frequency encoding on them: the number of occurrences of an IP address is taken as its feature value. For non-numeric timestamp data, we replace each timestamp with its number of occurrences in a day or in an hour. For "inf" and "NaN" values in the dataset, we substitute the mean of the corresponding feature. We adopt min-max normalization to scale the feature values to [0, 1]:

x' = (x − x_min) / (x_max − x_min),  (1)

where x represents the original feature value, and x_max and x_min represent the maximum and minimum values of that feature. The experiment uses One-hot encoding to label the data.
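As a sketch only, the preprocessing steps above (IP frequency encoding, mean imputation of "inf"/"NaN" values, and min-max normalization as in Eq (1)) might be implemented as follows; the function name `preprocess` and the column names "Src IP" and "Dst IP" are illustrative assumptions, not the authors' actual code or the exact dataset schema.

```python
import numpy as np
import pandas as pd

def preprocess(df, feature_cols, ip_cols=("Src IP", "Dst IP")):
    """Frequency-encode IP columns, impute inf/NaN with the feature mean,
    then min-max normalize the numeric features to [0, 1] as in Eq (1)."""
    df = df.copy()
    for col in ip_cols:
        if col in df.columns:
            # Replace each address with its number of occurrences.
            df[col] = df[col].map(df[col].value_counts())
    df[feature_cols] = df[feature_cols].replace([np.inf, -np.inf], np.nan)
    df[feature_cols] = df[feature_cols].fillna(df[feature_cols].mean())
    x = df[feature_cols].to_numpy(dtype=float)
    lo, rng = x.min(axis=0), x.max(axis=0) - x.min(axis=0)
    rng[rng == 0] = 1.0  # guard against constant features
    df[feature_cols] = (x - lo) / rng  # Eq (1)
    return df
```

One-hot encoding of the labels is left to the training stage (e.g., via the classifier's label handling), since the sketch only covers the feature side.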

CDBC
The main idea of CDBC is to generate gap data around unbalanced traffic to strengthen the distribution boundaries between different types of traffic. It is important to stress that gap data is not any kind of real traffic, but data lying between darknet traffic and normal traffic, distributed exactly in between them, as shown in Figure 2. Compared with common methods, CDBC can optimize detection by focusing only on a small amount of traffic, so the algorithm has an obvious advantage. As shown in Figure 2, CDBC finds the k-nearest neighbors of the various types of traffic by calculating the Chebyshev distance and generates gap data between the neighbors. When the training traffic distribution is unbalanced, CDBC can significantly improve the ability of the classifier to identify small samples.
In the experiment, we adopt the Chebyshev distance as the metric in multi-classification problems, because it can highlight the difference between traffic samples. The calculation is as follows:

D_Chebyshev(x_i, x_j) = max_d |x_{i,d} − x_{j,d}|,  (2)

where x_{i,d} represents the d-th dimension of the features of sample x_i, and D_Chebyshev(x_i, x_j) is the maximum distance between traffic samples x_i and x_j over all feature dimensions. CDBC generates gap data by randomly mixing two different types of traffic:

x_gap = λ x_i + (1 − λ) x_j,  (3)

where λ is randomly generated and λ ∈ (0, 1). When CDBC is applied to the binary classification of darknet detection, there are two kinds of labels: "0" represents the non-darknet traffic label and "1" represents the darknet traffic label. One-hot encoding is used to label the gap data, and the label is determined by the distances of the gap data to the two mixed samples. For example, the labels of minority class samples and majority class samples are represented in One-hot form as [0, 1] and [1, 0] respectively, and the label of the gap data can be expressed as [1 − t, t], where

t = D_Chebyshev(x_gap, x_j) / (D_Chebyshev(x_gap, x_i) + D_Chebyshev(x_gap, x_j)).  (4)

Substituting Eqs (2) and (3), the expression simplifies to t = λ, so finally the label of x_gap can be represented as

G(x_gap) = [1 − λ, λ].  (5)
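A minimal sketch of the gap-data construction in Eqs (2)−(5), using the simplification t = λ noted above (the Chebyshev-distance ratio of a linearly mixed sample equals its mix weight); the function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def chebyshev(xi, xj):
    # Eq (2): maximum absolute difference over the feature dimensions.
    return float(np.max(np.abs(np.asarray(xi) - np.asarray(xj))))

def make_gap_sample(xi, xj, rng):
    """Mix a minority sample xi (one-hot label [0, 1]) with a majority
    sample xj (one-hot label [1, 0]) into one gap sample, per Eqs (3)-(5)."""
    lam = rng.uniform(0.0, 1.0)                                  # mix ratio
    x_gap = lam * np.asarray(xi) + (1.0 - lam) * np.asarray(xj)  # Eq (3)
    d_i, d_j = chebyshev(x_gap, xi), chebyshev(x_gap, xj)
    t = d_j / (d_i + d_j)  # Eq (4); equals lam for linearly mixed samples
    return x_gap, np.array([1.0 - t, t])                         # Eq (5)
```

The soft label [1 − t, t] interpolates between the two one-hot labels in proportion to how close the gap sample lies to each parent.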

CDBC to solve the binary classification tasks
When CDBC is applied to the classification in the binary classification scenario, its main steps are shown in Algorithm 1.

Algorithm 1 CDBC for binary classification
Input: training set D_train (majority class D_maj and minority class D_min), number of neighbors k, number of generation rounds times
Output: augmented dataset D_new
1. For each sample x_i in D_min do
2.   Calculate the k-nearest neighbors of x_i in D_train
3. End for
4. For each sample x_i in D_min do
5.   For each x_j which is a neighbor of x_i do
6.     Generate gap data and its label by Eqs (3) and (5), and add them to D_new
7.   End for
8. End for
9. Repeat steps 4−8 until gap data has been generated times times
10. Return D_new
As shown in Algorithm 1, the input dataset is D_train (the original dataset is divided into a training set and a testing set at a ratio of 7 : 3). The training set includes the majority class D_maj and the minority class D_min. k and times represent the number of selected nearest neighbors and the number of times gap data is generated.
First, we determine the k-nearest neighbors of x_i in D_train, where x_i belongs to D_min. The Chebyshev distance in Eq (2) is adopted to find the k-nearest neighbors, and the k neighbors are then traversed to determine their types.
Then, based on Eqs (3) and (5), the gap data and their labels are generated and added to the new dataset D_new. These steps are repeated until the end condition is reached.
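The steps of Algorithm 1 might be sketched as follows, under two simplifying assumptions: neighbors of each minority sample are searched only within the majority class, and labels use the mix ratio λ directly (which, as noted for Eq (5), coincides with the Chebyshev-distance ratio for linearly mixed samples). The name `cdbc_binary` is illustrative, not the authors' implementation.

```python
import numpy as np

def cdbc_binary(X_maj, X_min, k=5, times=1, seed=0):
    """Generate gap data around each minority sample (sketch of Algorithm 1).
    Returns gap features and soft labels [p_majority, p_minority]."""
    rng = np.random.default_rng(seed)
    # Pairwise Chebyshev distances (Eq (2)) from each minority sample
    # to every majority sample: shape (n_min, n_maj).
    d = np.max(np.abs(X_min[:, None, :] - X_maj[None, :, :]), axis=2)
    idx = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbors
    gap_X, gap_y = [], []
    for _ in range(times):              # repeat the generation 'times' rounds
        for i, xi in enumerate(X_min):
            for j in idx[i]:
                lam = rng.uniform(0.0, 1.0)
                gap_X.append(lam * xi + (1.0 - lam) * X_maj[j])  # Eq (3)
                gap_y.append([1.0 - lam, lam])                   # Eq (5)
    return np.array(gap_X), np.array(gap_y)
```

The gap samples and soft labels would then be appended to the training set before fitting a classifier.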

CDBC to solve the Multi-classification tasks
The idea of CDBC for multi-classification is the same as that of binary classification; the advantage is that multi-classification is more scalable and better matches darknet detection. In this section, we mainly introduce CDBC for multi-classification tasks; its main steps are shown in Algorithm 2.

Algorithm 2 CDBC for multi-classification
Input: training set D_train with (possibly multiple) majority classes and minority classes, number of neighbors k, number of generation rounds times
Output: augmented dataset D_new
1. For each sample x_i in the minority classes do
2.   Calculate the k-nearest neighbors of x_i in D_train
3. End for
4. For each sample x_i in the minority classes do
5.   For each x_j which is a neighbor of x_i do
6.     Generate gap data, and determine its label from the labels of x_i and x_j
7.   End for
8. End for
9. Repeat steps 4−8 until gap data has been generated times times
10. Return D_new
As can be seen, Algorithm 2 differs from Algorithm 1 in two respects. First, there can be multiple majority and minority classes in the input, and the division into majority and minority classes can be customized. Second, it is worth noting that when a sample and its k-nearest neighbors generate gap data, the label of the gap data is determined by both the labels of the neighbors and the label of the sample.
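The multi-class label determination described above might be sketched as follows, assuming the gap label generalizes the binary [1 − λ, λ] scheme: the weight on each class equals the mix weight of the sample carrying that class. This generalization is an illustrative assumption, not the authors' exact formula.

```python
import numpy as np

def gap_label(ci, cj, lam, n_classes):
    """Soft label for a gap sample x_gap = lam * x_i + (1 - lam) * x_j,
    where x_i has class index ci and x_j has class index cj."""
    y = np.zeros(n_classes)
    y[ci] += lam        # contribution of x_i's class
    y[cj] += 1.0 - lam  # contribution of x_j's class
    return y
```

When ci == cj the two contributions collapse onto one class, recovering an ordinary one-hot label.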

Experimental results and analysis
In this section, the experimental environment, datasets and evaluation metrics are introduced, and experiments are conducted to verify the effectiveness of the proposed method for detection.
ISCXTor 2016 dataset (DISCXTor-A and DISCXTor-B)
The ISCXTor 2016 dataset is a real traffic dataset recorded by the University of New Brunswick. This dataset includes two scenarios. Scenario A includes Tor traffic and non-Tor traffic. Scenario B includes 8 types of Tor traffic. The details of the datasets are shown in Figure 3 and Table 1.

CIC-Darknet 2020 dataset (DDarknet and DDarknet-tor)
The CIC-Darknet 2020 dataset is a public dataset of darknet traffic provided by the Canadian Institute for Cybersecurity. There are two layers in the dataset, the first layer (DDarknet) contains four types: Tor, Non-Tor, VPN and NonVPN, and the second layer (DDarknet-tor) contains 8 types which are shown in Table 1.
The DDarknet dataset contains more than 140,000 records, whose distribution is shown in Figure 3(c) (DDarknet) and Figure 3(d) (DDarknet-tor). Tor traffic accounts for less than 1%, which is extremely unbalanced. The specific numbers in the datasets are shown in Table 2.

Evaluation metrics
The experiments include binary classification and multi-classification tasks. The binary classification task distinguishes darknet traffic from non-darknet traffic. The multi-classification task classifies the traffic more finely, to facilitate the processing and analysis of traffic types. In binary classification, accuracy (ACC), precision, recall, false positive rate (FPR) and F1-score (F1) are adopted to evaluate the detection. In multi-classification, macro-averaged metrics are adopted. The calculations are as follows.

ACC indicates the proportion of correct predictions among all samples:

ACC = (TP + TN) / (TP + TN + FP + FN).  (6)

Precision indicates the proportion of samples predicted as "1" that are indeed "1":

Precision = TP / (TP + FP).  (7)

Recall indicates the proportion of samples actually labelled "1" that are correctly identified:

Recall = TP / (TP + FN).  (8)

FPR represents the proportion of negative samples wrongly predicted as positive among all negative samples:

FPR = FP / (FP + TN).  (9)

F1 is a composite indicator whose core idea is to close the gap between Precision and Recall while increasing both as much as possible:

F1 = 2 × Precision × Recall / (Precision + Recall).  (10)

Macro Precision is an evaluation metric for multi-classification problems and is the average of the per-class Precision values:

Macro_Precision = (1/n) Σ_{i=1}^{n} Precision_i.  (11)

Macro Recall is similar to Macro Precision and is used to evaluate multi-classification problems as the mean of the per-class Recalls:

Macro_Recall = (1/n) Σ_{i=1}^{n} Recall_i.  (12)

By the same principle, Macro F1 is used as a composite indicator for evaluating multi-classification problems:

Macro_F1 = (1/n) Σ_{i=1}^{n} F1_i.  (13)

Here, true positive (TP) is the number of correctly identified darknet traffic samples, true negative (TN) is the number of correctly identified normal traffic samples, false positive (FP) is the number of normal traffic samples incorrectly identified as darknet traffic, and false negative (FN) is the number of darknet traffic samples incorrectly identified as normal traffic.
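A small sketch computing the metrics defined above directly from confusion counts; the helper names are illustrative.

```python
import numpy as np

def binary_metrics(tp, tn, fp, fn):
    """Binary metrics from confusion counts, matching the definitions above."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, precision, recall, fpr, f1

def macro_average(per_class_values):
    # Macro metrics: unweighted mean of the per-class values.
    return float(np.mean(per_class_values))
```

The macro variants apply `macro_average` to the per-class Precision, Recall or F1 values.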

Experiment and analysis
Two groups of environments are set up in this experiment. The first group adopts CDBC, and the second group does not (without CDBC); 11 methods are tested in each group. We set k = 1 and generate gap data once in total (times = 1).

Test for binary-classification task
To explore the effect of CDBC on the darknet detection, a comparative experiment is conducted on the DISCXTor-A and DDarknet. Darknet traffic detection can be regarded as a binary classification task. The comparison results are shown in Table 3.
As can be seen from Table 3, the detection performance of the classifiers is better with CDBC. On DISCXTor-A, the results of 10 methods (all except NB) are improved; with the ensemble methods in particular, the accuracy is close to 100%. On DDarknet, most of the metrics are improved in the CDBC environment. The experimental results show that, in the binary classification task, detection with CDBC outperforms detection without it.

Test for multi-classification task
In this section, the experiments are conducted on multi-classification tasks, with the same environment settings as in the previous section. Considering the binary classification results, in which the ensemble learning methods outperform the single classifiers, only 5 ensemble methods are selected for the multi-classification task. The comparison results are shown in Table 4.
In the multi-classification tasks, the performance of all CDBC based methods is improved on DISCXTor-B and DDarknet-tor. In general, CDBC can effectively form a "boundary" between small samples and heterogeneous samples, which helps improve the classification ability of the classifiers. Taking RF as an example, when k and times are reasonably selected, Figure 4 shows the Recall on the four datasets.
As can be seen from Figure 4, after the data enhancement of small samples by CDBC, the Recall of small samples is significantly improved. On DISCXTor-A, DISCXTor-B and DDarknet, Recall is higher than without CDBC, because the distribution boundary between darknet and non-darknet traffic is strengthened after using CDBC. On DDarknet-tor, the Recall of Email improves from 0.2 to 1.0, while the Recall of P2P and FTP decreases slightly. Based on the above results, CDBC is helpful for the Recall of small samples and can effectively assist in improving the detection.

Comparing CDBC and other sampling methods
In this section, CDBC is compared with SMOTE_D [28] and Gaussian_SMOTE [29]. The results are shown in Table 5.
As can be seen from Table 5, CDBC performs better than SMOTE_D and Gaussian_SMOTE on DISCXTor-B and DDarknet-tor. Although the accuracy of Bagging is not high enough, the other classifiers perform better in the CDBC environment. Because the distribution of the two datasets is unbalanced, the gap data generated by CDBC can enhance the classification boundary, which improves the classification ability of the classifiers. The experiment is carried out on DISCXTor-B, with k ranging from 5 to 100 in steps of 5 and times = 2. The results are shown in Figure 5. As shown in Figure 5, as the value of k increases, the Accuracy, F1, etc. become lower. Only AdaBoost and XGBoost are little affected by the value of k; the performance of the other methods oscillates as k increases and generally declines, and when GBDT is used as the classifier, the increase of k has an obvious influence on detection. We attribute this to the fact that k represents the number of neighbors: as the neighbors increase, samples that are not on the class boundary are also treated as neighbors, and the gap data generated from them cannot enhance the edges of the spatial distribution well.
We set times to range from 1 to 5 with an interval of 1. Figure 6 shows the impact of times on detection.
As shown in Figure 6, the ordinate represents the prediction results of the different classifiers as times increases. When times = 1, the ordinate value is the average over k from 5 to 100 (in steps of 5). It can be seen that as times increases, the test results of the various classifiers decrease, with GBDT the most obvious. The reason is that as times increases, the amount of gap data grows, and with too much gap data the classifiers overfit. Therefore, generating too much gap data cannot reinforce the boundary and may even lower the detection accuracy; k and times need to be set appropriately. The principle is to generate a small amount of gap data, which achieves good results and reduces the training overhead.

Conclusions
This paper first proposes a Chebyshev distance based Between-class learning algorithm called CDBC. The method generates "gap data" by calculating the distances between heterogeneous traffic; the gap data enhance the boundary between small samples and other samples and optimize the classification performance of the classifiers. Second, the CDBC based darknet detection architecture is introduced, and we discuss data preprocessing, training and darknet detection. Third, CDBC is applied to two datasets, and the experiments test 11 kinds of classifiers in the CDBC and without-CDBC environments. The experimental results show that when CDBC is applied to the detection, the accuracy of the classifiers can be improved, with a best result of 99.99%; the CDBC based AdaBoost method performs best. In addition, CDBC is compared with existing sampling methods, and the results show that CDBC outperforms them. We also analyze the hyperparameters and conclude that the detection accuracy of the classifiers is significantly improved when only a small amount of gap data is generated. The proposed method can overcome the difficulties caused by the small number of samples and address the problem of low detection accuracy, providing a solution for cyberspace security researchers. Moreover, the sampling method (CDBC) can also be extended to other fields.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.