Detecting and Classifying Darknet Traffic Using Deep Network Chains


y: Ground-truth set of class labels
y_i: True label for the i-th training entity
ŷ_i: Detected label for the i-th training entity

approach to characterizing encrypted traffic into categories, such as browsing, streaming, file transfer, etc., is introduced based on the traffic type. Also, a novel approach is presented in [19] for classifying encrypted VPN Internet traffic and identifying applications by transforming basic flow data into an intuitive picture and then applying image classification techniques. A supervised learning approach that uses correlation-based feature selection and random forest algorithms to detect Tor traffic effectively is presented in [20]. Recently, approaches for detecting VPN and Tor applications together, as the real representatives of darknet traffic, have become an area of interest. In [21], a dataset that combines VPN and Tor traffic is presented, covering a wide range of captured applications and hidden services provided by darknets. Furthermore, the authors present a two-dimensional convolutional neural network (CNN) that utilizes feature selection techniques to characterize darknet traffic, including VPN and Tor applications, with high detection rates. This encourages further investigation into methods for effectively detecting and classifying darknet traffic. To this end, this paper presents a two-stage network chain methodology for detecting and classifying darknet traffic.
In the first stage, anonymized darknet traffic, including VPN and Tor traffic related to hidden services provided by darknets, is detected. In the second stage, traffic related to VPN and Tor services is classified into eight classes: 1) audio streaming, 2) browsing, 3) chat, 4) email, 5) file transfer, 6) P2P (peer-to-peer), 7) VOIP (Voice over Internet Protocol), and 8) video streaming. The approach in this paper deploys a chain of deep networks to detect and classify darknet traffic with high accuracy, enabling practitioners to combat alleged malicious activities and to detect such activities after outbreaks. The primary contributions of this paper are as follows:
-Identifying the capability of detecting encrypted traffic, covering VPN and Tor traffic separately and spanning a wide range of captured darknet activities;
-Demonstrating the effectiveness of the chain of deep networks in detecting and classifying darknet traffic while achieving higher accuracy measures;
-Presenting a deep chain classifier that can accommodate additional classifiers to further categorize specific activities.
The remainder of this paper is structured as follows. Section 2 presents the multilabel deep network chain classifier. The methodology for detecting and classifying darknet traffic is presented in Section 3. The application and evaluation results of the proposed methodology are presented in Section 4. Finally, the conclusions are discussed in Section 5.

Multilabel Deep Network Chain Classifier
Multilabel classification is a supervised learning problem in which one entity can be associated with multiple labels. This is in contrast to the traditional problem of single-label classification, in which each entity is associated with a single class label. Recently, classifier chain (CC) [22] methods have become a practical approach to multilabel learning problems. The advantage of CCs is that they retain the computational efficiency of binary relevance methods while taking label dependencies into account for classification. A CC trains a binary or multiclass classifier for each label in a prespecified order on the label set. The features used to induce each classifier are extended by the previous labels in the chain; that is, the labels are treated as additional feature attributes to model the conditional dependence between the label and its predecessors. This approach can be adopted for problems related to the detection and classification of entities, where the first model in the chain "detects," and then the next model "classifies." To that end, the CC approach is adopted here in a dedicated formulation that differs from those used in other applications. In this formulation, a label from a predecessor model is treated as an additional feature attribute; however, only the subset of entities of interest is passed along the chain to the next model to model the conditional dependence between the label and its predecessors. The supervised models considered in the CC are deep networks, in which the first deep network "detects." If the entity is of interest, it is passed to the following deep network in the chain, which "classifies" the entity. Fig. 1 presents the chain architecture with two deep network models. It should be noted that such an architecture can adopt additional models into the chain and can be further formulated based on the required application.
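The gating behavior described above can be illustrated with a minimal sketch. The function names (chain_predict, toy_detect, toy_classify) and the threshold rules are hypothetical stand-ins for the two deep networks, not the trained models of this work; the sketch only shows how the predecessor label is appended as a feature and how entities that are not of interest leave the chain.

```python
from typing import Callable, Optional, Sequence, Tuple

Features = Sequence[float]


def chain_predict(
    x: Features,
    detect: Callable[[Features], int],
    classify: Callable[[Features], str],
) -> Tuple[int, Optional[str]]:
    """Run a two-model chain: detect first, classify only entities of interest."""
    label1 = detect(x)
    if label1 == 0:
        # Not of interest: the entity is not passed along the chain.
        return label1, None
    # The predecessor's label is appended as an additional feature attribute.
    extended = list(x) + [float(label1)]
    return label1, classify(extended)


# Toy stand-ins for the two deep networks (hypothetical rules, not trained models).
def toy_detect(x: Features) -> int:
    return 1 if x[0] > 0.5 else 0


def toy_classify(x: Features) -> str:
    return "Chat" if x[1] < 0.5 else "Browsing"


print(chain_predict([0.9, 0.2], toy_detect, toy_classify))  # (1, 'Chat')
print(chain_predict([0.1, 0.2], toy_detect, toy_classify))  # (0, None)
```

Additional models could be chained in the same way by gating each one on the output of its predecessor.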
The main objectives of the methodology presented in this section are: 1) to detect encrypted traffic separately covering both VPN and Tor traffic and a wide range of captured darknet activities, 2) to present an effective chain of deep networks to detect and classify darknet traffic while achieving higher accuracy, and 3) to identify the ability of deep CCs to accommodate additional classifiers in the chain to categorize specific activities of interest.
The detection and classification of darknet traffic are achieved by constructing a chain of two deep network models. The first deep network model is a binary classifier that classifies entities as normal or suspicious based on their features. The second deep network takes the features of the suspicious entities and categorizes them into different groups based on the activity to which they are related. The methodology of detecting and classifying darknet traffic can be decomposed into three sequential steps: 1) data preprocessing, 2) deep network CC building, and 3) result evaluation. Fig. 2 presents the general layout of the methodology. To build a classifier that can detect darknet traffic, the data should include both normal and darknet traffic entities. As shown in Fig. 2, the data are preprocessed in the first step so that missing and invalid entities are amended or removed. This results in entities carrying the features of the traffic and two labels. The first label (Label1) is a binary class label that marks an entity as either normal or darknet traffic. If the entity is darknet traffic, a second multiclass label (Label2) gives the type of service. Once the data are preprocessed and include the required features and labels, a Deepnet that can detect darknet traffic entities based on their features is built in the second step. When an entity is detected as darknet traffic (Label1), the type of darknet service is classified in the second stage based on the features of the entity and Label1. It should be noted that only entities detected as darknet traffic are passed to the classification stage. In the last step, the built models (i.e., the detection and classification stages) are evaluated using metrics that describe the overall performance of the models.
The models can be continuously improved based on the evaluation results; parameters can be compared to choose the optimal settings. The details of the aforementioned steps are presented in the following subsections.

Data Preprocessing
A recent dataset that includes both VPN and Tor traffic is used to detect and classify darknet traffic efficiently. The Darknet dataset [21], which combines the ISCXVPN2016 [18] and ISCXTor2017 [23] datasets, is used in this study. The Darknet dataset includes 158,659 entities, of which 24,311 are darknet traffic and 134,348 are normal. Furthermore, darknet traffic entities are categorized into eight hidden services. Each entity's activity is represented by 83 features, including information such as the utilized protocol, total length of the bandwidth packet, bytes/packet flow per second, source IP/port, destination IP/port, timestamp, etc. A darknet traffic entity is erroneous traffic observed in the empty address space, that is, a collection of globally valid IP addresses that have not been allocated to any hosts or devices. In an ideal, secure network system, traffic is not anticipated to enter such a darknet IP space. The dataset also includes two class labels. The first class label consists of four categories: Non-Tor, Non-VPN, Tor, and VPN. The second class label consists of eight traffic activities. It should be noted that network traffic can be collected utilizing data feed platforms such as Apache Spark with Apache Kafka. To formulate the dataset for the construction of a CC that can detect darknet activity and then further classify the type of activity into one of the eight hidden services, each entity in the Darknet dataset is first labeled as "normal traffic" or "darknet traffic"; this will be referred to as the detection stage. Then, each entity labeled as "darknet traffic" is further classified into one of the eight hidden services; this will be referred to as the classification stage. Table 1 presents the number of normal and darknet traffic entities for the detection stage and the number of entities for each of the eight darknet hidden services for the classification stage in the Darknet dataset.
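The relabeling for the two stages can be sketched as follows. The mapping of Tor and VPN to "darknet traffic" and of Non-Tor and Non-VPN to "normal traffic" is an assumption inferred from the description above, and the function name derive_labels is hypothetical.

```python
from typing import Optional, Tuple

# Assumed mapping from the four-category source label to the binary detection label.
DARKNET_SOURCES = {"Tor", "VPN"}
NORMAL_SOURCES = {"Non-Tor", "Non-VPN"}


def derive_labels(source_label: str, activity_label: str) -> Tuple[str, Optional[str]]:
    """Return (Label1, Label2) for one entity.

    Label1 is the binary detection label; Label2 is the activity and is
    kept only for darknet entities, since only those reach the second stage.
    """
    if source_label in DARKNET_SOURCES:
        return "darknet traffic", activity_label
    if source_label in NORMAL_SOURCES:
        return "normal traffic", None
    raise ValueError(f"unexpected source label: {source_label!r}")


print(derive_labels("Tor", "Chat"))         # ('darknet traffic', 'Chat')
print(derive_labels("Non-VPN", "Browsing"))  # ('normal traffic', None)
```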
To detect darknet traffic in general, rather than traffic specific to known IP addresses, the features are reduced by eliminating flow label features, including the flow ID, timestamp, source and destination IP, and ports. This potentially enables us to detect and classify a broad range of darknet activities. The result of this data preprocessing stage is a dataset that includes 158,659 entities, each represented by 64 features, including two class labels for the detection and classification stages.
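A minimal sketch of this feature reduction is given below. The column names are hypothetical placeholders, since the exact field names of the Darknet dataset are not listed here; only the idea of dropping flow-identifying fields is taken from the text.

```python
from typing import Any, Dict

# Hypothetical column names for the flow label features that are removed.
FLOW_LABEL_COLUMNS = (
    "flow_id",
    "timestamp",
    "src_ip",
    "src_port",
    "dst_ip",
    "dst_port",
)


def drop_flow_labels(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the entity without flow-identifying features."""
    return {name: value for name, value in row.items() if name not in FLOW_LABEL_COLUMNS}


entity = {
    "flow_id": "10.0.0.1-10.0.0.2-443-51234-6",
    "timestamp": "2020-01-01 00:00:00",
    "src_ip": "10.0.0.1",
    "src_port": 51234,
    "dst_ip": "10.0.0.2",
    "dst_port": 443,
    "protocol": 6,
    "flow_duration": 1200,
}
cleaned = drop_flow_labels(entity)
print(sorted(cleaned))  # ['flow_duration', 'protocol']
```

Removing these fields keeps the models from memorizing specific addresses or capture times, which is the motivation stated above.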

Deep Network Classifier Chain Building
In the previous stage, the data were preprocessed for the construction of a CC that can detect darknet activity and further classify the type of activity into one of the eight hidden services. In this stage, the deep network CC is constructed by training two deep networks. The first deep network implements the detection stage. Accordingly, the first deep network is a binary classifier that detects darknet traffic. Entities that do not represent darknet activity are classified as normal activities and are not passed along the CC to the next classifier. However, entities detected as darknet traffic are of interest and are passed along the CC to the next classifier to classify the type of activity. Thus, in cases where a new entity contains darknet activity, the CC should be able to detect the entity as darknet activity and further classify it, based on its features, as Audio streaming, Browsing, Chat, Email, File Transfer, P2P, VOIP, or Video streaming.
It should be noted that the class label resulting from the first classifier is added as a feature attribute to the next classifier in the chain. To classify the type of activity, a second, multiclass deep network that implements the classification stage is trained. To build the two models of the CC, pre-trained deep networks [24] are adopted. The pre-trained deep networks are repurposed through their learned knowledge, which includes the layers, weights, and biases. Furthermore, models that achieve higher classification accuracy are fine-tuned to improve their accuracy and classify entities into the correct class labels. This approach is applied in both the detection stage and the classification stage. Each Deepnet is evaluated using performance metrics, and the Deepnet that produces the highest accuracy for each classifier in the CC is chosen. The parameters of the deep network with tuned values for the detection and classification stages are presented in Tables 2 and 3, respectively.
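How the training data for the second classifier might be assembled can be sketched as follows. This is an illustrative assumption rather than the procedure used in this work: it filters on the ground-truth Label1 and appends it as an extra feature, and the function name build_stage2_training_set is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def build_stage2_training_set(
    features: Sequence[Sequence[float]],
    label1: Sequence[int],
    label2: Sequence[Optional[str]],
    darknet: int = 1,
) -> Tuple[List[List[float]], List[Optional[str]]]:
    """Keep only darknet entities and append Label1 as an extra feature."""
    rows: List[List[float]] = []
    targets: List[Optional[str]] = []
    for x, l1, l2 in zip(features, label1, label2):
        if l1 != darknet:
            continue  # normal entities are not passed to the classification stage
        rows.append(list(x) + [float(l1)])
        targets.append(l2)
    return rows, targets


X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
rows, targets = build_stage2_training_set(X, [1, 0, 1], ["Chat", None, "Email"])
print(rows)     # [[0.1, 0.2, 1.0], [0.5, 0.6, 1.0]]
print(targets)  # ['Chat', 'Email']
```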

Performance Evaluation
To evaluate the deep network CC, each classifier is fine-tuned and assessed independently. The accuracy, precision, recall, and F-score metrics are used to assess the performance of the built classifiers. In addition, the receiver operating characteristic (ROC) curve, a performance indicator based on the entities' ground truth and detection probability, is utilized. The classifier's performance is quantified by the area under the ROC curve (AUC).
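For reference, the four per-classifier metrics can be computed from confusion counts as in the following sketch, which uses the standard definitions. The function name binary_metrics and the toy labels are illustrative only and are not results from this work.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> Dict[str, float]:
    """Accuracy, precision, recall, and F-score for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Toy example: 1 = darknet traffic, 0 = normal traffic.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(metrics)
```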
As mentioned previously, the second classifier in the chain (i.e., the classification stage) has eight class labels. The performance evaluation is computed for every class label, and the final result is the overall average across all classes.
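The per-class averaging can be sketched as below, under the assumption that the overall average is an unweighted (macro) average of one-vs-rest metrics. The function name macro_average and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def macro_average(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Unweighted average of per-class precision, recall, and F-score."""
    classes = sorted(set(y_true))
    totals = {"precision": 0.0, "recall": 0.0, "f_score": 0.0}
    for c in classes:
        # One-vs-rest counts for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        totals["precision"] += precision
        totals["recall"] += recall
        totals["f_score"] += f_score
    return {name: value / len(classes) for name, value in totals.items()}


truth = ["Chat", "Chat", "Email", "Email", "P2P", "P2P"]
predicted = ["Chat", "Email", "Email", "Email", "P2P", "Chat"]
averaged = macro_average(truth, predicted)
print(averaged)
```

Because every class contributes equally to a macro average, weak performance on a small class lowers the overall score more visibly than it would in a sample-weighted average.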
In CC problems, it is essential to include multiple and contrasting evaluation measures due to the additional degrees of freedom that the multilabel setting introduces [25]. To that end, both per-label evaluation and label set-based evaluation, which evaluates the label sets of the CC, are utilized in the experimental evaluation.
When a predicted set of class labels (ŷ) matches the ground-truth set of class labels (y) exactly (i.e., label set-based evaluation), this is considered an exact match; the corresponding measure is known as the "0/1 loss" measure. In this measure, any label set that is not detected and classified exactly is given a zero score. Here, N is the number of training entities, y_i is the true label for the i-th training entity, and ŷ_i is the detected label for the i-th training entity.
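A commonly used form of the exact-match (0/1) loss, written with the symbols N, y_i, and ŷ_i, is sketched below. The indicator notation is an assumption; it equals 1 when the two label sets are identical and 0 otherwise.

```latex
\[
\text{0/1 loss} = 1 - \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\hat{y}_i = y_i\right]
\]
```

Under this form, a lower value indicates that more label sets were detected and classified exactly.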
Furthermore, for CCs, the accuracy measure for a set of N entities can be computed [26]. In contrast with the accuracy measure (i.e., class label set-based evaluation), the F-measure is macro-averaged over the label-based evaluations for N test entities.
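Definitions of these two measures that are commonly used in the multilabel literature are sketched below. Whether [26] uses exactly these forms is an assumption. Here, y_i and ŷ_i are treated as label sets, L is the number of labels, and p_j and r_j denote the precision and recall for the j-th label (L, p_j, and r_j are symbols introduced for this sketch).

```latex
\[
\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|},
\qquad
F_{\text{macro}} = \frac{1}{L} \sum_{j=1}^{L} \frac{2\, p_j\, r_j}{p_j + r_j}
\]
```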

Experimental Results
The methodology presented in the previous section is applied to the Darknet dataset, which consists of 158,659 entities. The data are split into training and test data. It can be observed from Table 1 that the classification stage is a multiclass problem with imbalanced classes; therefore, the data are split such that class imbalances are accounted for and each split has a proportional number of entities per class. This ensures that the training and test data, which comprise 70% and 30% of the data, respectively, have similar class distributions. In the following subsections, the application and results of the methodology for the detection and classification stages are presented. Furthermore, the deep network CC is evaluated.
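A class-proportional 70/30 split of this kind can be sketched as follows. The per-class shuffling, the fixed seed, and the function name stratified_split are assumptions made for illustration; the exact splitting procedure is not specified here.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split entity indices so each class keeps roughly the same train/test ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class, key=str):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Toy example with an imbalanced binary label.
toy_labels = ["normal"] * 80 + ["darknet"] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))  # 70 30
```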

Detection Stage
The trained deep networks presented in Section 3.2 were applied to the training data portion of the Darknet dataset. Among 200 pre-trained deep networks, the deep network parameters with the tuned values that generate the highest evaluation results are presented in Table 2. The ROC curve plots the true positive rate (TPR) versus the false positive rate (FPR) at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both the true positives (TP) and the false positives (FP). The results of the per-label evaluations and the ROC curve for the detection stage deep network are presented in Table 4 and Fig. 3, respectively. It can be observed in the evaluation results that the deep network can detect darknet traffic entities in the detection stage with an accuracy of 96.8% and that most of the misdetected entities were normal activity entities detected as darknet traffic. The AUC measures the probability that the model ranks a random positive entity more highly than a random negative entity. The AUC-ROC is close to one, which indicates the high performance of the built Deepnet model in the detection stage. In cases in which it is desired only to detect darknet activity, completing the detection stage is sufficient to accomplish this task.
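The ranking interpretation of the AUC can be made concrete with a small sketch that compares every positive entity with every negative entity. The function name auc_by_ranking and the toy scores are illustrative; ties are counted as half a win, which is the usual convention.

```python
from typing import Sequence


def auc_by_ranking(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy detection scores: 1 = darknet traffic, 0 = normal traffic.
auc = auc_by_ranking([1, 1, 0, 0, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.6, 0.2])
print(round(auc, 3))  # 0.889
```

The pairwise comparison is quadratic in the number of entities, so it is suitable only for illustration; rank-based implementations are used in practice.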

Classification Stage
To further classify the detected darknet activity, entities classified as darknet entities are passed along the CC to the classification stage. As in the detection stage, numerous pre-trained deep networks were applied to darknet-related entities in the Darknet dataset. Furthermore, the deep network's parameters were tuned, and the values that yield the highest evaluation results are presented in Table 3. The results of the per-label evaluations for entities that have been passed to the classification stage deep network are presented in Table 4. It should be noted that the evaluation metrics in Table 4 are based on the overall average for all eight class labels. The accuracy and recall metrics were relatively high. However, the precision value degraded in cases where the model has to deal with several unbalanced class labels; thus, accuracy alone is not a reliable indicator. In situations characterized by poor performance in either precision or recall, the F-measure is a more valid indication of performance. The classification deep network model achieved a score of 89% in the classification stage. This is a consequence of the 75.8% and 77% precision scores for detecting video-streaming and P2P activities, respectively, as a few video-streaming and P2P entities were classified as audio streaming. One of the reasons for this misclassification is that audio and video streaming commonly use shared real-time streaming protocols, such as the Real-time Transport Protocol (RTP) [27]. Accordingly, the features of those class labels tend to have similar observations, particularly when the same RTP is utilized. This only affects the classification of an entity; the entity was still detected as a darknet activity in the detection stage.
In cases in which it is desired to accurately distinguish between P2P, Audio streaming, and Video streaming entities, an additional classifier in the CC can be added; however, this is beyond the scope of this work.

Deep Network Chain Evaluation
To evaluate the performance of the deep network chain (i.e., label set-based evaluation), the "harsh" loss measure is utilized. This loss measure only considers whether the detection and classification labels match exactly. For example, when a darknet activity is accurately detected but misclassified, the loss measure considers this a mismatch and gives it a zero score. Accordingly, this penalizes the performance evaluation. Furthermore, the F-measure macro-averaged over the label set-based evaluations for N test entities is computed. The performance results of the deep network chain were 0.038, 0.96, and 0.91 for the loss measure, accuracy, and F-measure, respectively. In general, classifiers presenting lower loss values and higher accuracy and F-measure values are considered adequate and can thus be relied upon. Table 5 compares the presented Deep Network Chain with the results of the DeepImage approach [21]. It can be observed from Table 5 that the presented Deep Network Chain outperformed the DeepImage approach overall. This could be due to the DeepImage method selecting only certain features to build a gray image and then feeding it into a two-dimensional CNN to detect and characterize darknet traffic. In contrast, the methodology presented in this work employs a chain of two deep networks instead of a single classifier for detecting and characterizing darknet traffic: the first deep network detects darknet traffic, and the detected darknet activities are passed to a second classifier for characterization.
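The harshness of the exact-match loss can be illustrated with a short sketch in which each entity carries a (detection label, classification label) pair. The pair representation, the function name zero_one_loss, and the toy labels are assumptions for illustration and are not taken from the experiments.

```python
from typing import Optional, Sequence, Tuple

LabelPair = Tuple[str, Optional[str]]


def zero_one_loss(y_true: Sequence[LabelPair], y_pred: Sequence[LabelPair]) -> float:
    """Fraction of entities whose full label pair is not matched exactly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 1.0 - exact / len(y_true)


chain_truth = [("darknet", "Chat"), ("darknet", "P2P"), ("normal", None), ("normal", None)]
chain_pred = [
    ("darknet", "Chat"),             # exact match
    ("darknet", "Audio streaming"),  # detected correctly but misclassified: mismatch
    ("normal", None),                # exact match
    ("darknet", "Browsing"),         # false detection: mismatch
]
chain_loss = zero_one_loss(chain_truth, chain_pred)
print(chain_loss)  # 0.5
```

In this toy case, the second entity is penalized in full even though the detection stage was correct, which is the behavior described above.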

Conclusions
This paper presented an approach to detecting and classifying darknet traffic by deploying Deep Network Chains. The first classifier in the chain is a deep network binary classifier that detects darknet activities in the detection stage. Such activities are passed to the second classifier in the chain, a multiclass deep network that categorizes the hidden services and applications in the darknet classification stage. The methodology of this paper was verified on a dataset containing both VPN and Tor traffic. Optimization and parameter tuning were carried out in both stages (i.e., the detection stage and the classification stage) to achieve more accurate results. To evaluate the performance of the deep network chain, adequate evaluation metrics for classifier chains were utilized, including the loss measure. The performance results of the deep network chain were 0.038, 0.96, and 0.91 for the loss measure, accuracy, and F-measure, respectively. It was observed that the misclassification is due to audio and video streaming commonly using shared real-time protocols. However, in cases in which it is desired to distinguish between such activities accurately, the presented deep chain classifier can accommodate additional classifiers. More generally, additional classifiers can be added to the chain to further categorize specific activities of interest.
Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.