Darknet traffic classification and adversarial attacks using machine learning

The anonymous nature of darknets is commonly exploited for illegal activities. Previous research has employed machine learning and deep learning techniques to automate the detection of darknet traffic in an attempt to block these criminal activities. This research aims to improve darknet traffic detection by assessing a wide variety of machine learning and deep learning techniques for the classification of such traffic and for classification of the underlying application types. We find that a Random Forest model outperforms other state-of-the-art machine learning techniques used in prior work with the CIC-Darknet2020 dataset. To evaluate the robustness of our Random Forest classifier, we obfuscate select application type classes to simulate realistic adversarial attack scenarios. We demonstrate that our best-performing classifier can be degraded by such attacks, and we consider ways to effectively deal with such adversarial attacks. © 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/ )

Fig. 1. Layers of the Internet ( Demertzis et al., 2021 ).
We correlate the resulting RF confusion with our obfuscation technique for three attack scenarios, assuming few limitations on traffic modification. We then assess the strength of our obfuscation technique with one defense scenario, by which we demonstrate that we can restore the performance of the RF classifier despite duress. We find that sufficient statistical knowledge of network traffic features can empower either the classification or the obfuscation task.
A high-level overview of our experiments is provided in Fig. 2 . After some limited initial data cleaning, for each experiment, we partition the dataset under consideration into training and validation sets. In the base case, a specific machine learning model is trained, based on the training set, with the validation set used to compute accuracy and F1-score statistics. As mentioned above, we consider data augmentation using SMOTE. We also conduct experiments using the generator module of AC-GAN to produce synthetic data, which can be viewed as another form of data augmentation. The three adversarial attack scenarios mentioned above assume that the attacker can manipulate the training data, the validation data, or both. In Section 4.2 , we discuss these attack scenarios in detail, and explain why they are realistic threats.
The remainder of this paper is structured as follows. Section 2 gives a brief background discussion of Tor and VPN, and considers related work on darknet traffic detection. Section 3 describes the dataset used in our experiments and outlines our experimental methodology. Section 4 provides background knowledge on the machine learning techniques used in our experiments and gives implementation details. Section 5 discusses the results of our experiments. Lastly, Section 6 summarizes our research and considers possible directions for future work.

Background
In this section, we first discuss the two broad categories of data in our dataset, namely, Tor and VPN traffic. Then we discuss the most relevant examples of related work.

The onion router
Initially, The Onion Router (Tor) was a project started by the United States Navy to secure government communication. Since 2006, Tor has become a nonprofit with thousands of servers (called relays or relay nodes) run by volunteers across the world ( Tor Project History ). Tor clients anonymize their TCP application IP addresses and session keys, sending encrypted application traffic through a network of relays ( Sarkar et al., 2020 ). An example client application is the Tor Browser, which allows users to browse the web anonymously.
Tor generally selects a relay path of three or more nodes and encrypts the data once for each node using temporary symmetric keys. The encrypted data hops from relay to relay, where each relay node only knows about the previous node and the next node along the path. This design makes it difficult to trace the original identity of Tor clients. Each relay removes a layer of encryption, so that by the last relay, the original data is forwarded to the intended destination as plaintext. Tor then deletes the temporary session keys used for encryption at each node, so that any subsequently compromised nodes cannot decrypt old traffic ( Dingledine et al., 2004 ).
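The layered encryption described above can be illustrated with a toy sketch. The XOR keystream below is only a stand-in for the real symmetric ciphers Tor uses and provides no actual security; the relay names and key handling are likewise illustrative assumptions, not Tor's protocol.

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    """Derive a simple pseudorandom keystream from a key (toy stand-in for a real cipher)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def xor_layer(data: bytes, key: bytes) -> bytes:
    """XOR with a key-derived stream; applying it twice restores the input."""
    ks = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

def onion_encrypt(payload: bytes, relay_keys: list) -> bytes:
    # Encrypt once per relay, innermost layer first, so the entry relay
    # removes the outermost layer and the exit relay recovers the payload.
    for key in reversed(relay_keys):
        payload = xor_layer(payload, key)
    return payload

def onion_decrypt(onion: bytes, relay_keys: list) -> bytes:
    # Each relay, in path order, strips one layer of encryption.
    for key in relay_keys:
        onion = xor_layer(onion, key)
    return onion
```

Note that a relay peeling only its own layer still sees ciphertext, which mirrors why no single relay learns both the client identity and the payload.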

Virtual private networks
Virtual Private Networks (VPN) are used to ensure communication privacy for individuals or enterprises, and can serve to separate private address spaces from the public Internet. VPN software disguises client IP addresses by tunneling encrypted communications through a trusted server, which acts as a gateway or proxy to route client traffic to the broader network space. Client data is anonymized behind VPN server credentials before being forwarded to an intended destination, which may be either public or private. Any response traffic is sent back through the VPN server over the encrypted connection for the client to decrypt, ensuring anonymity between the client and recipient. Third parties, such as Internet Service Providers (ISP), will only see the VPN server as the destination of client communications. There are many forms of VPN. Some operate at the network layer, others reside at the transport or application layer ( Venkateswaran, 2001 ).

Related work
Several researchers have considered the problem of detecting darknet traffic. However, there are limited public darknet datasets available. The CIC-Darknet2020 dataset used in the experiments reported in this paper was generated by Lashkari et al. (2020). This dataset was also used in prior research, including Demertzis et al. (2021), Iliadis and Kaifas (2021), and Sarwar et al. (2021), and it has become a well-known darknet traffic dataset due to its accessibility. In their research, Lashkari et al. (2020) grouped Tor and VPN together as darknet traffic, while non-Tor and non-VPN were grouped as benign traffic (clearnet). They created 8 × 8 grayscale images from 61 select features and used Convolutional Neural Networks (CNN) to classify samples in the dataset. Their CNN model achieved an overall accuracy of 94% when classifying traffic as darknet or benign, and 86% accuracy when classifying the application type used to generate the traffic. The application traffic was broadly labeled as browsing, chat, email, file transfer, P2P, audio streaming, video streaming, or VOIP.
The research reported in Sarwar et al. (2021) consisted of classifying traffic and application type by combining a CNN and two other deep-learning techniques: Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). They addressed the issue of having an imbalanced dataset by performing Synthetic Minority Oversampling Technique (SMOTE) on Tor, the minority traffic class. They used Principal Component Analysis (PCA), Decision Trees (DT), and Extreme Gradient Boosting (XGBoost) to extract 20 features before feeding the data into CNN-LSTM and CNN-GRU architectures. Their CNN layer was used to extract features from the input data, while LSTM and GRU did sequence prediction on these features. CNN-LSTM in combination with XGBoost as the feature selector produced the best F1-scores, achieving 96% when classifying traffic type and 89% when classifying application type.
The study ( Iliadis and Kaifas, 2021 ) focused on just traffic type from the CIC-Darknet2020 dataset. They used k -Nearest Neighbors ( k -NN), Multi-layer Perceptron (MLP), RF, DT, and Gradient-Boosting Decision Trees (GBDT) to do binary and multi-class classification. For binary classification, they grouped the data into two classes, namely, benign and darknet, similar to Lashkari et al. (2020) . For the multi-class problem, they used the original four classes of traffic type (Tor, non-Tor, VPN or non-VPN). They found that RF was the most effective classifier for traffic type, yielding F1-scores of 98.7% for binary classification and 98.61% for multiclass classification.
Using the same dataset, the authors of ( Demertzis et al. ) further broke down the application categories into 11 classes and used Weighted Agnostic Neural Networks (WANN) to classify the data. Unlike regular ANNs, WANNs do not update neuron weights, but rather update their own network architecture piece-wise. WANNs rank different architectures by performance and complexity, forming new network layers from the highest ranked architecture. Their best WANN model achieved 92.68% accuracy on application layer classification.
The UNB-CIC Tor and non-Tor dataset, also known as ISCX-Tor2016 ( Lashkari et al., 2017 ), was used by Sarkar et al. (2020) to classify Tor and non-Tor traffic using Deep Neural Networks (DNN). They built two models: DNN-A with 3 layers and DNN-B with 5 layers. DNN-A distinguished Tor from non-Tor samples with 98.81% accuracy, while DNN-B achieved 99.89% accuracy. For Tor samples, they built a 4-layer Deep Neural Network to classify eight application types. This model attained 95.6% accuracy.
In another study, Hu et al. (2020) generated their own dataset, capturing darknet traffic across eight application categories (browsing, chat, email, file transfer, P2P, audio, video and VOIP) sourced from four different darknets (Tor, I2P, ZeroNet, and Freenet). They used a 3-layer hierarchical approach for classification. The first layer classified traffic as either darknet or normal. In the second layer, samples classified correctly as darknet were then classified by their darknet source. The third layer then classified application type for each of the darknet sources. The techniques ( Hu et al., 2020 ) used for classification include Logistic Regression (LR), RF, MLP, GBDT, Light Gradient Boosting (LightGB), XGBoost, LSTM, and DT. Their hierarchical method attained 99.42% accuracy in the first layer, 96.85% accuracy in the second layer and 92.46% accuracy in the third layer. Table 1 provides a summary of the prior work presented in this section. We note that the research in Iliadis and Kaifas (2021) , Lashkari et al. (2020) , Sarwar et al. (2021) use the same dataset that we consider in this paper.

Methodology
The primary goal of this research is to improve upon the state-of-the-art classification of darknet traffic by exploring the performance of Support Vector Machines (SVM), Random Forest (RF), Gradient-Boosting Decision Trees (GBDT), Extreme Gradient Boosting (XGBoost), k-Nearest Neighbors (k-NN), Multilayer Perceptron (MLP), Convolutional Neural Networks (CNN), and Auxiliary Classifier Generative Adversarial Networks (AC-GAN) as classifiers. We experiment with different levels of SMOTE during a preprocessing phase, oversampling the minority classes of the CIC-Darknet2020 dataset to assess the effects of data augmentation and class balance on classifier performance. We also consider using the AC-GAN generator for data augmentation, but we find that it is ineffective for this purpose. We experiment with representations of the darknet traffic features as 2-dimensional grayscale images for CNN and AC-GAN. Then we test the robustness of our best-performing classifier in obfuscation scenarios, which serve to simulate adversarial attacks, assuming the perspectives of both an attacker and a defender.
In our adversarial attacks, we apply statistical knowledge of the dataset to obfuscate specific data features, disguising one or more classes as others. We explore three scenarios whereby we either obfuscate the training data, the validation data or both. Obfuscating just the validation data simulates an attack scenario in which traffic data is disguised while our classifier is yet unaware of the attack, and thus we can only apply previously trained models without a chance to learn from the obfuscation. Obfuscating just the training data simulates a scenario in which an attacker has accessed our training data to poison it, such that we train our classifier with malformed assumptions or outright malicious supervision. A third scenario supposes we collect some of the obfuscated traffic data before training our classifier, and thus have a chance to update our classification models to detect obfuscated validation data.

Dataset
The CIC-Darknet2020 dataset ( Lashkari et al., 2020 ) is an amalgamation of two public datasets from the University of New Brunswick. It combines the ISCXTor2016 and ISCXVPN2016 datasets, which capture real-time traffic using Wireshark and TCPdump ( Gil et al., 2016;Lashkari et al., 2017 ). CICFlowMeter ( Lashkari, 2018 ) is used to generate CIC-Darknet2020 dataset features from these traffic samples. Each CIC-Darknet2020 sample consists of traffic features extracted in this manner from raw traffic packet capture sessions. CIC-Darknet2020 consists of 158,659 hierarchically labeled samples. The top level traffic category labels consist of Tor, non-Tor, VPN, and non-VPN. Within these top level categories, samples are further categorized by the types of application used to generate the traffic. These type subcategories are audiostreaming, browsing, chat, email, file transfer, P2P, video-streaming, and VOIP. Table 2 details the applications that are used to generate each type of traffic at the application level.

Preprocessing
The CIC-Darknet2020 dataset has samples with missing data, specifically, feature values of "NaN". We remove samples with these values in our data cleaning phase. As shown in Table 3 , there are significantly fewer Tor samples than samples of the other traffic categories. Prior work using this dataset eliminated the CICFlowMeter flow labels, namely, Flow ID, Timestamp, Source IP, and Destination IP. The Flow ID and Timestamp labels are also eliminated in our research. However, to obtain as much information as possible from the CIC-Darknet2020 dataset, we separate each octet of the source and destination IP addresses into its own feature column. Preliminary tests run on the dataset with and without these IP octet features indicate an improvement in the performance of the classifiers when this IP information is retained. Thus our dataset contains 72 features in total after this preprocessing step. The CIC-Darknet2020 dataset is scaled by min-max normalization, which applies the equation x' = (x - x_min) / (x_max - x_min) to every value x in each feature column, where x_min and x_max are the minimum and maximum values of that column. Note that this serves to scale the feature values between 0 and 1. We also apply min-max normalization to our IP octet feature columns.
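As a concrete sketch of this preprocessing, the following applies min-max scaling column-wise and splits an IPv4 address into octet features. The function names are ours, not from the original code.

```python
import numpy as np

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Scale each feature column to [0, 1] via x' = (x - min) / (max - min)."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    rng = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid divide-by-zero on constant columns
    return (X - col_min) / rng

def split_ip_octets(ip: str) -> list:
    """Separate an IPv4 address into four integer octet features."""
    return [int(octet) for octet in ip.split(".")]
```
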

Data balancing
The CIC-Darknet2020 dataset does not have balanced sample counts among traffic and application classes, as shown in Tables 3 and 4 . To explore the effect of reducing this imbalance on the classification task, we oversample each minority class using SMOTE. SMOTE interpolates linearly between feature values to produce new samples ( Bhagat and Patil, 2015 ). We experiment with the following levels of oversampling: 0% (no SMOTE), 20%, 40%, 60%, 80% (partial SMOTE), and 100% (full SMOTE). SMOTE is performed on every class whose sample count is less than the oversampling threshold, expressed as a percentage of the largest class's sample count.
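Our experiments use the SMOTE implementation from the imblearn library; the sketch below only illustrates the underlying idea, linear interpolation between a sample and one of its nearest in-class neighbors, together with the threshold rule as we interpret it. Both function names and the exact threshold semantics are our assumptions.

```python
import numpy as np

def smote_oversample(X, n_new, k=5, rng=None):
    """Generate n_new synthetic samples by interpolating between a randomly
    chosen sample and one of its k nearest neighbors (the core SMOTE idea)."""
    rng = rng or np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbors)
        u = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X[i] + u * (X[j] - X[i]))
    return np.array(synthetic)

def targets_for_level(counts, level):
    """With oversampling level `level` (e.g. 0.6 for 60%), raise every class
    below level * max_count up to that threshold."""
    threshold = int(level * max(counts.values()))
    return {c: max(0, threshold - n) for c, n in counts.items()}
```

For example, at the 60% level a class holding 30 samples against a 100-sample majority class would receive 30 synthetic samples to reach the threshold of 60.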

Data representation
SVM and RF both use the dataset samples in their original format, which is a 1-dimensional array. However, we reshape each sample to be 2-dimensional for CNN and AC-GAN. Intuitively, the data is reshaped as 9 × 9 grayscale images, where each of our 72 features is represented as a single pixel, with the remaining pixels produced by zero padding. The pixels are ordered as their respective features appear in the CIC-Darknet2020 dataset, starting at the top left corner of the image as shown in Fig. 3 , where each row represents samples from an application class, color-coded for readability.
Both CNN and AC-GAN convolve local structures within the 2-D images, so adjacent pixels play an important role in classification. Therefore, we experiment with strategies to reorder the data to achieve better performance. We order the pixels by feature importance, as determined by our Random Forest classifier, starting at the top left corner of the image, and we also reorganize the pixels spiraling outward from the center of the image. This latter strategy tends to group pixels with larger values toward the center of each image, as shown in Fig. 4 .
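The reshaping step can be sketched as follows; the spiral ordering is omitted for brevity, and `order` would hold feature indices sorted by RF importance (the function name is ours).

```python
import numpy as np

def to_image(sample, order=None, size=9):
    """Place 72 feature values into a size x size grayscale image,
    zero-padding the unused pixels. `order` optionally reorders features
    (e.g. by Random Forest feature importance) before placement."""
    x = np.asarray(sample, dtype=float)
    if order is not None:
        x = x[order]
    img = np.zeros(size * size)
    img[:len(x)] = x          # remaining pixels stay zero (padding)
    return img.reshape(size, size)
```

With 72 features, the first eight rows of the 9 × 9 image are filled and the last row is entirely zero padding.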

Data augmentation experiment
We experimented with AC-GAN as an alternative to SMOTE, with the goal of generating realistic artificial samples that can be  used to augment our dataset. Again, we use data augmentation to address the issue of class imbalance. However, we abandoned this approach as we found that the fake images generated by AC-GAN are consistently detectable by a CNN model with accuracy ranging from 99% to 100%. We believe that the failure of our AC-GAN to produce realistic fake images is due to the depth of the AC-GAN neural network architecture, which is constrained by the input image size. In any case, we were unsuccessful in our attempt to use AC-GAN to augment our data.
An example of four fake samples compared to real samples can be found in Fig. 5 . The fake samples in this figure may appear to be useful but, again, a CNN can distinguish the fake from the real with essentially 100% accuracy. This clearly shows that, from a machine learning perspective, the fake samples are not sufficient for data augmentation.

Evaluation metrics
In our experiments, we use accuracy and F1-score to measure the performance of each classifier. Accuracy is computed as the total number of correct predictions divided by the number of samples tested. The F1-score is the harmonic mean of the precision and recall metrics, which is better suited to unbalanced datasets such as CIC-Darknet2020. Like accuracy, F1-scores fall between 0 and 1, with 1 being the best possible. The F1-score is computed as F1 = 2 · (precision · recall) / (precision + recall). Precision is the fraction of samples classified as positive that truly belong to the positive class, while recall is the fraction of positive samples that are classified correctly. Precision and recall are computed as precision = TP / (TP + FP) and recall = TP / (TP + FN), respectively, where TP, FP, and FN denote true positives, false positives, and false negatives.
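In practice we obtain these metrics from scikit-learn's metrics module; the sketch below computes them directly for a single positive class to make the definitions concrete.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1 from raw label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
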

Implementation
This section details the implementation of the experiments that we mentioned in Section 3 . All experiments are coded in Python.
The Imblearn library (imblearn) is used to implement SMOTE to balance the dataset, while the Scikit-learn package ( Scikit-learn: Machine Learning in Python ) is employed to run most of the experiments, the exceptions being that the Tensorflow and Keras libraries are utilized to implement CNN and AC-GAN. From the Scikit-learn library, the metrics module is used to evaluate the F1-scores and accuracy of the classifiers, and the StratifiedKFold function is applied to perform 5-fold cross validation. Graphs are generated with the Matplotlib and Seaborn libraries, with the exception of the confusion matrices and bar graph, which are typeset directly in LaTeX using PGFPlots.
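A minimal sketch of the 5-fold stratified cross validation loop follows; the helper name and the weighted F1 averaging are our assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(model, X, y, n_splits=5):
    """5-fold stratified cross validation, reporting mean accuracy and weighted F1."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    accs, f1s = [], []
    for train_idx, val_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        accs.append(accuracy_score(y[val_idx], pred))
        f1s.append(f1_score(y[val_idx], pred, average="weighted"))
    return float(np.mean(accs)), float(np.mean(f1s))
```

Stratification preserves the class proportions in every fold, which matters for an imbalanced dataset such as CIC-Darknet2020.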
All experiments in this research are executed on one of two personal computers, as detailed in Table 5 . We exploit a graphics processing unit (GPU) in the second computer to decrease the training time of our more computationally demanding experiments, that is, those using neural networks to process 2-D image representations.

Overview of classification techniques
This section briefly describes the machine learning and deep learning concepts that we apply to classification in our experiments. These include the boosting techniques of GBDT and XG-Boost, as well as k -NN, MLP, SVM, RF, CNN, and AC-GAN.

Boosting techniques
Boosting is a general technique whereby a collection of weak classifiers is combined to produce a stronger classifier. Gradient-Boosting Decision Trees (GBDT) assign weights to decision trees based on residuals (i.e., gradient calculations). Extreme Gradient Boosting (XGBoost) is a slightly modified, and highly efficient, implementation of the GBDT technique. XGBoost has performed well in numerous machine learning competitions ( Synced, 2017 ).
In our GBDT experiments, we employ the log loss function, with a learning rate of α = 0.1 and 100 estimators. For our XGBoost experiments, the learning rate is selected to be α = 0.3, the maximum depth of the trees is 6, and we employ uniform sampling.
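These settings correspond to the scikit-learn configuration sketched below. The XGBoost line is shown as a comment since it requires the separate xgboost package, and `sampling_method="uniform"` is our reading of "uniform sampling".

```python
from sklearn.ensemble import GradientBoostingClassifier

# GBDT with log loss (scikit-learn's default objective), learning rate
# alpha = 0.1, and 100 estimators, matching the settings above.
gbdt = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100)
gbdt.fit([[0.0], [0.1], [0.9], [1.0]], [0, 0, 1, 1])  # toy separable data

# XGBoost analogue (requires the xgboost package):
# xgb = xgboost.XGBClassifier(learning_rate=0.3, max_depth=6, sampling_method="uniform")
```
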

k -Nearest Neighbors
As the name suggests, in k-Nearest Neighbors (k-NN), samples are classified based on the k nearest samples in the training set. There is no explicit training required in k-NN, and hence no algorithm can be simpler, at least in terms of training. In spite of, or perhaps because of, its simplicity, there exist strong error bounds for k-NN. However, the technique is sensitive to local structure and, in particular, for small values of k, overfitting is common. Based on small-scale experiments, we use k = 5 for all k-NN experiments reported in this paper.
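In scikit-learn this amounts to the following sketch, where the toy one-dimensional data illustrates the majority vote among the 5 nearest training samples.

```python
from sklearn.neighbors import KNeighborsClassifier

# "Training" merely stores the samples; all work happens at query time.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]], [0, 0, 0, 1, 1, 1])
# A query point is labeled by majority vote among its 5 nearest neighbors.
```

For the query 0.15, the five nearest training points carry labels 0, 0, 0, 1, 1, so the vote yields class 0.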

Multilayer perceptron
Multilayer Perceptrons (MLP) are feedforward networks that generalize basic perceptrons to allow for nonlinear decision boundaries. This is somewhat analogous to the way that nonlinear SVMs generalize linear SVMs. In a sense, MLPs are the simplest useful neural network architecture, and hence they are sometimes referred to simply as Artificial Neural Networks (ANN). In our MLP experiments, we use an architecture with a hidden layer of 100 neurons, rectified linear unit (ReLU) activation functions, the Adam optimizer, and a learning rate of α = 0.0001.

Support vector machines
Support Vector Machines (SVM) are supervised machine learning models frequently used for classification. An SVM attempts to find one or more hyperplanes to separate labeled training data while maximizing the margin of the decision boundaries between classes. The data must be vectorized into linear feature sets, but non-linear data can also be encoded with some success. Scaling the feature values across training samples allows coefficients of the hyperplanes (weights) to be ranked by relative importance. SVMs rely on the so-called kernel trick to map data into a higher dimensional space, which can yield nonlinear decision boundaries in the input space. The idea behind the kernel trick is that in higher dimensions, it is generally easier to find hyperplanes to separate classes ( Stamp, 2022 ). For our research, we perform preliminary tests to determine the best kernel for our dataset, with the result being the Gaussian radial basis function (RBF).

Random forest
Random Forest (RF) is an ensemble method that generalizes Decision Trees (DT). While a DT is a simple and efficient classification algorithm, it is highly sensitive to variance in the training data and hence prone to overfitting. RF compensates for these deficiencies by generating many subsets of the dataset via random sampling with replacement, then training a DT on each subset, with a random subset of the features considered by each tree. This process is called bootstrapping. To classify, RF takes the majority vote over all of the resulting DTs, in a process called aggregation. Together, bootstrapping and aggregation are referred to as bagging ( Misra and Li, 2020;Stamp, 2022 ). RF also enables us to rank the importance of features, based on the mean decrease in entropy within the component DTs. Feature importance tells us how influential each feature is when the RF classifies samples. Based on small-scale experiments, we found that the default hyperparameters in Scikit-learn yielded the best results; see (sklearn.ensemble.RandomForestClassifier) for the details.
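The sketch below trains an RF with scikit-learn's default hyperparameters on toy data of our own construction and extracts the impurity-based feature importances that we later use to reorder image pixels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy data: only feature 0 carries class information; features 1-2 are noise.
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# Default scikit-learn hyperparameters, as in our experiments.
rf = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sorting them descending gives the
# feature ordering used for the 2-D image representations.
order = np.argsort(rf.feature_importances_)[::-1]
```
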

Convolutional neural networks
Convolutional Neural Networks (CNN) are a unique type of neural network that focus on local structures, making them ideal for image analysis. CNNs are composed of an image input layer, convolution and pooling layers, and a fully-connected output layer that produces a vector of class scores. Convolutional and pooling layers are the fundamental components of any CNN architecture. In convolutional layers, the output of the previous layer (or the raw image in the initial convolutional layer) is convolved with randomized filters to produce local structure maps that are joined to create the output of the layer. In the convolutional process, the filter windows slide across the input image, thus emphasizing local structure and providing a degree of translation invariance. The components of each filter are learned when training a CNN. Pooling layers decrease total training time by reducing the dimensionality of the resulting feature maps, concentrating effort on the most significant features ( Convolutional Neural Networks for Visual Recognition; Lashkari et al. 2020 ). For this research, we use max pooling.
Our CNN architecture is based on that described in ( Lashkari et al., 2020 ). We experiment with various hyperparameters, testing all combinations of the following in a grid search.
• Initial number of convolution filters (9, 32, 64, 81)
• Filter size (2 × 2, 3 × 3)
• Percentage dropout (0.2, 0.5)
• Number of nodes in the first dense layer (72, 256)
All of these architectures yield accuracies within the range of 86% to 88% when classifying application type. Therefore, we select the architecture that produces the highest accuracy. Our selected CNN architecture is illustrated in Fig. 6 . Note that we use Adam as our optimizer and sparse categorical cross entropy as our loss function.
Dropout is a common technique used to combat overfitting in neural networks with fully-connected layers. However, it is found to be less effective with convolution layers. A better regularization technique for CNNs is to "cut out" sections of the input images. Such cutouts force the CNN to learn from the other parts of an image during training, which tends to activate filters that would otherwise atrophy. It is comparable in effect to dropout, except that it operates at the input stage rather than the intermediate layers ( DeVries and Taylor; Li et al. 2021 ). We implement cutouts by creating feature masks of equivalent size to our input image. We experiment with different cutout sizes, including 2 × 2, 3 × 3, and 4 × 4, and randomize the position of the cutout within the mask. Refer to Fig. 7 for some examples of masks with 3 × 3 cutouts. Our cutout experiments are discussed in detail in Section 5.2 , below.
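A minimal sketch of such cutout masks follows (function names are ours): each mask zeroes one randomly placed square region of the 9 × 9 input and is multiplied into the image.

```python
import numpy as np

def cutout_mask(size=9, cut=3, rng=None):
    """Binary mask the size of the input image, with one cut x cut
    square zeroed at a random position."""
    rng = rng or np.random.default_rng()
    mask = np.ones((size, size))
    r = rng.integers(0, size - cut + 1)   # keep the square fully inside the image
    c = rng.integers(0, size - cut + 1)
    mask[r:r + cut, c:c + cut] = 0.0
    return mask

def apply_cutout(image, cut=3, rng=None):
    """Multiply a fresh random cutout mask into the image."""
    return image * cutout_mask(image.shape[0], cut, rng)
```
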

Auxiliary-classifier generative adversarial network
Generative Adversarial Networks (GAN) are comprised of two neural network architectures, a generator and a discriminator, that compete in a zero-sum game during training. The generator takes noise from a latent space as input and produces images that feed into the discriminator. The discriminator is given both real and generated images and is tasked with classifying them as either real or fake. The discriminator error is then fed back into the generator to improve its image generation. AC-GAN is an extension of this base GAN architecture, taking a class label as additional input to the generator while predicting this label as part of the discriminator output. The objective of the AC-GAN generator is to minimize the ability of the discriminator to distinguish between real and fake images while also maximizing the accuracy of the discriminator when predicting the class label ( Mudavathu et al., 2018;Nagaraju and Stamp, 2021 ). Besides using the AC-GAN generator in data augmentation experiments, we also explore the secondary class prediction output of the discriminator as a classifier.
Our AC-GAN architecture is inspired by the ImageNet model described in ( Odena et al., 2017 ). However, since that architecture was built for image sizes of 32 × 32 or larger, we modify it to accommodate our 9 × 9 image size by reducing the number of convolutional and transposed convolutional layers in the discriminator and generator, respectively.
We fine-tune our AC-GAN hyperparameters by experimenting with the training configuration described below.
We feed training data to our AC-GAN model in batches of 64 samples. Batch normalization (BatchNorm) layers are applied between convolutional layers to regularize the training gradient step size. BatchNorm is thought to smooth local optimization steps and stabilize training, thereby accelerating convergence of GAN models ( Santurkar et al., 2018 ).

Adversarial attacks
Our adversarial attacks rely on obfuscation, which serves to disguise application classes based on applied probability analysis. We select application classes to disguise as other classes based on minimum and maximum sum statistical distance between all class features, as specified in Algorithm 1 .
We also select a third class transformation to perform based on maximal classifier confusion, whose sum statistical distance between class features is notably low, but not the minimum between classes. We ensure our class transformation can be decoded by encoding features with a deterministic algorithm, given here as Algorithm 2 . We impose no additional restrictions on feature transformation.
We start by generating normalized histograms of feature values per class to assess the probability at which values occur within each class. To decide which classes to obfuscate, we examine the sums of the distances between feature probability distributions from each class to each other class. We use the cdist function of the scipy Python library to calculate the Euclidean distance between probability distributions. This provides an estimate of the overall difference between classes while considering all feature probability distributions. In the case of application type, this yields the 8 × 8 array in Table 8 , where the Class numbers correspond to those in Table 4 , above.
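The class selection step can be sketched as follows, using per-class, per-feature histograms over the min-max scaled values and SciPy's cdist; the function names and the 100-bin histogram width are our choices for the sketch.

```python
import numpy as np
from scipy.spatial.distance import cdist

def class_histograms(X, y, cls, bins=100):
    """Normalized histogram of each feature's values for one class,
    giving a per-feature probability distribution over [0, 1]."""
    Xc = X[y == cls]
    return np.stack([np.histogram(Xc[:, f], bins=bins, range=(0, 1))[0] / len(Xc)
                     for f in range(X.shape[1])])

def class_distance_matrix(X, y, classes, bins=100):
    """Sum over features of the Euclidean distance between the two
    classes' feature distributions, for every pair of classes."""
    hists = {c: class_histograms(X, y, c, bins) for c in classes}
    D = np.zeros((len(classes), len(classes)))
    for i, a in enumerate(classes):
        for j, b in enumerate(classes):
            # cdist gives all pairwise row distances; the diagonal pairs
            # matching features, so its trace is the required sum
            D[i, j] = np.trace(cdist(hists[a], hists[b]))
    return D
```

Small entries of the resulting matrix flag the most similar class pairs (the easiest disguises), while large entries flag the most different pairs.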
From Table 8 , we observe that class 0 is most different from class 5 and class 3 is most similar to class 7. We pick the classes with the minimum and maximum sum of statistical distances between features, changing class 0 (audiostreaming) to class 5 (P2P) and class 3 (email) to class 7 (VOIP). We also examine the confusion matrix for our best-performing classifier, RF, which is shown in Fig. 8 . RF is observed to be most confused between class 2 (chat) and class 3 (email), so we decide to additionally obfuscate class 2 with class 3. We arbitrarily choose to transform lower numbered classes to higher numbered classes, e.g., disguising class 2 as class 3 instead of class 3 as class 2. Our obfuscation algorithm first calculates the difference in class probability distributions (DCPD) for each feature between the two classes under consideration, where the classes are denoted as A and B, and sorts each distribution from maximum to minimum. Intuitively, the index of the DCPD maximum corresponds to the feature value most probably belonging to the positive class A, while the minimum corresponds to the feature value most probably belonging to the negative class B. To obfuscate a sample, we then transform individual feature values by subtracting the difference in bin thresholds between the original feature bin and a target bin for obfuscation. To choose target bins for this transformation, we create a 1-to-1 map of the sorted indices of each DCPD with a reverse sort of the same DCPD. This ensures that a transformed sample feature can be decoded later, given the feature DCPD for a class obfuscation vector. An example visualization of the DCPD bin mapping for the transformation of the most common feature 0 values from class 2 to class 3 is provided in Section 4.2.1 , below.
Reversing the 1-to-1 bin map facilitates decoding of obfuscated class sample feature values back to their original values. To do this, we add back the same difference in bin thresholds that we subtracted earlier, thus applying each feature DCPD between known classes as a decoder key to undo an expected class obfuscation for a particular feature. To test this method of class obfuscation, we performed the three adversarial attacks summarized in Table 9 , with RF as the classifier.

An obfuscation example
To illustrate Algorithm 2 , we will walk through a simple example where we are given a sample from class 2 and we want to transform this sample to look more like class 3. Let us start with the first feature, feature 0. We note the value of this feature for class 2; call this value v . Suppose, for example, that v = 0 . 178142 .
We allocate 100 equal-width bins ranging from 0 to 1, so that bin b0 corresponds to values 0.00 to 0.01, and so on. Given the value of v, we find the bin that v falls into. The value v = 0.178142 is in bin b17, which contains values between 0.17 and 0.18. Bin b17 is indicated by the red arrows in Fig. 9. We then flip the sorted DCPD index at b17 to locate our target bin, indicated by the black arrows in Fig. 9. This target bin is b58, which contains values between 0.58 and 0.59. To obfuscate, we subtract the difference between the b17 and b58 bin thresholds from v. In this example, our new transformed value is 0.178142 − (0.17 − 0.58) = 0.588142, which falls into the target bin b58. We repeat this for all the features to transform the sample from class 2 to class 3. Note that this obfuscation technique is designed to maximize the effectiveness of a simulated adversarial attack. Our approach ignores practical limitations on the ability of attackers to modify the statistics of the data. Hence these simulated attacks can be considered worst-case scenarios, from the perspective of detecting darknet traffic under adversarial attack.
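The arithmetic in this walkthrough can be reproduced in a few lines, where the target bin b58 is taken from the flipped DCPD index in Fig. 9:

```python
WIDTH = 0.01                     # 100 equal-width bins on [0, 1]
v = 0.178142                     # feature 0 value for the class 2 sample
src = int(v / WIDTH)             # source bin: b17 (values 0.17 to 0.18)
tgt = 58                         # target bin from the flipped DCPD index (Fig. 9)
v_new = v - WIDTH * (src - tgt)  # subtract the difference in bin thresholds
# v_new = 0.588142, which lands in target bin b58 (values 0.58 to 0.59)
```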

Results and discussion
In this section, we consider a wide range of experiments. First, we determine which of the three 2-D image representation techniques discussed in Section 5.1 is most effective. Then we consider the use of cutouts, which can serve to reduce overfitting and improve accuracy in CNNs. We then turn our attention to the imbalance problem, with a series of SMOTE experiments. We conclude this section with an extensive set of experiments involving various adversarial attack scenarios.

Data representation experiments
We evaluate the CNN and the AC-GAN discriminator given different 2-D pixel representations of the data features. All of our 2-D representations of the data are of size 9 × 9, where each pixel is a feature. The pixels in the original representation follow the order in which the features appear in the CIC-Darknet2020 dataset. We hypothesize that grouping related pixels together will have a positive effect on the performance of our classifiers, since convolutions operate on local structures. Our results show that the CNN performs best when the pixels are sorted by RF feature importance and then grouped together at the center of the image. However, this is not true for the AC-GAN discriminator, which does better using the original data representation, contrary to our hypothesis. Table 10 shows the results for these experiments.
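The importance-sorted, center-grouped representation can be sketched as follows. The helper name and the distance-based placement are an illustrative reconstruction (assuming 81 feature values, with padding if fewer), not our exact preprocessing code.

```python
import numpy as np

def center_sorted_image(features, importances, size=9):
    """Arrange one sample's features into a size x size image, placing the
    most important features (by RF feature importance) nearest the center."""
    grid = np.zeros((size, size))
    c = (size - 1) / 2
    # grid cells ordered by squared distance from the image center
    cells = sorted(((r, col) for r in range(size) for col in range(size)),
                   key=lambda rc: (rc[0] - c) ** 2 + (rc[1] - c) ** 2)
    # walk features from most to least important, filling center-out
    for (r, col), f in zip(cells, np.argsort(-np.asarray(importances))):
        grid[r, col] = features[f]
    return grid
```

The intuition is that a convolution kernel sees the most informative pixels together in one neighborhood rather than scattered across the image.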

Cutout experiments
Initially, our CNN model is able to achieve 88% accuracy classifying application type within 15 epochs. However, we notice that overfitting starts to occur the longer we run our model. To reduce overfitting, we apply cutouts to the training data. We experiment with different cutout sizes: 2 × 2 , 3 × 3 , and 4 × 4 . We observe that cutouts allow our CNN to train for a longer period of time without overfitting. The loss graphs in Fig. 10 show how the CNN model overfits after 20 epochs in the original execution but does not overfit with cutouts. There is little difference in the effects of applying 2 × 2 compared to 3 × 3 cutouts. Both delay overfitting at the same rate and the accuracies for both linger at 88%. Notably, we witness a 1% decrease in accuracy with 4 × 4 cutouts. As our images are only 9 × 9 pixels, a 4 × 4 cutout likely deletes too much information from the image, negatively affecting the accuracy. While cutouts address the issue of overfitting, we find that more training does not significantly improve the performance of CNN on the dataset under consideration. Thus, we do not employ cutouts in the CNN results reported below.
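A cutout of this kind can be sketched as follows, with `apply_cutout` a hypothetical helper name; it simply zeroes a randomly placed square patch of the 9 × 9 feature image during training.

```python
import numpy as np

def apply_cutout(image, size=3, rng=None):
    """Cutout augmentation: zero out a randomly placed size x size patch
    of a 2-D feature image (e.g., a 3 x 3 patch of a 9 x 9 image)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape
    r = rng.integers(0, h - size + 1)   # top-left corner of the patch
    c = rng.integers(0, w - size + 1)
    out = image.copy()                  # leave the original sample intact
    out[r:r + size, c:c + size] = 0.0
    return out
```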

SMOTE Experiments
We compare the performance of our classifiers with various levels of SMOTE, applying SMOTE to oversample the training data before training each classifier for both cases, that is, traffic type and application type. The results from these experiments appear in Tables 11 and 12, respectively, where the best result for each case is highlighted; the improvements due to SMOTE are limited to about 2% in each case. Note also that the MLP results are the poorest in every case. We conclude that, for the problem under consideration, SMOTE is of some value for fine-tuning models. Our RF model without SMOTE outperforms the state-of-the-art F1-scores for both the traffic and application classification tasks. We observe a 1.1% improvement for traffic classification as compared to Iliadis and Kaifas (2021), who also found RF to be their best classifier. That study (Iliadis and Kaifas, 2021) only classified traffic type, so no application type performance is available for comparison. For application classification, our RF model achieves a 3.2% increase over Sarwar et al. (2021). In addition, our CNN model outperforms the CNN results in Lashkari et al. (2020) by 2.8% and is within 0.2% of the more complex and costly CNN-LSTM results in Sarwar et al. (2021). We can only compare classification results for application type with Lashkari et al. (2020) because they approach traffic type classification as a binary problem, while we address it as a multiclass problem. Table 13 summarizes the best performance of our classifiers in comparison to relevant prior work, where the best results in the Traffic and Application columns are boxed. Overall, RF is our best-performing classifier, while MLP and k-NN perform the worst. Also of note is the fact that the AC-GAN classifier is one of the best-performing models for the traffic classification problem, but it performs relatively poorly on the application classification task.
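For reference, the interpolation at the heart of SMOTE can be sketched in a few lines. This is a simplified, numpy-only illustration of the idea (synthetic minority samples interpolated between a sample and one of its k nearest neighbors), not the implementation used in our experiments.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE-style oversampling: each synthetic sample lies on the
    line segment between a random minority sample and one of its k nearest
    minority-class neighbors."""
    if rng is None:
        rng = np.random.default_rng()
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # a sample is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]   # k nearest neighbors per sample
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = rng.choice(neighbors[i])
        gap = rng.random()                     # interpolation factor in [0, 1)
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)
```

Because every synthetic point is a convex combination of two real minority samples, the oversampled data stays within the feature ranges of the original class.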

Adversarial attack experiments
With improvements in the accuracy of darknet traffic detection by machine learning and deep learning techniques, it is realistic to anticipate that attackers will attempt to find ways to circumvent detection by modifying the profile of their application traffic. For example, someone pirating copyrighted media with P2P applications might disguise their illegal activity as VOIP traffic to avoid prosecution. We show that obfuscation of traffic in this fashion can be accomplished by modifying traffic feature values, understanding that this process is most feasible and desirable at the application layer. Also, if an attacker were to discover the methods we use for classification and pollute our training data, then our classifiers could be compromised, allowing the attacker to avoid detection without modifying any of their traffic features. For this experiment, we assume the role of an attacker on the network, with the goal of modifying traffic features such that classes are incorrectly classified or entirely undetected. This could represent covert illegal activity whose detection an attacker wishes to hinder, with common examples being P2P or file-transfer applications. Realistically, traffic features common to one application class could be modified at the application layer to appear more similar to those of other application classes. An attacker could do this by writing a custom overlay application to change various features, such as the number of packets sent, their communication intervals, port assignment, and so on. In our experiments, we disguise class 0 as class 5 (originally the most different), class 2 as class 3 (the classes which most confused our RF classifier), and class 3 as class 7 (originally the most similar). In attack scenario 1, we train our RF classifier on the original application class data, then test the same model with an obfuscated class in the validation dataset.
This represents a hypothetical scenario where an attacker modifies the traffic features of one class at the application layer, perhaps with an overlay application. We demonstrate that our method of obfuscation is able to defeat our best classifier in this scenario, significantly reducing detection of the obfuscated class, as well as overall classifier accuracy. Before obfuscation, RF classifies application classes with an accuracy of 92.3% without SMOTE. After obfuscation of the three class choices mentioned in the previous paragraph, the overall RF accuracy for application classification without SMOTE decreases to 80.8%, 85.4%, and 88.7%, respectively.
The confusion matrices in Fig. 11 show that RF consistently misclassifies each class we obfuscate, detecting no samples at all in the case of an obfuscated class 3. In these matrices, the dashed circle indicates the class we are obfuscating and the solid circle indicates the class we intended it to appear as. However, RF does not misclassify classes 0 and 3 as the expected classes 5 and 7, respectively. Instead, confusion matrices (a) and (c) in Fig. 11 reveal that RF mostly categorizes classes 0 and 3 as classes 6 and 2, respectively. It may be relevant that our obfuscation method does not account for any interdependence between traffic feature values, since it obfuscates each feature independently.
In attack scenario 2, we train our RF classifier with an obfuscated class in the training dataset, then test the model with the original application class data. This represents a hypothetical scenario where an attacker entirely poisons our training data, perhaps by injecting malware into our database or by intercepting our traffic capture data stream. We find that the attacker could prevent an entire class from being predicted by our best classifier when the training data for that class is entirely obfuscated. We see this trend in all three confusion matrices in Fig. 12, where in each case, the dashed circle indicates the class we are obfuscating and the solid circle indicates the class we intended it to appear as. Notice that the entire row in the confusion matrix is zeroed out, indicating that the class was never predicted by RF. Similar to attack scenario 1, the overall RF accuracy decreases to 82.0%, 86.5%, and 89.4%, respectively, for application classification without SMOTE. Since the obfuscated class is never considered for prediction by RF in this scenario, we observe a smaller overall accuracy decrease as compared to attack scenario 1.
In attack scenario 3, we train our RF classifier with the same obfuscated class in both the training dataset and the validation dataset. We obfuscate only a small portion of the training data while still obfuscating all of the validation data for each of classes 0, 2, and 3, and we experiment with the percentage of training data that we obfuscate. This represents a hypothetical scenario where the obfuscation algorithm has been obfuscating network traffic long enough to pollute a small portion of a network traffic population. A defender then updates the classifier to include this small portion of obfuscated class data at training time, with increasing exposure to the obfuscated data over time. As our dataset is split into 80% training data and 20% validation data, we limit the training dataset exposure of obfuscated class data to 20% of the total training dataset. We choose to decrement this value logarithmically, with three sub-scenarios representing 0.2%, 2%, and 20% obfuscation exposure, expecting that with more exposure to the obfuscated class data, our classifier will adapt and outperform the obfuscation algorithm, correctly classifying the obfuscated class in our validation dataset. We find that 20% exposure of our obfuscation algorithm to the RF training data is sufficient for RF to predict the disguised classes with high accuracy, defeating our obfuscation technique, as shown in Table 14. Note that the overall accuracies reported for attack scenario 3 are higher than our RF benchmark score of 92.2%. However, we modify the validation dataset in both attack scenarios 1 and 3, so the resulting accuracies of those scenarios cannot be directly compared to the results of prior work. Our results with the lower exposure levels of 2% and 0.2% reveal a trend: of the classes tested, class 0 appears to be the most difficult for our algorithm to obfuscate, while class 3 appears to be the easiest to obfuscate.
Class 2 is somewhere in between, providing a loose correlation to our metric of statistical distance between classes and the performance of our obfuscation algorithm. We observe this trend in Table 14 under Class Accuracy for attack scenario 3.
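The training-set construction for attack scenario 3 can be sketched as follows. The helper `expose_obfuscated` and its signature are illustrative, not the code used in our experiments; `exposure` corresponds to the 0.2%, 2%, and 20% levels above.

```python
import numpy as np

def expose_obfuscated(X_train, y_train, cls, exposure, obfuscate_fn, rng=None):
    """Attack scenario 3 training set: replace up to an `exposure` fraction
    of the total training rows, drawn from class `cls`, with obfuscated
    versions. Labels are unchanged, so the classifier sees obfuscated
    samples under their true class."""
    if rng is None:
        rng = np.random.default_rng()
    rows = np.flatnonzero(y_train == cls)               # rows of the target class
    n = min(int(exposure * len(X_train)), len(rows))    # capped by class size
    picked = rng.choice(rows, size=n, replace=False)
    X_out = X_train.copy()
    X_out[picked] = obfuscate_fn(X_train[picked])       # obfuscate in place
    return X_out
```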

Conclusion and future work
In this research, we classified the CIC-Darknet2020 network traffic samples using a wide variety of classifiers. We classified based on four traffic classes and eight application classes, while fine-tuning the classifier hyperparameters. We experimented with different levels of SMOTE to address the class imbalance in the dataset, and we explored 2-D representations of the traffic features for the CNN and AC-GAN. We also approached the issue of darknet detection adversarially, from the perspective of an attacker hoping to confuse our best classifier. We demonstrated that we could effectively obfuscate application class traffic features. We then correlated the underlying statistics of the CIC-Darknet2020 dataset with the performance of this algorithm, assuming specific hypothetical attack scenarios for added realism.
Among the tested machine learning classifiers, Random Forest was found to be the most proficient at classifying darknet traffic for both traffic and application types. It yielded 99.8% F1-score for traffic classification and 92.2% F1-score for application classification, outperforming the state-of-the-art studies on CIC-Darknet2020 ( Iliadis and Kaifas, 2021;Sarwar et al., 2021 ). Figure 13 provides a visual comparison of our best results with those of prior work.
Our research was limited by the availability of darknet traffic datasets. We selected the CIC-Darknet2020 dataset because it is frequently cited and publicly accessible; however, the dataset suffers from a substantial class imbalance. We attempted to compensate for this imbalance by generating artificial samples with AC-GAN and SMOTE. The artificial SMOTE samples marginally improved our classification results. Seeking to improve the quality of the artificial samples, we assessed AC-GAN as a sample generator; however, our AC-GAN-generated samples were not useful for data augmentation purposes. An approach that future researchers might consider is to use clustering to group samples within a class, then train one GAN per cluster to generate samples. Other variations of GAN might also be better suited to multiclass sample generation and could conceivably generate more realistic samples.
We kept our obfuscations fairly basic, our goal being to demonstrate that we could confuse our best classifier, with few restrictions imposed on the hypothetical attacker. Under more realistic attack scenarios, it may not be possible to so easily modify the features that define darknets such as Tor and VPN, but it would be possible to obfuscate traffic features at the application layer, such as those produced by CICFlowMeter analysis. We introduced a loose correlation with one statistical metric, namely, the sum of distances between DCPDs computed independently across all sample features. We noted that 2 out of the 3 classes we chose to obfuscate were misclassified not as the intended classes, but with a majority of predictions distributed among other classes. This results from the fact that our obfuscation metric does not account for the statistical relationship between more than two classes, nor does it account for any dependency between the CIC-Darknet2020 feature values.
There is much more work that could be done to extend the adversarial obfuscation analysis presented in this paper. Real traffic features could be modified on live network traffic (e.g., changing IP addresses, ports, packet lengths, or intervals), or select features could be prohibited from modification during obfuscation, which is likely to be a realistic constraint. An even larger task is to explore the dependency between features in order to anticipate counterattacks. One possible avenue that future research could take with respect to the CIC-Darknet2020 dataset is to develop an obfuscation method that exploits Random Forest feature importance, or the weights of a linear SVM. This might better characterize the relationship between classifier response and dataset statistics. We only tested our obfuscation method against our best-performing classifier. It would also be interesting to explore how other classifiers respond to similar obfuscation techniques, so as to determine which classifiers are most robust to such attacks.

Author contribution
Mark Stamp proposed and guided the research, and edited the paper.
Nhien Rust-Nguyen performed the majority of the experiments, developed some of the key ideas used in this research, and wrote the first draft of the paper.
Shruti Sharma completed several of the experiments included in the paper.
Nhien Rust-Nguyen received her master's in computer science in May 2022. Her research interests are in applications of machine learning and deep learning.
Shruti Sharma will receive her master's in data science in December 2022. Her research interests are in applications of machine learning and deep learning.

Mark Stamp is a professor of computer science at San Jose State University. His primary research focus is on problems at the interface between information security and machine learning. He has published more than 150 research articles and textbooks in information security (Information Security: Principles and Practice, 3rd edition, Wiley, September 2021) and machine learning (Introduction to Machine Learning with Applications in Information Security, 2nd edition, Chapman and Hall/CRC, May 2022).