Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification


Abstract-With the increasing prevalence of encrypted network traffic, cyber security analysts have been turning to machine learning (ML) techniques to elucidate the traffic on their networks. However, ML models can become stale as known traffic features can shift between networks and as new traffic emerges that is outside of the distribution of the training set. In order to reliably adapt in this dynamic environment, ML models must additionally provide contextualized uncertainty quantification to their predictions, which has received little attention in the cyber security domain. Uncertainty quantification is necessary both to signal when the model is uncertain about which class to choose in its label assignment and when the traffic is not likely to belong to any pre-trained classes.
We present a new, public dataset of network traffic that includes labeled, Virtual Private Network (VPN)-encrypted network traffic generated by 10 applications and corresponding to 5 application categories. We also present an ML framework that is designed to rapidly train with modest data requirements and provide both calibrated, predictive probabilities as well as an interpretable "out-of-distribution" (OOD) score to flag novel traffic samples. We describe how to compute a calibrated OOD score from p-values of the so-called relative Mahalanobis distance.
We demonstrate that our framework achieves an F1 score of 0.98 on our dataset and that it can extend to an enterprise network by testing the model: (1) on data from similar applications, (2) on dissimilar application traffic from an existing category, and (3) on application traffic from a new category. The model correctly flags uncertain traffic and, upon retraining, accurately incorporates the new data. We additionally demonstrate good performance (F1 score of 0.97) when packet sizes are made to be uniform, as occurs for certain encryption protocols.
Index Terms-network traffic classification, virtual private networks, machine learning, discrete wavelet transform, encrypted traffic, cybersecurity

I. INTRODUCTION
The proportion and popularity of encrypted network traffic has quickly risen over the last several years, with over 95% of traffic across Google and 97 of the top 100 websites defaulting to encryption [1]. This is a boon for online privacy; however, there is a corresponding increase in the use of encryption for malware delivery [2], [3]. Beyond that, MITRE ATT&CK enumerates several techniques in which encryption has been used to obfuscate command and control or to make one application masquerade as another. Although signature-based techniques such as website fingerprinting [4], [5] and (e.g.) Palo Alto's App-ID can flag particular packet sequences that indicate a known website request or protocol handshake, the general problem of inferring an application when signatures and heuristics fail remains challenging. In particular, Virtual Private Network (VPN) providers have lowered the barriers for users not only to encrypt the ports, protocols, and IP addresses of their connections, but also to obfuscate the IP address of the VPN server itself [6] and/or pad all packets to be the same size before encryption [7], which can defeat signature detection capabilities. In light of these difficulties, network security analysts have begun turning to machine learning (ML) approaches that leverage latent signals in the packet timings, sizes, and their encrypted payloads in order to predict the applications that generated the encrypted traffic.
Existing work in machine learning for encrypted traffic classification has focused primarily on feature construction and model development to optimize raw performance metrics such as overall accuracy. For example, traditional machine learning models (such as naïve Bayes and decision trees) have been applied to simple statistical features [8], [9], and neural networks based on one- or two-dimensional convolutions have been applied to sequences of raw byte values, interarrival times, and wavelet coefficients [10]-[16]. Many of these techniques yield very good performance on publicly available data, the most popular being the ISCXVPN2016 dataset [9].
In this work, we build upon some previously developed features (interarrival statistics and wavelet signal processing) with an eye towards enabling rapid learning with limited data. In addition, we focus on developing an ML model with accurate uncertainty quantification that is able to (1) produce accurate probabilistic predictions for the classes on which the model is trained, and (2) detect examples dissimilar to application categories that the model was trained to predict. While there has been some limited mention of uncertainty quantification in cyber security applications of machine learning in general [17]- [20], the problem has not received the same attention in cyber security as it has in other domains, such as computer vision and healthcare. We consider this unfortunate and believe that in a dynamic environment such as a computer network, it is critically important to quantify model uncertainty. A recent review highlighted the need to reduce errors (false positives) when moving from a closed training set to real-world data for encrypted network traffic analysis [21].
In our work, we develop a classifier for encrypted network traffic based on prototypical networks, which leverage distances to class prototypes (means) in a learned embedding space to compute class probabilities and thus predictions [22]. We use the recently developed relative Mahalanobis distance between test examples and class prototypes to detect out of distribution (OOD) examples [23]. Our model uses as input a mixture of simple features derived from flow statistics as well as wavelet transform coefficients to produce class predictions for segments of bi-directional network flows. To enable regular predictions in VPN settings in which (1) the embedded 5-tuples for connections are encrypted and (2) the beginnings and endings of connections are obfuscated, we split all network flows into discretely sampled time segments and make one prediction for each time segment in each flow. We apply our model to a new dataset (which we have made publicly available) of encrypted network traffic collected on a testbed from ten applications that cover five broad, but not comprehensive, categories of traffic. This dataset supplements those already available in the literature, which either cover only very specific types of network traffic, such as the UC Davis dataset of QUIC traffic [14] or the Orange dataset of HTTPS traffic [16], or contain artifacts such as unencrypted packets that could affect the validity of models trained on them. When trained with an 80/20 train/test split, our model has a micro F1-score of 0.98. Further, we investigate our model's performance when training with only small fractions of the dataset, achieving (for instance) 95% accuracy on VoIP traffic in a 5-class problem using just under 7 minutes of data capture.
We stress-test the model by attempting to transfer its learning from the testbed to an enterprise network over several categories. We found that in some instances (a streaming and VoIP application), the model cleanly transfers onto another network with high accuracy and low uncertainty, and in others (two file transfer applications), the model successfully reports low in-distribution confidence and can be quickly retrained to incorporate the new applications for identification in the known category.
To test our model's ability to identify OOD examples, we perform inference on PCAP containing an unseen traffic category (Zoom) from an enterprise network and demonstrate that the model reliably assigns the new category a high OOD score, with over 77% predicted as significantly out of distribution. Upon retraining with about two hours of labeled examples of Zoom, which takes under 5 minutes, the fraction of significantly OOD test examples drops to 26%.
Finally, we test our model's robustness by investigating the case where all packets are padded to have identical sizes, as occurs in the Noise Protocol or in certain implementations of Traffic Flow Confidentiality [7], [24]. We see that the model (which utilizes both packet timing and size information) is still able to obtain very good performance (micro F1-score of 0.97) indicating that packet size information is not necessary for accurate classification. In addition, this observation calls into question the degree of privacy offered by such obfuscation strategies.
To summarize, the primary contributions of this paper are:
• a new public dataset of application-labeled, VPN-encrypted and non-VPN-encrypted network traffic,
• an ML architecture for encrypted traffic designed to quickly train with limited labeled data and provide calibrated, uncertainty-contextualized predictions for two tasks: (1) predictive uncertainty, when the model is indecisive about which known label a sample belongs to, and (2) model uncertainty, when the model encounters OOD data potentially from an unknown label, enabling the model to be systematically extended to new data, and
• a demonstration of high accuracy and F1 score for application data that is grouped into discrete time windows, instead of collecting statistics or data for an entire connection or relying upon fingerprints or signatures, enabling sequential application predictions under a VPN.

II. RELATED WORK
In this section, we describe existing work in the literature related to traffic classification in general, encrypted traffic classification, and website fingerprinting. Due to our focus on uncertainty in encrypted traffic classification, we briefly review relevant work on uncertainty in deep learning from the machine learning community. Finally, we discuss existing encrypted traffic datasets and our motivation for collecting our own.

A. Early Traffic Classification
In the early days of the Internet, censors used port-based traffic classification to restrict Internet content [25]. Port-based classification models could classify network traffic using the port numbers encoded in the packet headers because the Internet Assigned Numbers Authority maps services to unique port numbers [26]. In response, clever Internet users and application developers adapted to port-based censorship by dynamically changing port numbers, hiding behind port numbers already assigned to well-known, trusted applications, or using unregistered port numbers [27]-[29]. Deep packet inspection (DPI) then emerged as a technique that inspects the payloads of packets and classifies them accordingly [25], [30]-[33]. DPI is often used to check for malicious code, obtain situational awareness on a network, and eavesdrop on connections. For example, the operators of the Great Firewall of China have used DPI to inform active probing to identify and censor Tor [25]. However, with the rise of easy-to-use encryption schemes such as HTTPS and VPNs, DPI has lost its effectiveness. Although possible, traffic decryption is not a viable countermeasure, because decryption not only compromises user privacy but is also computationally expensive and difficult to implement [34]. VPNs in particular can effectively obscure payloads, ports, and IP addresses through encryption, leaving only packet timing and size information as potential signal sources.

B. Encrypted Traffic Classification
To overcome the limitations of port- and payload-based classification models, researchers have focused on traffic classification models that use features from observable encrypted traffic metadata [21], [34].
Some examples of features constructed from observable metadata include features based on simple statistics derived from flows [9], [28], [31], [34]-[43] and wavelet-based features [10], [44]. While traffic classification methods that use flow statistic-based features yield respectable results for isolated application classification in contained environments, it is uncertain whether they are robust enough to be utilized in real-world settings. Flow statistic-based features can vary widely between network topologies and are not able to characterize the highly variable and bursty nature of Internet traffic [45]-[48]. On the other hand, wavelet-based features can capture the inherent nonlinearities of Internet traffic, such as jumps and edges, by providing representations of the observable metadata from the bidirectional connection that are localized in both time and frequency [49]. Shi et al. [44] were the first to use wavelet-based features for traffic classification and showed that SVMs trained with wavelet features outperform those trained with flow statistic-based features. Later, [10] used Convolutional Neural Networks (CNNs) trained on wavelet-based features for traffic classification. We build on this approach by dividing connections into time windows rather than classifying each aggregate connection all at once, and by combining flow statistic-based features and wavelet features in each time window.

C. Website Fingerprinting
Website fingerprinting is a process of recognizing specific web traffic (e.g., to particular websites) based on unique patterns in the traffic [50]. This approach leverages the fact that websites tend to have unique packet sequence patterns (handshakes) that allow for identification. Traffic features can include information such as unique packet lengths, sequence lengths, packet ordering, and packet interarrival timings [51]. These features are then fed to ML algorithms that classify traffic based on the observed patterns. However, the identifying handshake must be learned beforehand and observed in a traffic sample to enable detection. Website fingerprinting techniques also tend to be sensitive to changes in network patterns such as packet padding and fluctuations in packet round trip time, and it is challenging to know when the models must be updated [52].
We envision our approach would work in tandem with website fingerprinting and signature-based approaches, allowing for contextual, uncertainty-quantified predictions to be made before or after a handshake has been identified. An example of this is the case where, within some packet capture, website fingerprinting identifies a handshake to a malicious website. Our approach could then also be applied to the connection to determine if any file transfer or new command and control channels occurred after the signature was detected.

D. Uncertainty in Deep Learning
Quantifying the uncertainty of model predictions is an active area of research in the machine learning community. Considerable research has been focused on improving the ability of classification models to output calibrated probabilistic predictions and to detect OOD examples at test time.
To be clear, a model's predictive probabilities are calibrated if their confidence (highest predicted probability) matches their accuracy. In other words, a model should correctly label 90% of the examples that it categorizes with 90% confidence. Guo et al. [53] showed that while shallow neural networks tend to be reasonably well calibrated, deep neural networks are not. Various procedures for improving calibration have been proposed including temperature calibration [53], deep ensembles [54], and Bayesian Model Averaging [55].
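This notion of calibration is commonly quantified by the Expected Calibration Error (ECE), which bins predictions by confidence and averages the gap between confidence and accuracy within each bin. The following is a minimal illustrative sketch (our own function, using equal-width bins), not an implementation from this paper or from [53]:

```python
# Sketch of Expected Calibration Error (ECE) with equal-width confidence
# bins; function name and binning scheme are illustrative choices.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-weighted average of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins; the right edge belongs to its bin
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()     # observed accuracy in bin
            conf = confidences[mask].mean()  # mean confidence in bin
            ece += mask.mean() * abs(acc - conf)
    return ece
```

A perfectly calibrated model (e.g., 80% accuracy among predictions made with 80% confidence) yields an ECE of zero.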
We now briefly review some of the major approaches to OOD detection in the machine learning literature. The degree to which examples are out of distribution is usually quantified either using the distribution of predicted probabilities for examples or some notion of distance to the distribution of training data. For example, the magnitude of the largest predicted probability was used in [56] as a score, encoding the assumption that if a data example is out of distribution, no class will be assigned a high probability. Other work [57] improved this approach by applying input preprocessing and softmax temperature scaling to improve separation between the largest predicted probabilities of in and out of distribution examples. In another approach [58], the authors model the output probabilities as a distribution and place a Dirichlet prior over them. During training, they attempt to enforce the assumption that uncertain examples produce a flat distribution of predictive probabilities by training on OOD examples. This idea is improved in [59], which removes the necessity for using OOD examples in training.
On the other hand, the authors in [60] fit class-conditional Gaussian distributions in the embedding space of a model with softmax output probabilities and compute Mahalanobis distances to the Gaussian distributions at test time. A very similar approach is taken in [61], but in the context of meta-learning and prototypical networks. In this case, the authors use the distance to the closest class prototype as an OOD score. Our approach is similar to that in [61], but instead makes use of the relative Mahalanobis distance from [23] and converts the OOD score to a p-value for greater interpretability.

E. Publicly Available Encrypted Traffic Datasets
The development of reproducible research in encrypted traffic classification is hindered by the shortage of publicly available encrypted traffic data. In 2016, the Canadian Institute for Cybersecurity published the ISCXVPN2016 dataset of VPN and non-VPN traffic from 20 applications and seven traffic categories [9]. As the first publicly available dataset of VPN traffic, the ISCXVPN2016 dataset has been used by many researchers to train and test ML models. Other datasets have since been released as well, such as the UC Davis dataset of QUIC traffic [14] or the Orange dataset of HTTPS traffic [16], but lack the diversity of applications and traffic categories.
Despite its widespread use, we found that the ISCXVPN2016 dataset contains discrepancies in the VPN captures that compromise the integrity of encrypted network traffic classification methods evaluated on it. Upon closer inspection of the PCAP files from the VPN captures, we find packets with unencrypted payloads. For example, Figure 1 shows the unencrypted payload of the 17th packet in the PCAP file for the ICQ Chat VPN capture, vpn_icq_chart1b.pcap. We also find PCAP files for VPN-labeled captures to contain multiple connections, although we expect the PCAP files for a VPN session to contain a single connection between the VPN client and the VPN server. For example, the vast majority of packets in the vpn_netflix_A.pcap file are over port 80 to a Netflix IP address, the majority of packets in vpn_hangouts_audio1.pcap resolve to a Google IP address, and the majority of packets in vpn_facebook_audio2.pcap resolve to a Facebook IP address (with a minority fraction also going to Google). These findings imply that either the network was tapped before the packets went through the VPN client or the packets were not encrypted.
Unencrypted packet payloads will unfairly strengthen traffic classification methods that leverage packet payloads, such as DPI or the deep packet approach in [15]. Additionally, the presence of multiple connections per VPN capture will unfairly strengthen methods that leverage connection information aggregated at the flow-level.
In light of these concerns, we recommend that researchers seeking to utilize the ISCXVPN2016 dataset ensure that their traffic classification methods do not presume that VPN-labeled packet payloads are VPN-encrypted. Additionally, if their methods for analyzing VPN-encrypted traffic rely on connection information aggregated at the flow-level, we recommend that they post-process the packet headers to ensure that all packets appear under the same 5-tuple, as they would inside a VPN connection. Alternatively, one could also replay the VPN-labeled files through a VPN.
Finally, there is also a challenge with respect to consistent assignment of the traffic categories to the collected PCAP files. As an example, the file skype_video1a.pcap could reasonably be categorized as VoIP or Web Browsing, but probably would have fit better with other video-teleconferencing PCAPs in its own VTC category.
In response to the limitations of the ISCXVPN2016 dataset, we introduce a supplemental, labeled dataset of VPN and non-VPN network traffic for public use and to test our extensible machine learning framework.

III. ENCRYPTED TRAFFIC DATASET
In this section, we describe the characteristics of our dataset and outline the process used to produce it. Our dataset is a collection of 37.5 GB of labeled network traffic that contains both VPN-encrypted and unencrypted flows from the five traffic categories shown in Table I, which are inspired by the categories in [9] with a few omissions (e.g., P2P, Web Browsing) and a new category (Command & Control). See Table V in Appendix A for the file mappings used to assign class labels.

A. Dataset Characteristics
Our dataset consists of 37.5 GB, 44,981 connections, and approximately 272 hours of packet capture from five traffic categories as listed in Table II. The relative breakdown of sizes, connections, and times per traffic category is visualized in Figure 2. Although the File Transfer category constitutes 87% of the total size of the dataset, it only contributes to roughly 36% and 5% of the total connections and time, respectively. In contrast, the Command & Control category only makes up about 2% of the dataset in size while comprising 30% of the connections and almost 47% of the total time. The Streaming category contributes approximately 11%, 29%, and 1% of the total size, connections, and time. The Chat category contributes less than 1% of the total size, 3% of the total connections, and 45% of the total time. Finally, the smallest traffic category, VoIP, makes up less than 1% of the total size and time, and less than 5% of the total connections.

B. Network Setup
To produce the dataset, we begin by creating virtual subnetworks for each traffic category as shown in Figure 3. Each subnetwork contains a client, a client DNS server, a VPN client, and a VPN server. The Skype subnetwork contains an additional client to allow for bidirectional chat. The video streaming and web browsing subnetworks are connected to the Internet to enable access to Firefox, Chrome, YouTube, Netflix, and Vimeo. As shown in Figure 3, VPN traffic is captured between the VPN client and the VPN server. Separately, non-VPN traffic is captured between the VPN client and the application layer.

C. Data Generation
Once the network setup is complete, we generate the dataset for each traffic category individually to ensure data purity. All traffic is captured using tcpdump and stored in the PCAP format. Filename-to-category mappings are provided in Table  V in the Appendix. The Streaming category includes Vimeo, Netflix, and YouTube traffic generated by team members watching videos on each platform. The VoIP category is composed of VoIP audio generated by team members having conversations with one another using Zoiper. The Chat category consists of Skype Chat traffic generated with bots that replay actual chats from publicly available logs from Gitter.im chat rooms [62].
The Command & Control category includes SSH and RDP traffic. RDP traffic is generated by team members actively performing tasks, such as editing Word documents, over an RDP connection. To emulate SSH traffic, we created a script to randomly sample 2 to 20 commands with replacement from a predetermined list of 17 commands. For each command, it samples a random wait time between 1 and 60 seconds. Once it is finished executing each command and wait time, it waits 60 seconds before repeating the entire process from the beginning.
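The SSH emulation loop described above can be sketched as follows. The command list and runner below are illustrative placeholders, not the authors' actual 17-command script:

```python
# Hypothetical re-creation of the SSH traffic emulation loop: sample 2-20
# commands with replacement from a fixed list, wait 1-60 s between
# commands, then idle 60 s before the next burst. COMMANDS is a
# placeholder subset, not the paper's actual command list.
import random
import time

COMMANDS = ["ls -la", "uptime", "df -h", "whoami"]  # illustrative only

def one_session(rng, run=print, sleep=time.sleep):
    n = rng.randint(2, 20)                  # commands per burst, inclusive
    for cmd in rng.choices(COMMANDS, k=n):  # sampled with replacement
        run(cmd)                            # e.g., send over the SSH channel
        sleep(rng.uniform(1, 60))           # random think time
    sleep(60)                               # idle gap before repeating
```

In a real deployment, `run` would execute the command over the SSH connection; injecting `run` and `sleep` keeps the loop testable.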
The File Transfer category contains SFTP, RSYNC, and SCP traffic. To emulate users performing file transfers, we created scripts that randomly choose a file out of a list of 15 files of sizes ranging from 1 KB to 1 GB. Then, they randomly decide whether to send the chosen file to or from a remote server. After executing the file transfer, the scripts wait 60 seconds before selecting a new file.

IV. METHODS
In this section, we discuss the features that we used as model input, the specifics of our model architecture, and how we define out of distribution scores.

A. Data Pre-processing
We group packets into connections based on the five-tuple consisting of the source IP address, destination IP address, source port, destination port, and protocol using the dpkt Python module. We split the connections into 40.96-second intervals and discard any intervals containing fewer than 20 packets. For each 40.96-second interval, we obtain the packet timestamps, payload sizes, and directions, discretized into 0.01-second bins.
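The windowing step might be sketched as follows, assuming packets have already been parsed and grouped by 5-tuple into `(timestamp, size, direction)` tuples; the function and constant names are our illustrative choices:

```python
# Illustrative sketch of the interval-splitting step: packets of one
# connection are bucketed into fixed 40.96 s windows and sparse windows
# (< 20 packets) are discarded. Names and structure are ours.
from collections import defaultdict

WINDOW_S = 40.96
MIN_PACKETS = 20

def split_into_windows(packets):
    """packets: list of (timestamp, size, direction), sorted by timestamp."""
    if not packets:
        return []
    t0 = packets[0][0]  # window clock starts at the first packet
    windows = defaultdict(list)
    for ts, size, direction in packets:
        windows[int((ts - t0) // WINDOW_S)].append((ts - t0, size, direction))
    # keep only sufficiently dense windows, in time order
    return [pkts for _, pkts in sorted(windows.items())
            if len(pkts) >= MIN_PACKETS]
```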

B. Feature Construction
For each 40.96-second interval, we compute aggregate statistics (such as total, average, maximum, etc.) for flow features such as interarrival times and time spent active or idle. We compute statistics for both the flow's forward and backward directions. We note here that in our current implementation, "forward" and "backward" are arbitrary; the first observed packet in the connection is designated as "forward" for the remainder of the connection. The full set of aggregate statistics features is listed in Table III and is very similar to those in [9]. We also generate wavelet-based features from the 40.96-second intervals using the discrete wavelet transform, following [44]. We select the Haar basis as the mother wavelet because it produces the maximum energy to Shannon entropy ratio. Before calculating the wavelet-based features, we extract the detail coefficients from frequency bands 0 to 12 of the wavelet-transformed connection sizes in the forward and backward directions using the PyWavelets Python package [63]. Then, we use the detail coefficients of each band to calculate the wavelet-based features from the connection sizes in the forward and backward directions. The wavelet-based features include the relative wavelet energy, Shannon entropy, and absolute means and standard deviations of the detail coefficients, as shown in Table IV. The absolute means and standard deviations of the detail coefficients are log-transformed to induce symmetry because their original distributions are heavily right-skewed. See Appendix A for details on how to calculate the wavelet-based features.
In total, this data pipeline collapses the (up to) gigabytes of network traffic that could flow across 40.96 seconds into two vectors of length 4096, which are then reduced to 129 features per time window. Some of these features are correlated and could be pruned (data not shown) to further compress the representation.
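As a rough illustration of the wavelet feature computation, the following sketch hand-rolls a Haar DWT for self-containment (the paper itself uses PyWavelets) and derives relative band energies, Shannon entropy, and log-transformed band statistics; the band count and feature layout here are our own choices:

```python
# Illustrative hand-rolled Haar DWT features: relative wavelet energy per
# band, Shannon entropy of the energy distribution, and log-transformed
# absolute means / standard deviations of the detail coefficients.
# The paper uses PyWavelets; this sketch is self-contained and approximate.
import numpy as np

def haar_detail_bands(x, levels):
    """Return the detail coefficients for each decomposition level."""
    a = np.asarray(x, dtype=float)
    bands = []
    for _ in range(levels):
        a = a[: len(a) // 2 * 2].reshape(-1, 2)
        bands.append((a[:, 0] - a[:, 1]) / np.sqrt(2))  # detail coefficients
        a = (a[:, 0] + a[:, 1]) / np.sqrt(2)            # approximation
    return bands

def wavelet_features(x, levels=12):
    bands = haar_detail_bands(x, levels)
    energies = np.array([np.sum(d ** 2) for d in bands])
    rel_energy = energies / energies.sum()
    entropy = -np.sum(rel_energy * np.log(rel_energy + 1e-12))
    # log-transform to reduce the right skew noted in the paper
    log_abs_means = [np.log(np.mean(np.abs(d)) + 1e-12) for d in bands]
    log_stds = [np.log(np.std(d) + 1e-12) for d in bands]
    return np.concatenate([rel_energy, [entropy], log_abs_means, log_stds])
```

On a length-4096 input (one 40.96 s direction vector at 0.01 s resolution), 12 levels exhaust the signal and this sketch yields 37 features per direction.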

C. Classification Model
Prototypical networks are neural networks that compute class probabilities using the distances to a set of so-called prototypical examples [22], rather than by applying a softmax activation function to the output layer. The neural network that forms the base of the prototypical network is usually referred to as an embedding function. We denote the embedding function as f_θ : R^D → R^E, where D is the dimension of the feature space, E is the dimension of the embedding space, and θ are weights to be learned. We assume access to a set X = {x_i} of examples with corresponding labels y_i ∈ {1, 2, . . . , K}. We will use the notation X_train, X_cal, and X_test to refer to partitions of the dataset. The prototype for each class k, 1 ≤ k ≤ K, which we denote as c_k, is simply the mean of the embeddings of a few support examples from that class. For a given observation, the class probabilities are calculated by applying the softmax activation function to the negative distances between the class prototypes and the embedding of the observation. Specifically, the probability of class k for input example x_i is

$$p_\theta(y = k \mid x_i) = \frac{\exp(-d_k(f_\theta(x_i), c_k))}{\sum_{k'=1}^{K} \exp(-d_{k'}(f_\theta(x_i), c_{k'}))}, \quad (1)$$

where each d_k is a distance function on the embedding space.
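The prototype and class-probability computations described above can be sketched in a few lines of numpy; the function names are illustrative and the embedding step is assumed to have already been applied:

```python
# Minimal sketch of prototypical-network classification: prototypes are
# per-class means in the embedding space, and class probabilities are a
# softmax over negative squared Euclidean distances to the prototypes.
import numpy as np

def prototypes(embeddings, labels, n_classes):
    """c_k = mean embedding of the support examples of class k."""
    return np.stack([embeddings[labels == k].mean(axis=0)
                     for k in range(n_classes)])

def class_probabilities(query_emb, protos):
    """Softmax over negative squared distances to each prototype."""
    d2 = ((query_emb[None, :] - protos) ** 2).sum(axis=1)
    logits = -d2
    logits -= logits.max()       # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

A query embedding close to a prototype receives most of the probability mass for that prototype's class.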
In our application, we use an embedding model with four fully connected hidden layers, each containing 64 neurons with ReLU activations. We perform 25% dropout on the weights connecting the second and third and the third and fourth layers. We learn our model's parameters using the episodic training paradigm [22]. In episodic training, for each of N_episode episodes, we sample N_query query examples and S_train support examples. The prototypes are computed from the support examples, and we compute the cross-entropy loss over the query examples. We summarize our training procedure in Algorithm 1.
Throughout the paper, we set N_episode to 20000, N_query to 512, and S_train to 5. As in [22], the choice of squared Euclidean distance during training roughly corresponds to spherical Gaussian densities in R^E; this motivates our use of the Mahalanobis distance for OOD scoring.

D. Out of Distribution Scores
We now describe how we obtain OOD uncertainty measures from the prototypical network. Our OOD score is based on the Mahalanobis distance, which measures the distance between a point and a distribution. Given a point x and a distribution with mean µ and covariance matrix Σ, the Mahalanobis distance between x and the distribution is

$$M(x; \mu, \Sigma) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}.$$

Algorithm 1 Train
Require: X_train, y_train, N_episode, N_query, S_train
for n = 1 to N_episode do
    Sample S_train support examples and N_query query examples
    Compute the class prototypes c_k from the support examples
    Calculate class probabilities as in Equation (1), where each d_k is the squared Euclidean distance
    Compute the cross-entropy loss over the query examples
    Compute the gradient of the loss over the query examples; update parameters θ
end for
Ensure: f_θ
For a dataset consisting of K classes, the relative Mahalanobis distance to class k is defined to be the Mahalanobis distance to class k minus the Mahalanobis distance to the overall data distribution [23]. That is,

$$\mathrm{RMD}_k(x) = M(x; \mu_k, \Sigma) - M(x; \mu_0, \Sigma_0),$$

where µ_k is the mean of class k, µ_0 is the mean of the dataset as a whole, Σ is the covariance matrix shared across the classes, and Σ_0 is the covariance matrix of the dataset as a whole. The authors in [23] argue that subtracting out M(x; µ_0, Σ_0) removes the outlier score contribution along dimensions that are not discriminative between in and out of distribution examples. This is especially important for scenarios in which a large number of dimensions are not discriminative.
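The relative Mahalanobis distance can be sketched directly from the definition above; helper names are ours, and the means and covariances are assumed to have been fit on support data beforehand:

```python
# Sketch of the relative Mahalanobis distance (RMD): class-conditional
# Mahalanobis distance (shared covariance Sigma) minus the distance to a
# single background Gaussian fit on all classes pooled.
import numpy as np

def mahalanobis(x, mu, cov):
    """M(x; mu, Sigma) = sqrt((x - mu)^T Sigma^-1 (x - mu))."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def relative_mahalanobis(x, mu_k, shared_cov, mu_0, cov_0):
    """RMD_k(x) = M(x; mu_k, Sigma) - M(x; mu_0, Sigma_0)."""
    return mahalanobis(x, mu_k, shared_cov) - mahalanobis(x, mu_0, cov_0)
```

A large positive RMD indicates the point is far from class k relative to the background distribution, i.e., a likely OOD example for that class.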
While the relative Mahalanobis distance is effective at discriminating between in and out of distribution examples, it is not particularly interpretable. To ease its use, after training is complete, we fit a kernel density estimate to the relative Mahalanobis distances computed on a set of held-out in distribution examples X cal . We use a support set of size S cal = 100 sampled from the training data to fit the means and full covariance matrices used in the relative Mahalanobis distance. We refer to this process as calibration and summarize our procedure in Algorithm 2.
At test time, to compute a p-value for a particular test point, we calculate the area under the density corresponding to the predicted class for distances larger than the relative Mahalanobis distance of the test point. To maintain the convention that large scores indicate OOD examples, we take one minus the p-value as our OOD score. We summarize this procedure in Algorithm 3.
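This score can be sketched as follows. For self-containment we hand-roll a Gaussian KDE with Silverman's rule-of-thumb bandwidth, which is one common choice and not necessarily the estimator used in our pipeline; all names are hypothetical:

```python
import numpy as np
from math import erf, sqrt

def ood_score(d_test, d_cal, bandwidth=None):
    """One minus the p-value of d_test under a Gaussian KDE fit to the
    calibration relative Mahalanobis distances d_cal (larger => more OOD)."""
    d_cal = np.asarray(d_cal, dtype=float)
    if bandwidth is None:
        # Silverman's rule-of-thumb bandwidth for a 1-D Gaussian KDE
        bandwidth = 1.06 * d_cal.std() * d_cal.size ** (-1 / 5)
    # p-value: KDE mass at distances >= d_test (Gaussian tail of each kernel)
    z = (d_test - d_cal) / bandwidth
    p_value = np.mean([0.5 * (1.0 - erf(zi / sqrt(2.0))) for zi in z])
    return 1.0 - p_value
```

A test distance far in the right tail of the calibration distribution scores near 1, while an in-distribution distance scores near its quantile.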

V. EXPERIMENTS AND RESULTS
In order to validate our framework, we perform a series of increasingly challenging tests.

Algorithm 2 Calibrate
Require: X_train, y_train, X_cal, y_cal, S_cal, f_θ
Sample a support set {x_s^k}, s = 1, ..., S_cal, from the training data for each class k
Compute µ_k for each class k, and µ_0, Σ, and Σ_0, as defined in Section IV-D with S = S_cal
Compute the relative Mahalanobis distances of the calibration examples X_cal and fit a kernel density estimate to them for each class

A. Model Performance on the Dataset
First, we train the model on the provided dataset using an increasing, stratified fraction of the dataset and compute the F1-scores (harmonic mean of precision and recall) on a held-out test set (20% of the data). We also compute the model's Expected Calibration Error (ECE) to measure the quality of its probabilistic forecasts. ECE measures the discrepancy between a model's predicted probabilities and the observed accuracy of those predictions; for more details, see [53]. For each training dataset size considered, we repeat the experiment ten times. The results for F1-score and ECE are shown in Figs. 4 and 5. The model's performance improves as we increase the fraction of data provided, both for F1-score and ECE. It is interesting to note that we achieve quite good performance with only a limited number of training examples: using just 10% of the data for training, we obtain a micro F1-score of nearly 0.96, and the calibration error is about 0.04, which indicates the model's confidence is fairly consistent with its accuracy. Marginal gains in calibration (lower ECE values) can be achieved via neural network ensembles as in [18] (data not shown); we speculate that the relatively shallow network, dropout, and Prototypical Network architecture helped to enable the low ECE values.

Fig. 4. Micro F1-score of the model against the fraction of data used to train the model. We plot the average F1-score over ten trials. The error bars indicate one standard deviation.

Fig. 5. ECE of the model against the fraction of data used to train the model. We plot the average ECE over ten trials. The error bars indicate one standard deviation.

Fig. 6 depicts the average confusion matrix over the ten training runs using an 80/20 train/test split, indicating that the model performs well in every category, with the worst confusion being 3% of File Transfer examples incorrectly predicted as VoIP.
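For reference, ECE with equal-width confidence bins can be sketched as follows; the ten-bin scheme is a common default and may differ from the exact variant used in our experiments:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width
    confidence bins; weights are the fraction of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # bin weight times gap between empirical accuracy and confidence
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```

A perfectly calibrated model (e.g., 80% accuracy at 0.8 confidence) yields an ECE of zero.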

B. Model Performance in a New Environment
Having established the "usual" ML metrics, we then turn to a more challenging problem of validation outside of the training dataset. To do these experiments, we used Wireshark to capture network traffic on a live, enterprise network, and then tested our model on this new traffic. We consider three increasingly challenging scenarios. First, can the model recognize traffic from an existing application in an existing category? Second, can the model recognize new applications in an existing category? Finally, can the model detect data that is out of distribution (comes from a new category)?
To answer the first question, we collected 40 additional instances of YouTube (an application in the Streaming category). We trained the model on 80% of our training dataset and then tested on the YouTube data. We repeated the experiment ten times. The results were quite good: the model predicted the correct class 40 out of 40 times in nine of the trials and 39 out of 40 times in the remaining trial, with on average only one of the examples receiving an OOD score above 0.95. Thus, even though the YouTube traffic was collected on a different network than the one that the training data came from, the model was still able to recognize its class (Streaming). This is an encouraging result, although we do not rely on this transferability holding true for all applications in the training set, as will be demonstrated next.
To answer the second question, we collected data from three applications that do not appear in the training data (Avaya, Samba (SMB), and the Code42 automated file backup program). In our labeling schema, we consider Avaya to be a VoIP application and SMB and Code42 to be File Transfer. We collected 123 examples of Avaya, 141 examples of SMB, and 202 examples of Code42. As before, we trained on 80% of our training dataset (for ten trials) and tested on the new applications. The model easily recognized Avaya as VoIP traffic, correctly labeling all 123 examples in each trial, although it was less certain about them, assigning about half of them OOD scores above 0.95. SMB and Code42, however, were not initially recognized as File Transfer. Encouragingly, the model tended to give these examples a high OOD score: across the ten trials, an average of 78% of the SMB traffic received an OOD score above 0.95, and an average of about 63% of the Code42 traffic received an OOD score above 0.95.
To see whether the model could learn to classify SMB and Code42 as File Transfer given a few training examples, we investigate retraining the model. For the case of SMB, we add 91 examples to our training set and test on the remaining 50. After retraining, the model correctly classifies an average of 92% of the test samples. In addition, the OOD scores are much lower: after retraining, only an average of 10% of examples had OOD scores above 0.95 (left side of Figure 7). Figure 7 also provides the predictive confidences before and after retraining with examples of SMB; predictive confidence refers to the probability associated with the class prediction (the highest probability among known classes). The distributions exhibit high confidences both before and after retraining. This indicates that it is important not to rely solely on the predictive confidences for uncertainty quantification, as they are essentially meaningless when applied to OOD test examples.
In the case of Code42, we add 42 examples to our training set and test on the remaining 160. The model now correctly classifies an average of 98% of the test samples. As with SMB, the OOD scores are now much lower as well, with only an average of 5% with scores above 0.95. In Figure 8, we show OOD score and predictive confidence before and after training with examples of Code42.

C. Model Performance on Padded Packets
Finally, in order to address any concerns about the model's resilience to encryption protocols that pad packets to be of uniform size, we repeated the initial testing on the collected dataset but masked the sizes of all packets to be 1500 bytes before feature generation. Upon retraining ten times with 80/20 train/test splits, the model retained an average F1-score of 0.971 ± 0.008 on the dataset. This is a marginal drop from the raw, unpadded data, at 0.982 ± 0.004. These results indicate that our model is not much affected by the effective removal of packet size information. In addition, this result presents initial evidence that packet padding alone may not offer substantial privacy benefits, at least at the level of granularity of traffic categories.
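The padding experiment's preprocessing amounts to overwriting every packet size before feature generation; a hypothetical sketch (the constant and function name are ours):

```python
# hypothetical sketch of the masking step: force every observed packet size
# to the padded MTU value (1500 bytes) before features are computed
PAD_BYTES = 1500

def mask_sizes(packets):
    """packets: iterable of (timestamp, size_bytes) pairs.
    Returns the same packets with every size replaced by PAD_BYTES."""
    return [(t, PAD_BYTES) for t, _size in packets]
```

After this step, only timing information survives into the feature pipeline.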

VI. DISCUSSION
The results of the previous section appear very encouraging, particularly since (1) the framework performs effectively, achieving high accuracy without requiring rigorous feature optimizations and using only timing and size information, (2) the dataset and model extended in part to another network for YouTube and VoIP, and (3) the model "degraded gracefully" when confronted with novel application data and was quick to retrain on limited data. We envision this framework being used in a tool that allows network analysts to systematically build insight into applications of interest, in conjunction with other analysis techniques (malware signatures, website fingerprinting, etc.), supplementing those techniques where they fail.
Shorter window lengths could hypothetically provide more label opportunities over the entire observation period, but arbitrarily short window lengths would reduce accuracy (additional analysis is provided in Figure 10). Longer window lengths within a VPN would also be more likely to merge multiple application types into the same sample, and our approach only predicts one application per time window; future work needs to address concurrent applications. We also note that this framework does not yet leverage continuity between time windows, even though certain connections tend to be shorter (loading a web page) or longer (streaming movies).
Our approach can also be attacked and degraded by manipulating the traffic, e.g., by adding "cover packets" to flood connections or mask timings, re-encoding traffic through another protocol [64] or by leveraging adversarial ML [65], [66]. However, our approach still retains an advantage in that it forces an adversary to work harder to obfuscate the nature of their connections. We also note that our approach could be used to "red-team" censorship evasion development by incorporating uncertainty in addition to the standard accuracy metrics.
We finally comment that although our dataset does not comprehensively capture the diversity of Internet applications, it was enough to "generate intuition" for a generic ML framework to predict time-windowed, encrypted application traffic on a different network, and we have demonstrated the importance of treating uncertainty as a first-class feature to enable a systematic extension of the dataset onto other networks or application categories of interest.

VII. CONCLUSIONS
We presented a labeled dataset of VPN-encrypted and unencrypted network traffic to facilitate machine learning applications seeking to leverage packet size and timing information to predict the traffic's most likely application categories. We discussed potential issues and possible workarounds for a popular alternative dataset, which motivated the organization of our dataset. We showed a machine learning framework leveraging wavelet-based and statistical features on discrete, sampled time windows in a Prototypical Network architecture that achieved high F1 scores (0.98) on the labeled dataset. Further, we showed the architecture could be trained on a small, random fraction of the dataset, achieving an accuracy of 0.96 on VoIP using fewer than 7 minutes of data. Additionally, we demonstrated low calibration errors for predictive confidence (3%). We also tested the model on application traffic (YouTube and a new VoIP application) from an enterprise network and found it extended accurately.
We used a technique for constructing an out-of-distribution (OOD) score, based on p-values of the relative Mahalanobis distance on in-distribution data, to reliably indicate model degradation in a series of increasingly challenging experiments. In all cases, the model retrained quickly with relatively few examples, learning to incorporate the new applications into a known category or assign them to a new one. We additionally demonstrated that the framework retains very high performance (F1-score of 0.97) even when all packet sizes are uniform, as occurs in encryption protocols that pad all packets to the same size before encryption. We finally discussed limitations of the current approach.

We use Table V to assign class labels to each PCAP file: if any of the words in the "File Keyword" column occurs in the file name, then we assign the label in the left column.

WAVELET-BASED FEATURES
This section provides details on how to calculate the wavelet-based features extracted from the wavelet-transformed connection sizes in the forward and backward directions for frequency bands 0 to 12. The wavelet-based features include the relative wavelet energy, Shannon entropy, and absolute means and standard deviations of the detail coefficients.
The continuous wavelet transform coefficients are defined by the equation

W(a, b) = (1/√a) ∫ x(t) ψ((t − b)/a) dt,

where x(t) is the time-dependent signal over the observation period [0, T] (for instance, bytes transferred in a flow during a sample interval), ψ((t − b)/a) is the wavelet function, a is the scale factor, and b is the translation factor. This work employs the discrete wavelet transformation, which restricts a and b to discrete values that balance temporal and frequency resolution while reducing computational burden. We employ the shift-invariant (stationary) wavelet transformation to ensure that the resulting features are insensitive to signal translations, otherwise known as the "algorithme à trous" [67], [68].
The time-dependent signal, x(t), required for the wavelet transform, is represented by a discrete vector over an observation period [0, T] that is discretized into N smaller time bins of width ∆t. Within each ∆t time interval, all of the packet statistics are aggregated. For instance, let t_j, y_j be the timestamp and size of packet j in a series of J flow-aggregated packets with t_0 = 0. Then a vector version of x(t), appropriate for the wavelet transform, is defined by

x_n = Σ_{j=0}^{J} y_j χ_j(n), where χ_j(n) = 1 if n∆t ≤ t_j < (n + 1)∆t and 0 otherwise,

with T = N∆t and n ∈ [0, N − 1].
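The binning above can be sketched as follows (function name and signature are illustrative):

```python
import numpy as np

def bin_signal(timestamps, sizes, dt, T):
    """Aggregate per-packet byte counts into N = T/dt time bins,
    producing the vector x_n consumed by the wavelet transform."""
    n_bins = int(round(T / dt))
    x = np.zeros(n_bins)
    for t_j, y_j in zip(timestamps, sizes):
        n = int(t_j // dt)          # bin index with n*dt <= t_j < (n+1)*dt
        if 0 <= n < n_bins:         # drop packets outside the observation window
            x[n] += y_j
    return x
```

For example, three packets of 100, 50, and 200 bytes at t = 0, 0.005, and 0.015 s with ∆t = 0.01 s land in bins 0, 0, and 1.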
We enforce that the length of the vector x_n be an integer power of 2 (N = 2^K for some K) by judiciously selecting ∆t and T, or by reflecting the signal in our pipeline. Applying Eq. 2 for fixed values of a = 2^k, with translations b aligned with the samples x_n, gives a rectangular coefficient matrix d_{n,k}, where n ∈ [0, N − 1] indexes the coefficients in time and k ∈ [0, K] indexes the coefficients in frequency band.

Relative Wavelet Energy
For each frequency band k, the relative wavelet energy ρ_k is given by

ρ_k = E_k / E_total,

where E_k = Σ_n |d_{n,k}|² is the energy of frequency band k (summing over n = 0, 1, ..., N − 1), and E_total = Σ_k E_k is the total energy of the signal after wavelet decomposition (summing over k = 0, 1, ..., K).
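In code, the relative wavelet energies for all bands can be computed at once; the array layout (rows index time, columns index frequency bands) is our assumption:

```python
import numpy as np

def relative_wavelet_energy(detail):
    """detail: (N, K+1) array of detail coefficients d[n, k].
    Returns rho_k = E_k / E_total for each frequency band k."""
    energy = np.sum(np.abs(detail) ** 2, axis=0)   # E_k per band
    return energy / energy.sum()                   # rho_k = E_k / E_total

# toy example: band 0 carries energy 2, band 1 carries energy 4
rho = relative_wavelet_energy(np.array([[1.0, 0.0], [1.0, 0.0],
                                        [0.0, 2.0], [0.0, 0.0]]))
```

The energies sum to one by construction, so ρ is a distribution over frequency bands.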

Wavelet Shannon Entropy
For each frequency band k, the wavelet Shannon entropy S_k is given by

S_k = − Σ_n p_n log(p_n),

where p_n = |d_{n,k}|² / E_k is the fraction of the band's energy carried by the n-th detail coefficient.

Detail Coefficient Statistics

For each frequency band k, the absolute mean of the detail coefficients µ_k is given by

µ_k = (1/N) Σ_n |d_{n,k}|,

and the standard deviation of the detail coefficients σ_k is given by

σ_k = ((1/N) Σ_n (µ_k − d_{n,k})²)^(1/2).

Figure 10 depicts the results of a range of experiments using three time-bin widths (∆t of 1, 10, and 100 milliseconds) and variable time windows, T. The general trend is that the longer the sample window T, the more accurate the model. However, this comes with the tradeoff that longer windows require longer vectors, particularly for small time bins (note that we capped the number of wavelet bands at 13 to limit feature-computation requirements). The second-to-rightmost blue datapoint represents (∆t, T) = (0.01, 40.96), the selected values for the analysis in this paper, as this neared the maximum accuracy and used an acceptable vector length for computing the wavelet features.
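The per-band entropy and detail-coefficient statistics defined in this appendix can be sketched together; the array layout (rows index time, columns index frequency bands) is our assumption:

```python
import numpy as np

def band_features(detail):
    """detail: (N, K+1) array of detail coefficients d[n, k]. Returns the
    per-band Shannon entropy S_k, absolute mean mu_k, and standard
    deviation sigma_k of the detail coefficients."""
    energy = np.abs(detail) ** 2
    p = energy / energy.sum(axis=0, keepdims=True)            # p_n per band
    logs = np.where(p > 0, np.log(np.where(p > 0, p, 1.0)), 0.0)
    entropy = -np.sum(p * logs, axis=0)                       # S_k, with 0*log(0) := 0
    abs_mean = np.mean(np.abs(detail), axis=0)                # mu_k
    std = np.sqrt(np.mean((abs_mean - detail) ** 2, axis=0))  # sigma_k
    return entropy, abs_mean, std
```

For a single band with coefficients [1, −1], the energy splits evenly, so the entropy is log 2 and the absolute mean is 1.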