A survey of methods for encrypted network traffic fingerprinting

Privacy protection in computer communication is gaining attention because plaintext transmission without encryption can be eavesdropped on and intercepted. Accordingly, the use of encrypted communication protocols is on the rise, along with the number of cyberattacks exploiting them. Decryption is essential for preventing attacks, but it risks privacy infringement and incurs additional costs. Network fingerprinting techniques are among the best alternatives, but existing techniques are based on information from the TCP/IP stack. They are expected to become less effective because cloud-based and software-defined networks have ambiguous boundaries, and network configurations that do not depend on existing IP address schemes are increasing. Herein, we investigate and analyze the Transport Layer Security (TLS) fingerprinting technique, a technology that can analyze and classify encrypted traffic without decryption while addressing the problems of existing network fingerprinting techniques. Background knowledge and analysis information for each TLS fingerprinting technique is presented herein. We discuss the pros and cons of two groups of techniques, fingerprint collection and artificial intelligence (AI)-based. Regarding fingerprint collection techniques, separate discussions on the ClientHello/ServerHello handshake messages, statistics of handshake state transitions, and client responses are provided. For AI-based techniques, discussions on statistical, time series, and graph techniques according to feature engineering are presented. In addition, we discuss hybrid and miscellaneous techniques that combine fingerprint collection with AI techniques. Based on these discussions, we identify the need for a step-by-step analysis and control study of cryptographic traffic to effectively use each technique and present a blueprint.


Introduction
Various studies are being conducted on privacy protection in the field of computer communication [1][2][3]. It is known that when plaintext is transmitted without encryption, it can be eavesdropped on and intercepted. Hence, encrypted communication protocols are being used to protect privacy by encrypting transmission data. In 2017, Gartner predicted that more than 80% of web traffic would be encrypted [4]. According to the statistics reported as of August 2022, more than 80% of web pages were loaded as Hypertext Transfer Protocol Secure (HTTPS) over Transport Layer Security (TLS) protocols [5].
Encryption communication protocols protect transmission data using encryption algorithms. Clients and servers negotiate the encryption methods and share encryption keys before transmitting the data. After the negotiation, the data are encrypted and transmitted. This method ensures that the confidentiality of the data is maintained by making it impossible to know the content, even if a third party attempts to eavesdrop on and intercept the data. However, encryption communication protocols can be used for malicious purposes, thus posing cybersecurity issues. Cisco expected that by 2021, more than 70% of web malware would encrypt traffic, whereas 60% of organizations would fail to detect web malware traffic [6].
Attackers threaten cybersecurity by exploiting the fact that third parties are incapable of accessing the contents of transmission data when encryption communication protocols are used, as shown in Figure 1. In general, network security equipment is installed in the actual network path or replicates communication packets during traffic inspection. However, the data being transmitted cannot be decrypted if the traffic is encrypted. Therefore, the network security equipment cannot check the contents and take necessary steps based on analyses. Hence, attackers exploit the limited visibility of network security equipment to conduct cyberattacks such as leaking internal information assets or sending triggers to malicious code to penetrate the internal network. It is, therefore, difficult to specify the attempt and scope of the attacks that use encrypted traffic, posing a significant threat.
To manage the risk of encryption communication protocol abuse, decryption must be performed to used in cryptographic communication protocols is expected to act as noise obstructing the input of traffic-type classifiers, thus reducing classifier accuracy. TLS fingerprinting is emerging as a method to compensate for these shortcomings.
TLS fingerprinting uses the handshake and header information of the TLS stack and encrypted application data to identify client applications and web servers and to identify and classify the types of content sent. Instead of relying on IP addresses, it generates fingerprints by extracting specific information (such as cipher suites and extensions) that characterizes a TLS handshake. It also targets encrypted packets to extract features and generate classifiers to respond to encrypted traffic. This technology can be used alone or in conjunction with other network fingerprinting technologies in the existing TCP/IP stack to improve accuracy. Cybersecurity, therefore, requires a broad understanding of TLS fingerprinting to respond to changing networks. This study provides a broad overview of encrypted network traffic fingerprinting techniques that analyze encrypted traffic without decryption. Fingerprinting is used to classify and identify client applications or servers. Fingerprints are generated from non-encrypted data in cryptographic negotiation (handshake) messages. They are then compared against fingerprint databases, or machine learning, artificial intelligence (AI), and statistical techniques are applied to a large number of datasets to generate identifiers and classifiers.
The contributions of this study are as follows. First, encrypted network traffic fingerprinting techniques are investigated, and several fingerprint generation and accuracy improvement methods are described. Second, the investigation results are analyzed by organizing the identifier and classifier generation techniques. They are categorized and described based on the method and features used for identification and classification. Lastly, journals, conferences, research papers, and internet documents are surveyed. The findings are classified and described using the taxonomy shown in Figure 2.
The manuscript consists of six sections. Section 2 explains the background information required for understanding the fingerprinting techniques. Section 3 describes encrypted traffic fingerprint techniques and studies based on them. Section 4 discusses identifier and classifier generation techniques. The last section concludes the paper by presenting the results and future scope of the current research.

Related survey
Our work provides extensive information on TLS fingerprinting gained through investigation and analysis. Prior to the investigation of detailed techniques, we discuss them in relation to a survey study of techniques targeting encrypted traffic [9][10][11][12]. Survey studies have been performed for classification [9,10], detection [12], and analysis [11] as problem domains. Among them, Refs. [9][10][11] focus on techniques using machine learning. In addition, we refer to relevant studies [13][14][15][16] for explanations of techniques using privacy, graph networks, time series data, etc., discussed in the paper. We summarize the related survey papers below, with Table 1 providing a comparison of the problem domains, methods, and protocols of the survey papers described.
Velan et al. [9] discussed classification and analysis techniques for Internet Protocol Security, TLS, Secure Shell Protocol, BitTorrent, and Skype protocol traffic. The classification technique divided the information extraction steps into non-encrypted initialization steps and the encrypted data transfer steps into payload- and feature-based methods. Payload-based methods include techniques that perform pattern matching on the payload, that is, the transmitted data, or that operate on information such as the payload size, port number, and IP address that can be obtained from packets without processing. Feature-based methods extract features from encrypted communication patterns and utilize maps, unsupervised methods, hybrid machine learning, and basic statistical analysis. Pacheco et al. [10] presented a study that systematically described techniques using machine learning for the classification of encrypted network traffic. The basic machine learning required for traffic classification and the representative workflows of traffic classification techniques using machine learning are discussed. Based on the workflow, an overview of the data collection, feature engineering, algorithm selection, and model deployment methods by category is provided.
Oh et al. [11] discussed methods available at the Security Operation Center for analyzing malicious network traffic encrypted with TLS. Machine-learning-based algorithms and encrypted traffic analysis techniques using middleboxes were mainly described. An overview was provided of how TLS interception can be performed over a middlebox without a secret key, and of how a machine learning pipeline can passively inspect TLS-encrypted traffic by sniffing traffic and extracting features or perform malware detection using TLS flow fingerprinting.
Papadogiannaki and Ioannidis [12] described cryptographic network analysis applications, technologies, and countermeasures in four use cases. These use cases consist of analytics, security, user privacy, and middleboxes. Analytics identifies protocols and users, security detects malicious traffic, user privacy detects data leakage and fingerprinting, and middlebox performs a deep packet inspection for man-in-the-middle attacks. Each case provides an overview of the application, technology, and response measures adopted. The datasets used in each study are also mentioned.
However, the researchers did not specify or describe TLS fingerprinting, which can be utilized additionally or alone in the technology stack of network fingerprinting. Existing survey papers on TLS fingerprinting provide only partial information on its use as a cryptographic traffic analysis technique, insufficient for a broad understanding of TLS fingerprinting. This paper aims to provide a broad perspective on TLS fingerprinting techniques without being specific to the purpose, technique, or data type.

Table 1. Comparison of the problem domains, methods, and protocols of the related survey papers.
Survey          Problem domains    Method               Protocols
[9]             Classification     Focus on ML-based    Various
[10]            Classification     ML-based             Various
[11]            Detection          Focus on ML-based    TLS
[12]            Analysis           Various              Various
Present study   Fingerprinting     Various              TLS

TLS
TLS is an encryption communication protocol designed to prevent eavesdropping, tampering, and message forgery of client-server data by providing end-to-end encryption. TLS 1.0, based on Secure Sockets Layer (SSL) 3.0, achieved communication privacy over the internet [17]. TLS 1.3 was published in August 2018, and in March 2021, TLS 1.0 and TLS 1.1 were deprecated by the Internet Engineering Task Force (IETF) [18]. Transmission data are protected in three steps: servers and clients are authenticated using certificates and public key (asymmetric key) algorithms, the encryption algorithm is negotiated and keys are exchanged, and the data are encrypted after negotiation [19][20][21][22]. The TLS structure is illustrated in Figure 3. It protects the data transmitted from application layer protocols (such as HTTP and FTP) as the payload of transport layer protocols (TCP and UDP). Traffic between communication peers is protected through the record protocol, and handshakes are conducted to share the encryption specifications and keys used for transmission data protection.

TLS handshake
Key exchange, server parameter setting, and authentication are the three stages of a handshake, as shown in Figure 4. The Client/ServerHello messages sent during the key exchange phase are not encrypted. A handshake encryption key is shared to protect the messages during this phase. The handshake encryption key is used until an application encryption key is shared in the server parameters and authentication phases. The application data are encrypted using N number of application encryption keys generated through key sharing. The Client/ServerHello messages, which are non-encrypted data, are used to generate fingerprints using cipher suites and extensions.

Markov chain
A Markov chain is a stochastic model with the Markov property, in which the probability of a particular event depends only on the state attained in the previous event [23]. A stochastic process is a set of random variables whose state changes stochastically over a certain period. It can be regarded as a collection of values obtained by observing the state of an object over time. A stochastic process can be categorized as discrete-time or continuous-time depending on the time of observation. A Markov chain generally refers to the discrete-time Markov process [24]. If $X_{n+1}$ is a specific state and $X_n$ is a historical state, then the Markov chain can be expressed as
$$\Pr(X_{n+1} = x_{n+1} \mid X_1 = x_1, \ldots, X_n = x_n) = \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n). \tag{1}$$
If $\Pr(X_{n+1} = j \mid X_n = i)$, which is a pair of $(i, j)$, is expressed as $p_{ij}$, then a Markov chain with $m$ states can be encoded as in Eq (2).
$$\mathbf{P} = \begin{pmatrix} p_{11} & \cdots & p_{1m} \\ \vdots & \ddots & \vdots \\ p_{m1} & \cdots & p_{mm} \end{pmatrix} \tag{2}$$
The matrix $\mathbf{P}$ is a two-dimensional transition probability matrix, where $i$, $j$, and $p_{ij}$ represent the row, column, and transition probability elements of the matrix. Since the elements of each row are the probabilities of transitioning from a particular state to every state, each row always sums to 1.
A Markov chain can also be represented by a state transition graph using a set of states and a transition probability matrix. For example, Figure 5 illustrates Markov chains with $m = 2$ and $m = 4$. Markov processes and chains are mathematical techniques for analyzing state changes and state characteristics by approaching complex stochastic processes with simple assumptions using a set of states and transition probability matrices [25,26]. The state transition graph of a Markov chain can be used to generate fingerprints based on changes in the message type and state of encrypted communication traffic.
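To make the transition probability matrix concrete, the following minimal Python sketch estimates it from observed state sequences by counting transitions and normalizing each row. The function name `transition_matrix` and the example message-type sequences are illustrative assumptions, not data or code from the surveyed studies.

```python
def transition_matrix(sequences):
    """Estimate a first-order transition matrix from observed state sequences."""
    # Fix an ordering of the m observed states so rows and columns are stable.
    states = sorted({s for seq in sequences for s in seq})
    index = {s: k for k, s in enumerate(states)}
    m = len(states)

    # counts[i][j] = number of observed transitions from state i to state j.
    counts = [[0] * m for _ in range(m)]
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[index[prev]][index[nxt]] += 1

    # Normalize each row so that it sums to 1. A state that is never left
    # (no outgoing transitions observed) keeps an all-zero row in this sketch.
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return states, matrix


# Hypothetical handshake message-type sequences (illustrative only).
flows = [
    ["ClientHello", "ServerHello", "Certificate", "Finished"],
    ["ClientHello", "ServerHello", "Finished"],
]
states, P = transition_matrix(flows)
for state, row in zip(states, P):
    print(state, row)
```

Each non-zero entry of the result corresponds to one edge of the state transition graph, so the same structure can be drawn as a graph or stored as a fingerprint.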

AI
AI is a field of study based on the conjecture that every aspect of learning and any other feature of intelligence can, in principle, be described precisely enough for a machine to simulate it [27]. AI, which aims to simulate intelligent human behavior, is widely used in fields such as data engineering and power demand forecasting [28]. It aims for accurate prediction in the form of machine learning systems.
Machine learning extracts features from a large amount of data and trains models such as artificial neural networks on them to perform classification and regression on future inputs [29]. With the development of deep learning techniques, which model human neural networks as deep neural networks, research in various fields and use cases has emerged [30]. Machine learning is used to analyze encrypted traffic, for example by extracting features from encrypted communication traffic, classifying benign and malicious software traffic, or identifying client applications and services.

Fingerprinting based on fingerprint collection
A technique based on fingerprint collection generates fingerprints for encrypted traffic, stores them in a database, and performs fingerprinting by exact or approximate matching. Fingerprints are generated based on Client/ServerHello messages, statistical techniques, and behavior such as the response to a particular sequence of request messages. They are represented by strings, data formats (XML, JSON), state transition graphs, etc. Because the technique relies on fingerprint information, its accuracy is poor when fingerprints are forged, altered, or unregistered. Research on improving accuracy using statistics and AI techniques is being conducted.

Technique based on ClientHello/ServerHello
This technique generates fingerprints by extracting values from the ClientHello/ServerHello messages sent and received during the key exchange phase. It is primarily a technique that identifies the processes of a client and the services provided by a server by comparing them to a prebuilt fingerprint database. Ref. [31] proposed a technique that distinguished client processes, taking advantage of the fact that the list of encryption algorithms offered to the server differed depending on the client using SSL. Subsequently, several studies have proposed fingerprint generation methods using various ClientHello fields based on cipher suites [32-35]. The fields used are presented in Table 2.
Cipher suites are commonly used for fingerprinting, while extensions are also used in SSL fingerprinting for p0f [32]. Brotherston [33] demonstrated the identification of real-world processes by adding various fields. Althouse [34] discussed the JA3 fingerprint, which uses extension and elliptic curve algorithm information for fingerprint generation, simplifying the fields used while retaining discriminative power. Fingerprint generation code and methods were shared as open source to broaden the user base. The JA3 fingerprint is also used as pulse information in Open Threat Exchange, a threat intelligence-sharing system. An example of a JA3 fingerprint obtained using the ClientHello message is shown in Figure 6. Joy, a package by Cisco for capturing and analyzing network flow data, uses TLS fingerprints [35]. Similar to the JA3 fingerprint, these fingerprints do not extract the contents of a specific extension for fingerprint usage, but the data of the extension can be included in the fingerprint. Although this technique has the advantages of fast identification and ease of use, fingerprints are generated only from the ClientHello messages. Therefore, the technique has a disadvantage in that several processes sending the same ClientHello message may share one fingerprint. Accuracy can be improved by utilizing additional techniques, such as machine learning, statistical techniques, and ServerHello information, to compensate for this shortcoming. ServerHello information helps because the ClientHello of a client may remain the same while the ServerHello of the servers it connects to differs across applications. JA3S uses ServerHello to generate server fingerprints, while Joy uses the cipher suite selected by the server [35,36]. JA3S combined with JA3 fingerprints provides additional information on connections. Thus, connections with the same ClientHello can be distinguished. Joy also allows classification between connections, but it uses additional fingerprinting information, such as other protocols and operating systems.
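As an illustration of ClientHello-based fingerprint generation, the sketch below follows the publicly documented JA3 recipe: the decimal values of the TLS version, cipher suites, extensions, elliptic curves, and elliptic curve point formats are joined with "-" inside each field and "," between fields, GREASE placeholder values are dropped, and the resulting string is hashed with MD5. The function names and the example field values are assumptions chosen for illustration, and parsing the ClientHello from a packet capture is left out.

```python
import hashlib


def is_grease(value):
    """True for GREASE placeholder values (0x0a0a, 0x1a1a, ..., 0xfafa)."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the comma-separated JA3 string from already-parsed ClientHello fields."""
    parts = [str(version)]
    for values in (ciphers, extensions, curves, point_formats):
        # Values keep their order of appearance; GREASE entries are ignored.
        parts.append("-".join(str(v) for v in values if not is_grease(v)))
    return ",".join(parts)


def ja3_hash(ja3):
    """MD5 digest of the JA3 string, rendered as 32 hexadecimal characters."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Hypothetical parsed ClientHello fields (illustrative values only).
example = ja3_string(
    version=769,
    ciphers=[47, 53, 5, 10, 49161, 49162, 49171, 49172, 50, 56, 19, 4],
    extensions=[0, 10, 11],
    curves=[23, 24, 25],
    point_formats=[0],
)
print(example)
print(ja3_hash(example))
```

Because the hash depends on the order of the values, two clients offering the same cipher suites in a different order produce different fingerprints, which is part of what makes the fingerprint discriminative.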
To improve ClientHello/ServerHello-based techniques, Anderson and McGrew [37] used knowledge bases combined with end host and network data to identify trends in enterprise TLS applications. This also enhanced the understanding of application behavior. Anderson and McGrew [38] presented a method to improve the identification and classification accuracy of ClientHello-based TLS fingerprints using destination context and pre-collected knowledge bases. In this study, when no exact match for the fingerprint information was found, a group of similar fingerprints was selected through approximate matching using the Levenshtein distance algorithm. The matching probability was then computed using a weighted naive Bayes model learned with the destination context and knowledge bases. The process with the highest probability was identified as the most likely one.
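The approximate matching step can be sketched as follows. This is a minimal illustration that assumes fingerprints are represented as tuples of numeric field values and that a hypothetical in-memory dictionary stands in for the fingerprint database; it covers only the selection of similar candidates by Levenshtein distance, not the weighted naive Bayes scoring that follows in the cited work.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def candidate_matches(query, database, max_distance=2):
    """Return (fingerprint, label, distance) candidates, closest first."""
    if query in database:
        return [(query, database[query], 0)]      # exact match wins outright
    scored = [(fp, label, levenshtein(query, fp)) for fp, label in database.items()]
    close = [entry for entry in scored if entry[2] <= max_distance]
    return sorted(close, key=lambda entry: entry[2])


# Hypothetical fingerprint database: cipher suite lists mapped to process labels.
database = {
    (47, 53, 5, 10): "app-a",
    (49161, 49162): "app-b",
}
print(candidate_matches((47, 53, 10), database))
```

The threshold `max_distance` is an assumed tuning parameter; a larger value returns more candidates for a later scoring stage to rank.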

Statistical-based
Korczyński and Duda [39] collected all handshake connections that occurred when accessing a particular service (PayPal, Twitter, Dropbox, etc.). Probabilities for the TLS protocol versions and the handshake message occurrences were derived and classified as Markov chains with parameters. Liu et al. [40] generated features by using length Markov models in addition to message-type Markov models [39]. Classification using machine-learning-based classifiers was performed. Liu et al. [41] proposed a method to improve the classification accuracy of application traffic by performing fingerprinting using cipher suite distribution as multi-attributions in addition to the length Markov models [40]. Classification accuracy was derived for applications such as MaMPF and improved accuracy was demonstrated [40]. An example of a statistical fingerprint obtained using a Markov chain and state transition graph is shown in Figure 7.
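One common way to classify with per-application Markov chains is to score an observed message sequence under each chain and pick the chain with the highest likelihood. The sketch below assumes this maximum-likelihood rule, a small probability floor for unseen events, and made-up training sequences; it is not necessarily the exact procedure of the cited studies.

```python
import math


def train_chain(sequences):
    """Estimate initial-state and transition probabilities from sequences."""
    init_counts, trans_counts = {}, {}
    for seq in sequences:
        if not seq:
            continue
        init_counts[seq[0]] = init_counts.get(seq[0], 0) + 1
        for prev, nxt in zip(seq, seq[1:]):
            row = trans_counts.setdefault(prev, {})
            row[nxt] = row.get(nxt, 0) + 1
    total = sum(init_counts.values())
    initial = {s: c / total for s, c in init_counts.items()}
    transitions = {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in trans_counts.items()
    }
    return initial, transitions


def log_likelihood(seq, chain, floor=1e-9):
    """Log-probability of seq under chain; unseen events get the floor value."""
    if not seq:
        raise ValueError("sequence must not be empty")
    initial, transitions = chain
    score = math.log(initial.get(seq[0], floor))
    for prev, nxt in zip(seq, seq[1:]):
        score += math.log(transitions.get(prev, {}).get(nxt, floor))
    return score


def classify(seq, chains):
    """Name of the chain under which seq is most likely."""
    return max(chains, key=lambda name: log_likelihood(seq, chains[name]))


# Hypothetical training sequences per service (illustrative only).
training = {
    "service-a": [["ClientHello", "ServerHello", "Certificate", "ServerHelloDone"]],
    "service-b": [["ClientHello", "ServerHello", "ChangeCipherSpec"]],
}
chains = {name: train_chain(seqs) for name, seqs in training.items()}
print(classify(["ClientHello", "ServerHello", "ChangeCipherSpec"], chains))
```

Working in log space avoids numerical underflow when long sequences multiply many small probabilities.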

Enhancements of statistical-based techniques
Chao [42] conducted a study to identify malicious encrypted traffic. The fingerprint was generated by adding the TCP handshake, the SSL/TLS message type, and the TCP four-way connection termination as fingerprint elements to the SSL/TLS handshake state transition-based fingerprint [39]. Encrypted traffic was also identified by generating a feature based on a second-order Markov chain derived from the generated fingerprint and utilizing it for machine learning.
Zhao et al. [43] classified and identified encrypted traffic by replicating traffic and analyzing the characteristics of traffic pre-filtered by header and handshake. Fingerprints based on a hidden Markov model were extracted to classify and identify applications. In this study, classification was performed on web application, Real-time Transport Protocol, voice over internet protocol, and video-audio streaming media traffic.

Behavioral-based
Figure 8. Example of behavioral fingerprint using a combination of sequences.
In 2019, Garn et al. [44] identified the web browser acting as the client application by sending a test set consisting of a specific combination of message sequences from the server during the TLS handshake process. When the browser sends a ClientHello message to the server, the server, configured with the proposed framework, sends a specific combination of server-response messages. The browser then transmits a response to each server message. During this handshake process, the responses sent by the client are recorded as a feature vector that serves as the browser fingerprint. In 2022, Garn et al. [45] discussed a two-step method for identifying a browser, in which ClientHello-based fingerprinting was performed first when the browser connected. The browser was identified using the earlier method [44] when no unique matching result was found. An example of a behavioral fingerprint using a combination of sequences is shown in Figure 8.

Artificial-intelligence-based fingerprinting
AI-based techniques extract features from encrypted traffic and train artificial neural networks to perform fingerprinting. Feature extraction utilizes statistical, time series, graph, and hybrid (or miscellaneous) techniques. The statistical technique extracts features by applying statistical methods to data obtained from the traffic, while the time series technique extracts features from meaningful information arranged in chronological order among the data obtained from the traffic. The graph technique extracts features using a traffic-tracking graph. A hybrid technique combines several techniques and other features. In particular, AI-based techniques can identify the transmitted content and the algorithm used for encrypting transmission data.

Statistical-based
Figure 9. Example of AI-based fingerprinting using statistical-based features.
In 2017, Dubin et al. [46] proposed a method that uses machine learning to extract features from video streaming traffic and classify the titles of videos uploaded to video platforms. The bit-per-peak feature, derived from the peak values in the download speed pattern during video streaming, is used. This allows a third party to classify video content transmitted through encrypted traffic. Yang et al. [47] extended the previous method [46] by fingerprinting fragments of encrypted video traffic with a Markov chain to specify which video was viewed. The state transition diagram of the fragment sequence of a video with a specific title uploaded to YouTube, a video platform, was modeled as a Markov chain. Thereafter, the title of the viewed YouTube video was identified using a machine learning model trained on the modeled information.
Al-Naami et al. [48] extracted the packet length, burst size, and burst time for the downlink and uplink from packet headers and configured a feature set called bi-directional dependent fingerprinting, which was used to train machine learning models on website and mobile application traffic in both directions.
In 2021, Kanda and Hashimoto [49] proposed a technique for identifying encryption algorithms and libraries from encrypted payloads to counter the randomization or modification of handshake message parameters used to bypass TLS fingerprinting. Based on the tests of NIST SP 800-22 "A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications," features were extracted, and the TLS version, cipher suite, and encryption algorithm were predicted [50]. An example of AI-based fingerprinting using statistical features is shown in Figure 9.
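To show what a randomness-test feature looks like, the sketch below implements the frequency (monobit) test, the first test in NIST SP 800-22: each 1 bit counts as +1 and each 0 bit as -1, the absolute sum is divided by the square root of the number of bits, and the p-value is the complementary error function of that statistic divided by the square root of 2. Using the p-value of a payload as a classifier feature is an assumption made for illustration; the exact feature set of the cited study is not reproduced here.

```python
import math


def monobit_p_value(data: bytes) -> float:
    """P-value of the NIST SP 800-22 frequency (monobit) test for a byte string."""
    n = len(data) * 8
    if n == 0:
        raise ValueError("payload must not be empty")
    ones = sum(bin(byte).count("1") for byte in data)
    s_n = 2 * ones - n                 # +1 for every 1 bit, -1 for every 0 bit
    s_obs = abs(s_n) / math.sqrt(n)    # normalized test statistic
    return math.erfc(s_obs / math.sqrt(2))


# Illustrative payloads: perfectly balanced bits versus all-zero bits.
balanced = bytes([0b10101010] * 100)
constant = bytes(100)
print(monobit_p_value(balanced))
print(monobit_p_value(constant))
```

A p-value below 0.01 is the usual threshold in the test suite for rejecting randomness; well-encrypted payloads are expected to pass, so differences between ciphers or libraries would have to come from combining several such tests.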

Time-series-based
Böttinger et al. [51] used the length of the TLS record protocol as a feature to train a machine learning model to generate a classifier. The classifier classified the file format to be transmitted. Classification was performed on application (ELF), voice (MP3), document (PDF), and image (JPG, WEBM) format files. In real scenarios, it was reported that fragment offset, uncertain compressor state, and symmetric block cipher padding acted as noise.
Zhang et al. [52] proposed features that could be used in techniques for fingerprinting websites encrypted with SSL/TLS and applied the features to a Deep Forest model. The proposed local request and response sequence feature was a value derived by grouping the size and number of incoming and outgoing packets according to various criteria. The developed model was then compared with other models using the F1 score.

Graph-based
Lu et al. [53] used traffic-tracking graphs to fingerprint websites. The traffic-tracking graph was generated by dividing packets into flows, where a flow refers to a collection of packets with the same IP, and by using the packet length, time interval, directional sequence, and timestamp of the first packet. Subsequently, website classification was performed using the proposed graph-based machine learning model. A system overview of the graph attention pooling network-website fingerprinting (GAP-WF) is shown in Figure 10.

Hybrid & miscellaneous
Richter et al. [54] presented four methods using HMAC size changes based on the selected cipher suite and TLS fragmentation. Five application layer protocols (namely, Hypertext Transfer Protocol, Simple Mail Transfer Protocol, Internet Relay Chat, Post Office Protocol 3, and Internet Message Access Protocol) were classified using Bayesian classifiers. The TLS fragmentation, analyzed traffic structures, packet length, and inter-arrival time used as features are shown in Figure 11. Each method was made unique by using different feature extractions.
Figure 11. Differences in the use of TLS fragmentation and network traffic statistics by methods [54].
Anderson et al. [55][56][57] attempted to increase the accuracy of malware classification using various features. A dataset with the TLS version, offered cipher suites, TLS extensions, selected cipher suite, client key length, and sequence of record lengths, times, and types as features was constructed, and malware detection was performed [55]. Malicious traffic can be classified using the sequence of packet lengths and times (SPLT) and the byte distribution (BD), with TLS, HTTP, and DNS metadata as features [56]. In the case of metadata for each protocol, data were extracted from the TLS handshake message, HTTP header, and DNS response without decryption. Malware detection was performed by training an L1-regularized logistic regression model. Malicious traffic can also be classified using SPLT, BD, and TLS metadata, together with a custom feature indicating whether the server certificate was self-signed [57].
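The two flow-level features named above can be sketched in a simplified form. The code below assumes that a flow is available as a list of (length, timestamp) pairs and that payload bytes are accessible; it builds a fixed-length vector of packet lengths and inter-arrival times and a normalized 256-bin byte histogram. The function names and the padding scheme are illustrative assumptions, and the cited studies may encode these features differently.

```python
def splt(packets, max_packets=5):
    """Lengths and inter-arrival times of the first max_packets packets.

    packets: list of (length_in_bytes, timestamp_in_seconds) tuples in arrival order.
    Both output vectors are zero-padded to max_packets entries.
    """
    lengths, gaps = [], []
    prev_time = None
    for length, timestamp in packets[:max_packets]:
        lengths.append(length)
        gaps.append(0.0 if prev_time is None else timestamp - prev_time)
        prev_time = timestamp
    pad = max_packets - len(lengths)
    return lengths + [0] * pad, gaps + [0.0] * pad


def byte_distribution(payload: bytes):
    """Relative frequency of each of the 256 byte values in the payload."""
    counts = [0] * 256
    for value in payload:
        counts[value] += 1
    total = len(payload)
    return [c / total if total else 0.0 for c in counts]


# Hypothetical flow data (illustrative only).
lengths, gaps = splt([(100, 0.0), (1500, 0.5)], max_packets=3)
histogram = byte_distribution(b"\x00\x00\xff\x01")
print(lengths, gaps)
print(histogram[0], histogram[1], histogram[255])
```

Concatenating these vectors with handshake metadata yields a single numeric feature vector that a classifier such as logistic regression can consume.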

Discussion
Based on the techniques discussed so far, a step-by-step TLS fingerprinting technique is presented in Figure 12.
First, in encrypted traffic fingerprinting, the TLS packets exchanged between the client and server are analyzed. If the packets to be analyzed are ordered by time, they can be divided into the key exchange, server parameter and authentication, and application data phases, which are termed Phase 1, 2, and 3, respectively, for convenience. Phase 1 is understood as a connection establishment attempt phase, Phase 2 as a connection establishment phase, and Phase 3 as a connected phase.
In Phase 1, pre-connection analysis is performed based on the Client/ServerHello fingerprints. In Phase 2, analysis is performed during the connection setup (handshake) using the metadata of the entire handshake phase. In Phase 3, analysis is performed while connected. Phase 1 has the advantage of quick analysis and pre-detection, but it is vulnerable to handshake manipulation and is relatively less accurate because the analysis is performed using limited information. Phase 2 uses more information to perform higher-accuracy detection but may still be vulnerable to handshake manipulation. It also has the disadvantage of being incapable of detecting content (e.g., insider information leakage) transmitted after the connection is established. In Phase 3, the information extracted from the application data is used to detect the transmitted content based on the connection information that was analyzed until Phase 2. More detailed fingerprinting can be performed by using a greater number of phases, but significant time and computing resources are consumed as more information is used. Therefore, applying the technology step by step according to policy will increase the efficiency of the fingerprinting process. For example, the phases applied may be configured differently depending on the importance and type of asset. For systems where identifying "who approaches" is important, the client should be analyzed using Phases 1 and 2, and for important assets that can cause serious damage in case of leakage, all three phases should be used to identify content and detect insider information leakage.

Conclusions
In this study, encrypted traffic fingerprinting is analyzed by dividing it into fingerprint collection and AI techniques. Several advantages and disadvantages for each technique are identified.
Fingerprint collection techniques have the advantage of easy system construction when the fingerprint generation methods are clear. In addition, identification and classification are performed through a fingerprint database comparison, which consumes less time and computing resources. With these techniques, pre-detection is possible because identification and classification can proceed during the handshake process. However, their accuracy is poor when fingerprints are not registered in the fingerprint database or when the client and server falsify or manipulate the information used for fingerprint generation.
AI-based techniques can perform calculations using various features and can infer encrypted and transmitted contents along with the encryption algorithms used. This is advantageous because they can identify leaked contents and report whether information was leaked during detection. However, detection in advance is difficult because they target application data or must collect information on the entire connection before extracting features and producing results.
As shown in Figure 12, a study capable of performing fingerprinting step-by-step should be conducted in the future to combine the advantages and offset the disadvantages of each technique.