Signature Identification and User Activity Analysis on WhatsApp Web through Network Data


WhatsApp messenger is a popular instant messaging application that employs end-to-end encryption for communication. WhatsApp Web is the browser-based implementation of WhatsApp messenger. Users of WhatsApp communicate securely using the SSL protocol. Encryption and the use of a common port by multiple applications pose a challenge for traffic classification and application identification. Network traffic analysis is nevertheless needed for QoS, intrusion detection and application-specific traffic classification. In this paper, we perform traffic analysis on the network packets captured during data transfer in WhatsApp Web. We explore user activities such as text messaging, contact sharing, voice messaging, location sharing, media transfer and status viewing. Packet-level traffic analysis of these user activities reveals patterns in the encrypted SSL communication. These patterns are identified across SSL packet lengths for WhatsApp media transfer and voice message communication. Another important feature of WhatsApp is the ability to view the status of a message that has been sent. We identify the read and unread message status in these data packets by exposing signatures in the network layer. These signatures are identified with the help of the SSL lengths in the TLS header information of WhatsApp Web network traffic traces. The other information on WhatsApp traffic presented in our study is relevant to WhatsApp Web v0.3.2386.

demonstrate an inadequacy in the encryption presented by WhatsApp in terms of leakage of information. Enterprises and Institutes can reproduce the methodology formulated in this paper to help impose policies for efficient network monitoring of WhatsApp Web. Along with network traffic pattern analysis, this paper exhibits an extensive study of WhatsApp Web's network traffic by considering traffic trace information, primarily the SSL length, to arrive at a signature that could possibly classify WhatsApp's read receipts.

2. Key Contributions:
2.1) Considering the need for research on WhatsApp Web data, this paper proposes a study of various user activities of WhatsApp Web and a signature analysis of read receipts.
2.2) The study covers text messaging, media sharing, contact sharing, location sharing, voice messaging and status viewing, which are cited as "user activities" throughout the paper.
2.3) The WhatsApp Web network traffic dataset, which contains only SSL data packets, is generated in real time for various scenarios.

2.5) The pattern of SSL packet lengths is analysed to identify significant information about WhatsApp Web user activities.
2.6) From the TLS record layer length field, the read and unread status of messages is explored at the network packet level.
3. Paper organization: The paper first highlights the related work on network traffic analysis and signature identification in Section 4, then describes the procedure of traffic generation, capture and filtering in Section 8. Section 8 and Section 9 discuss the analysis of the captured network traffic. The algorithmic process of traffic pattern identification and signature verification is explained in Section 6, followed by the conclusion and references.

RELATED WORK
One of the most important aspects of network management has been network traffic classification. With the skyrocketing growth of the Internet, the composition of network traffic is becoming complex and varied. Accordingly, network traffic classification has come a long way from the traditional methods of port-based classification to the Deep Packet Inspection method to flow-based statistical method to host-based behavior classification. Introduction of encryption and obfuscation techniques has resulted in a significant rise in encrypted network applications and traffic in network [15]. According to the Dornger in [16], encrypted network traffic classification can be studied by considering classification based on packet information or based on host/social (popularity of the host which is the number of hosts communicating with it) information.
With the popularization of the Internet and high speed networks, network traffic is being generated at an enormous rate. The introduction of encryption has helped in securing users' personal information and maintaining their privacy. Consequently, a lot of research is being carried out to test the credibility of encrypted traffic. One way is to identify a signature in the network traffic that corresponds to a particular application or a user activity of an application. Network traffic consists of header information such as ports, IPs, payload length and protocol, which is currently used to classify applications [1]. Identification of a signature in the payload length is a technique commonly used to classify an application with high accuracy [2][3][4]. Some researchers found a way to automate the process of signature identification by finding a string or hex subsequence in the payload [5][6]. [7] proposed a behaviour signature for Internet traffic identification that assessed the first few packets of multiple traffic flows. This signature proved successful, obtaining a precision of 100% in categorizing ten different applications. [8] was able to discern SSL/TLS traffic using signature matching methods.
Instant Messaging applications are a primary medium for communication on the Internet. Thus, significant research is being done to identify signatures in these applications to validate their encryption and privacy features. [9] displayed a dependable framework to identify Viber traffic over the network layer and also went on to classify voice/video calls and chat messages (voice/text/media) using signatures found in TCP payload sizes. [10] has provided a comprehensive study that investigates information leakages by analyzing packet sizes. Operating System (iOS/OSX) of devices used has been identified based on packet lengths and direction pairs. Language used and plaintext length of the message has been characterized by examining the count of packet lengths and direction pairs. Finally, the study also distinguishes user actions with the help of signatures that have been found in the packet lengths. [11] implements a system to meticulously identify Skype traffic over the network by revealing signatures with regard to the UDP flows of Skype. Skype VoIP calls are detected by finding a signature in the port usage and payload size of the packets in a session [12]. [13] shows that the behaviour of WhatsApp voice calls can be studied by analyzing the WhatsApp traffic between users. An algorithm was also developed to characterize voice calls from other applications such as text messaging or media sharing which considered UDP flow information and traffic time series to perform the classification.
This study is centered on WhatsApp Web network traffic and identifies signatures of read receipts based on the SSL packet length field in the TLS header. It also exposes a limitation of WhatsApp's end-to-end encryption by revealing some user information in the network layer.
Many studies have examined the encryption aspect of instant messaging applications to validate their secure communication methodologies. Sudozai et al. [17] used TCP payload sizes to distinguish Viber traffic through a framework that classifies voice/video calls and chat messages over the network layer. Scott et al. [18] used packet size, direction pairs and the count of packet lengths of iMessage (Apple's messenger application) network traffic to perform various classifications such as OS of the device, language used and plain text length. Jayeeta et al. [19] presented a case study on Google Hangouts by studying the domain name servers (DNS) and further using them to classify the Google application traffic as Gmail, Hangout or Google Plus. This is done by following a packet-based identification (ports and packet lengths). Alshammari et al. [21] used a flow-based feature set consisting of 22 features, such as protocol, duration of flow, number of packets in the forward direction and mean forward packet length, to identify VoIP traffic of Gtalk and Skype. The authors performed this classification with the help of three machine learning algorithms, namely C4.5, AdaBoost and Genetic Programming, and concluded that C4.5 had the best classification performance. Wongyai et al. [22] conducted traffic analysis of the packets that are generated when the Facebook homepage is being loaded. After investigating these packets, the authors summarized the number of components loaded and the order in which they are loaded, the number of TCP streams used and the number of servers accessed while the Facebook homepage is being loaded.
Recently, new studies have been conducted on the widely used WhatsApp application, on which little research work exists. Yanjie Fu et al. [23] tried to classify user activities of WeChat and WhatsApp based on the packet lengths in the traffic and the time delay between the sessions. Antonio et al. [13] proposed a blind traffic classification for VoIP calls in WhatsApp messenger on Android. They obtained results based on the data rate and protocols that were used by the WhatsApp messenger application.
In this paper, the aim is to study various user activities of WhatsApp Web. These activities include text messaging, media sharing, contact sharing, location sharing, voice messaging and viewing statuses, which are cited as "user activities" throughout the paper. The study is done through the inspection of the network traffic packets. It also concentrates on the SSL packet lengths to identify any significant information. To the best of the authors' knowledge, not much work has been reported in this area.

PROPOSED METHOD

WhatsApp User Activity Analysis:
The raw network data is monitored through the Wi-Fi port and captured. The captured packets are then saved as pcap files through Wireshark. The pcap data contains the traffic of all applications on the network, from which only the WhatsApp data is filtered using the SSL header. The network traffic pattern is then analyzed. Once the signatures have been identified by traffic analysis, an algorithm is created to help classify the different read receipts from the traffic traces. To apply this algorithm, the methodology represented by the flow chart in Figure 2 is followed. Since the signature is obtained from one direction of the communication (WhatsApp server to sender), traffic traces are converted to unidirectional flows. A flow is a 5-tuple that consists of source IP, source port, destination IP, destination port, and transport protocol.
6.2) Extraction of WhatsApp Server to Sender Flow: Many unidirectional flows are created from each source IP to each destination IP. From this set of flows, only the WhatsApp server (source IP) to sender (destination IP) flow is extracted, as the signature is found only in this flow.
6.3) SSL Length retrieval: The unidirectional flows consist of various header information. However, only the SSL lengths are retrieved from the flow because the signature is identified based on them.
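The flow conversion and extraction steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each packet has already been decoded into a dictionary, and the field names (`src_ip`, `ssl_length`, and so on) as well as the function names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """5-tuple that identifies a unidirectional flow."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


def to_unidirectional_flows(packets):
    """Group packet records (dicts with hypothetical keys) by their 5-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        key = FlowKey(pkt["src_ip"], pkt["src_port"],
                      pkt["dst_ip"], pkt["dst_port"], pkt["protocol"])
        flows[key].append(pkt)  # capture order is preserved inside each flow
    return dict(flows)


def server_to_sender_lengths(flows, server_ips, sender_ip):
    """Keep only server-to-sender flows and reduce each one to its SSL lengths."""
    return {
        key: [p["ssl_length"] for p in pkts if p.get("ssl_length") is not None]
        for key, pkts in flows.items()
        if key.src_ip in server_ips and key.dst_ip == sender_ip
    }
```

Each value of the returned dictionary is an array of SSL lengths for one server-to-sender flow, which is the form of input the verification algorithm expects.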
6.4) Algorithm Implementation: The purpose of this algorithm is to verify whether the SSL lengths from the WhatsApp server to sender flow contain the signature. The retrieved SSL lengths are grouped in an array, which is fed as the input to the algorithm. The algorithm proceeds as follows. It parses through the array of SSL lengths and compares each value of the array to the identified signatures. When a value of the array matches the first value of either signature, the subsequent values of the array are checked sequentially against the read receipt signatures. If a match is found for one of the signatures, that particular read receipt is returned by the algorithm.
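The matching procedure described above can be sketched in Python. The signature values below are placeholders invented for illustration only; the actual values are those tabulated in Table 4. The names `HYPOTHETICAL_SIGNATURES` and `classify_read_receipt` are likewise assumptions rather than the authors' code.

```python
from typing import Dict, List, Optional, Sequence

# Placeholder values only: two made-up signatures per read receipt.
HYPOTHETICAL_SIGNATURES: Dict[str, List[Sequence[int]]] = {
    "single check mark": [(101, 102, 103), (111, 112, 113)],
    "two check marks": [(201, 202, 203), (211, 212, 213)],
    "two blue check marks": [(301, 302, 303), (311, 312, 313)],
}


def classify_read_receipt(ssl_lengths: Sequence[int],
                          signatures: Dict[str, List[Sequence[int]]]
                          ) -> Optional[str]:
    """Return the first read receipt whose signature occurs in ssl_lengths."""
    for i, value in enumerate(ssl_lengths):
        for receipt, candidates in signatures.items():
            for sig in candidates:
                # Only start a sequential comparison when the first value matches.
                if not sig or value != sig[0]:
                    continue
                if tuple(ssl_lengths[i:i + len(sig)]) == tuple(sig):
                    return receipt
    return None  # no signature found in this flow
```

For example, `classify_read_receipt([50, 301, 302, 303], HYPOTHETICAL_SIGNATURES)` returns `"two blue check marks"` under these placeholder values.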

EXPERIMENTAL SETUP
The WhatsApp Web traffic data is captured in real time and analyzed as described below:

TRAFFIC COLLECTION
Availability of data is the basis of any traffic classification research. The focus of this paper is on the identification of traffic patterns and signatures in WhatsApp Web user activities. Hence, traffic traces pertaining to these activities of WhatsApp Web communication are the primary requirement.

8.1) Traffic Generation
The initial requirement is WhatsApp Web network traffic, which should be generated without any interference from other applications. The network traffic is introduced by exchanging any message or media with a recipient using WhatsApp Web as the medium. It is made sure that only WhatsApp Web is running on the host machine. No other applications should be active in the background, to avoid noise. The version of WhatsApp Web used in this study is v0.3.2390.
The common user activities in WhatsApp Web are described in Table 1. It is to be noted that calling features such as voice call and video call are not available in WhatsApp Web. No traffic trace data of WhatsApp user activities is available in the literature. Therefore, a method for WhatsApp Web traffic generation has to be devised. The traffic thus generated is considered as the ground truth in this paper. In this traffic generation activity, it is ensured that the machine generates network traffic pertaining only to WhatsApp Web. This is done by ensuring that no other application responsible for generating network traffic is running on the machine while WhatsApp communication is happening. In addition, apart from WhatsApp Web traffic, no browsing traffic is generated. It is also ensured that WhatsApp communication takes place only between two communicating parties. The version of the WhatsApp application in use is mentioned in Table 2.
In this scenario, both the sender and receiver exchange media files such as documents, images, audios and videos using WhatsApp Web. Both the sender and receiver use Wi-Fi connectivity. It is to be noted that, while exchanging media files, the sender needs to transfer a new file every time because WhatsApp maintains a local cache of files that have already been exchanged and does not retransmit them. If a file that has been sent earlier needs to be resent, it has to be deleted from the conversation chat history in WhatsApp. It is also ensured that only one media file is transferred at a time and that text messages are not sent.

8.1.1.2) Text Messages:
In this scenario, the sender and receiver exchange only text messages of any length.

8.1.1.4) Voice Message:
In this scenario, the sender and receiver exchange only voice messages.

8.1.1.5) Location Sharing:
In this scenario, only the sender can send the location using the mobile application, as WhatsApp Web does not provide the ability to share locations.

8.1.1.6) Status:
In this scenario, the traffic is generated by viewing the status uploaded by any of the user's contacts. It is to be noted that WhatsApp Web does not provide the user the ability to upload media status.

8.1.2) Sender using WhatsApp Web and Receiver using WhatsApp Messenger on Wi-Fi or vice versa:
In this scenario, the traffic is generated in the same way as the first case for all activities with the only difference being that the receiver uses the WhatsApp messenger on a mobile device.

8.1.3) Sender using WhatsApp Web on Wi-Fi, Receiver using WhatsApp Messenger on mobile data or vice versa:
In this scenario, the traffic is generated in the similar way as the first case for all activities with the only difference being that the receiver uses WhatsApp messenger on a mobile device and is connected to the Internet using mobile data instead of Wi-Fi.
For signature identification in this WhatsApp data, three different experiments are carried out to analyze the three read receipts, as below:
8.1.3.1) Receiver is offline: The sender sends a message or media in WhatsApp Web while the receiver is offline. This gives a single check mark for the message that has been sent.
8.1.3.2) Receiver is online but not chatting with the sender: This experiment is the same as the above, but the receiver is online and has not read the message yet. This gives two check marks for the message that has been sent.
8.1.3.3) Receiver is online and chatting with the sender: In this experiment, the receiver is in the sender's chat when the message is being sent. The message is therefore read instantly by the receiver, which gives two blue check marks for the message at the sender end.

8.2) Traffic Capture
The WhatsApp traffic, thus, generated needs to be captured for further analysis. The traffic is always captured at the communicating endpoint, which uses WhatsApp Web irrespective of sender or receiver. Wireshark is a GUI based packet sniffing tool, used to capture the traffic of various user activities of WhatsApp Web, store in traffic trace files, and analyze the network protocols and various other header information packets. It allows interactive viewing of packet data from live network or from a previously saved capture file. Wireshark's native capture file format for the traffic trace files is pcap format, which is also the format used by tcpdump.
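For readers who want to process saved traces outside Wireshark, the sketch below reads packet records from a classic pcap file using only the Python standard library. It is an illustrative assumption, not part of the paper's tooling: it handles only the microsecond-resolution classic pcap layout (a 24-byte global header followed by a 16-byte header per packet) and rejects other formats such as pcapng. The function name `read_pcap_records` is hypothetical.

```python
import struct


def read_pcap_records(data: bytes):
    """Yield (timestamp_seconds, frame_bytes) for each record in classic pcap data."""
    if len(data) < 24:
        raise ValueError("data is shorter than the pcap global header")
    magic = struct.unpack("<I", data[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"   # file written in little-endian byte order
    elif magic == 0xD4C3B2A1:
        endian = ">"   # file written in big-endian byte order
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    offset = 24        # skip the global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        offset += 16
        frame = data[offset:offset + incl_len]
        if len(frame) < incl_len:
            break      # truncated final record
        offset += incl_len
        yield ts_sec + ts_usec / 1e6, frame
```

The yielded frames are raw link-layer bytes; decoding Ethernet, IP and TCP headers is left to a dedicated parser.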

8.3) Automation of Traffic Generation and Capture
In order to generate and capture the WhatsApp communication traffic with minimum manual user intervention, the generation and capture process is automated. This allows a large collection of traffic traces to be gathered in a shorter period and makes the traffic generation and capture process uniform and less cumbersome. The Macro Recorder [15] tool is used to automate the traffic generation process. Macro Recorder is a record and playback automation tool for Windows that records mouse and keyboard actions and replays them a specified number of times. The tool is mainly used for software testing and system maintenance, but in this study it serves the purpose of automating the traffic generation and collection process. Macro Recorder version 1.0.58f is used in this paper.

8.4) Traffic Filtering
After the generation and capture of network packets, the captured traffic trace file may still contain packets that belong to non-WhatsApp applications. This is due to several undetected operating system background applications. This noisy traffic needs to be removed from the captured traffic trace file. Therefore, the traffic is filtered to ensure that all packets in the captured traffic trace file are relevant to the WhatsApp Web communication only. It is to be noted that WhatsApp uses the SSL/TLS protocol for packet transfer. The filtering is achieved through Wireshark such that all SSL traffic is retained. In addition, it is confirmed that either the source IP or the destination IP of each packet in the traffic trace file belongs to the pool of WhatsApp server IPs [16]. The communication between a sender and a receiver always occurs with a server as the medium. All the traffic generated by the sender's activities is first passed to the server and then reaches the receiver from the server. This server IP can be used to separate the WhatsApp traffic traces from the non-WhatsApp SSL traffic traces for further analysis.
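The two filtering conditions above (SSL traffic only, and one endpoint in the WhatsApp server pool) can be expressed as a small predicate. The sketch below is an assumption-laden illustration: the address ranges are reserved documentation networks standing in for the real server pool, and the packet field names (`is_ssl`, `src_ip`, `dst_ip`) are hypothetical.

```python
import ipaddress

# Stand-in ranges (reserved documentation networks), NOT real WhatsApp servers.
HYPOTHETICAL_SERVER_POOL = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_pool(ip: str, pool) -> bool:
    """True if the address falls inside any network of the pool."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in pool)


def filter_whatsapp_packets(packets, pool):
    """Keep SSL packets whose source or destination is a pool address."""
    return [
        p for p in packets
        if p.get("is_ssl")
        and (in_pool(p["src_ip"], pool) or in_pool(p["dst_ip"], pool))
    ]
```

Checking membership against networks rather than a fixed list of addresses keeps the filter valid when the server pool is published as address ranges.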

9. RESULT ANALYSIS
The filtered WhatsApp Web traffic is analyzed to identify the presence of patterns in the packet stream and the user read signature.
Traffic pattern analysis: The WhatsApp Web traffic that has been captured contains only SSL packets, as WhatsApp uses SSL for communication. As SSL data is encrypted, inspecting the payload of SSL packets does not reveal any pattern. The information available in the SSL header is analyzed for multiple packets of a captured SSL traffic trace and across multiple traffic traces. An important parameter in the SSL header is the length field of the TLS Record Layer [17]. This length field is analyzed for consecutive packets of a captured WhatsApp traffic trace. The research findings from the traffic analysis are elaborated below:
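As a reminder of where this length field sits, a TLS record begins with a 5-byte header: a 1-byte content type, a 2-byte protocol version and a 2-byte length in network byte order. The sketch below reads the length of each record in a TCP payload. It is a simplified illustration that assumes the payload starts on a record boundary and does not reassemble records that span several TCP segments; the function name is hypothetical.

```python
import struct

# Content types defined for the TLS record layer:
# 20 change_cipher_spec, 21 alert, 22 handshake, 23 application_data
TLS_CONTENT_TYPES = {20, 21, 22, 23}


def tls_record_lengths(payload: bytes):
    """Return the length field of each TLS record header found in the payload."""
    lengths = []
    offset = 0
    while offset + 5 <= len(payload):
        content_type, _version, length = struct.unpack(
            "!BHH", payload[offset:offset + 5])
        if content_type not in TLS_CONTENT_TYPES:
            break  # not aligned on a TLS record, stop parsing
        lengths.append(length)
        offset += 5 + length  # jump over the encrypted record body
    return lengths
```

Only the header is read, so the values are available to an observer even though the record bodies remain encrypted.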

9.1) Media Transfer and Voice Messages:
Analysis of multiple media transfer and voice message traffic traces has revealed that WhatsApp Web uses TLSv1.2 and TLSv1.3 for these communications. Further inspection of the traffic traces using the TLSv1.3 protocol revealed a specific signature or pattern across the SSL length field. This SSL length field is cited as "SSL packet length" for the rest of the paper. It is observed that certain sequences of SSL packet lengths always occur in these captured WhatsApp traffic traces. A few of the length sequences are presented in Table 3, and they form the pattern for the identification of media transfer and voice messages.
The pattern identified always starts with the SSL packet length 69, which occurs between the 5th and 13th packets from the WhatsApp SSL client hello packet. This SSL packet length may be followed by the SSL packet lengths 26 and/or 30. x in the pattern represents any single SSL packet length. The regular expression for all possible patterns can be written as: The regex is written with respect to the PCRE format [18], which is a commonly used standard for regular expressions. It is also found that certain media transfer and voice message traffic traces use TLSv1.2. However, these traffic traces do not uncover any relevant pattern or other useful information.
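The stated constraints can be checked with a short helper. The sketch below is a loose check and not the paper's regular expression: it assumes the list of SSL packet lengths starts at the client hello (position 1), and the look-ahead window of three packets is an arbitrary choice made for illustration. Both function names are hypothetical.

```python
def find_pattern_start(ssl_lengths, start_len=69, first=5, last=13):
    """Return the 1-based position of start_len within [first, last], else None."""
    for pos in range(first, min(last, len(ssl_lengths)) + 1):
        if ssl_lengths[pos - 1] == start_len:
            return pos
    return None


def followers_present(ssl_lengths, pos, window=3, expected=(26, 30)):
    """List which expected lengths appear in the `window` packets after pos."""
    following = ssl_lengths[pos:pos + window]  # packets pos+1 .. pos+window
    return [v for v in expected if v in following]
```

A trace in which `find_pattern_start` returns a position and `followers_present` returns a non-empty list is a candidate for the media transfer or voice message pattern; the exact sequences remain those of Table 3.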

9.2) Text Messages and Contact Sharing:
Following the analysis of traffic traces of text messages and contact sharing, it is observed that both these activities use TLSv1.2 protocol. There is no presence of significant signatures or patterns in the SSL packet lengths.

9.3) Status:
Upon studying the traffic traces collected for media statuses, it is found that WhatsApp Web uses TLSv1.3 while viewing media statuses and follows the same pattern as mentioned in Table 3. It is to be noted that WhatsApp Web uses TLSv1.2 for statuses that have already been viewed by the user and no pattern is identified in the SSL packet lengths.

9.4) Location Sharing:
It is observed that unlike other user activities that use TLS, WhatsApp Web facilitates location sharing using GQUIC protocol. As the location is shared using Google Maps, the Google servers are accessed instead of WhatsApp. Therefore, the location is transmitted through Google server IPs.
Signature Identification: WhatsApp traffic is SSL encrypted; hence, the application payload data does not reveal any signature. However, inspecting the TLS record layer length field in all the capture files reveals signatures. It is to be noted that the signatures are identified only in the traffic traces using the TLSv1.2 protocol. The scenario is illustrated in Figure 3. The packets of the traffic traces are marked in Figure 3 [A, B, C]. They are found to be common in all the WhatsApp capture files, with the source being a WhatsApp server and the destination being the sender. Figure 4 depicts the SSL lengths of three consecutive packets for the scenario in which the receiver has seen the image sent by a sender and the sender has received the read receipt in the form of two blue check marks. This sequence of packet lengths forms a signature for the read receipt indicated by two blue check marks. This signature always occurs towards the end of the WhatsApp server to sender traffic trace. In each experiment, two significant signatures are found for each read receipt. Both signatures found during traffic analysis for all three read receipts are tabulated in Table 4.

CONCLUSION
The research conducted in this paper highlights the presence of patterns in encrypted WhatsApp Web communication. Network traffic analysis at the packet level reveals distinct patterns for various WhatsApp Web user activities. These patterns are revealed by analysis of the SSL packet lengths for the user activities of media transfer, statuses and voice messages in WhatsApp Web. The paper also proposes a methodology to identify signatures in SSL lengths to classify the various read receipts provided by WhatsApp. This work can be further extended to WhatsApp communication over the mobile application.

[Figure: Signature revealed as SSL packet length for two blue check marks]