A Survey of techniques for fine-grained web traffic identification and classification

After decades of rapid development, the scale and complexity of modern networks have far exceeded our expectations. In many situations, traditional traffic identification methods cannot meet the demands of modern networks. Recently, fine-grained network traffic identification has proved to be an effective solution for managing network resources, and its use in the communications industry has increased massively. In this article, we provide a comprehensive overview of fine-grained network traffic identification. Then, we conduct a detailed literature review on fine-grained network traffic identification from three perspectives: wired network, mobile network, and malware traffic identification. Finally, we draw conclusions on the challenges of fine-grained network traffic identification and on future research prospects.


Background
Web traffic refers to network traffic transmitted over the hypertext transfer protocol (HTTP) or HTTP over secure socket layer (HTTPS). Because of its high flexibility and strong expressiveness, HTTP/HTTPS has become the main protocol for newly emerging websites and applications. As a result, web traffic has been the main traffic on the Internet since the mid-1990s [1]. Although the proportion of web traffic was exceeded by that of P2P traffic at the beginning of this century [2,3], web traffic quickly surpassed P2P traffic again and contributes more than half of Internet traffic [4][5][6], for the following two reasons: (1) the P2P protocol is strictly controlled by network operators because of its excessive consumption of network resources; (2) the rapid rise of rich-media websites, represented by YouTube and Flickr, has led web traffic to grow continuously. In addition, most mobile Internet applications are built on the web framework for its fast development and cross-platform features, so that the ratio of web traffic in some mobile networks has reached 90% [7]. With the increase in smart mobile terminals, the growth rate of traffic in the mobile Internet far exceeds that of traditional wired networks [8], which will further increase the proportion of web traffic. As web traffic accounts for a large proportion of the traffic in various networks, treating all web traffic as a single category cannot meet the needs of network management. Therefore, it is urgent to identify web traffic at a fine granularity, with applications including traffic engineering, billing, service recommendation, and network planning and optimization. In this article, we provide a comprehensive overview of fine-grained network traffic identification. Then, we conduct a detailed literature review on fine-grained network traffic identification from three perspectives: wired network, mobile network, and malware traffic identification.
In recent years, fine-grained web traffic identification has attracted increasing attention from researchers, and related results have emerged continuously [8][9][10][11][12][13][14][15][16][17][18][19]. This article summarizes the work on fine-grained web traffic identification of the past few years, including the findings obtained and the challenges faced.

Related survey
A number of surveys on coarse-grained or fine-grained network traffic identification have been conducted. Previous studies, including the works of T. T. T. Nguyen et al. [20] and A. Callado et al. [21], surveyed the field of coarse-grained network traffic classification. With the widespread use of the HTTP(s) protocol in both wired and mobile networks, many research works concerning HTTP(s) traffic have emerged [22][23][24][25][26][27][28][29][30][31][32][33]. P. Velan et al. [25] focused on encrypted traffic classification and analysis, and D. Acarali et al. [26] studied HTTP-based botnet traffic. These surveys studied network traffic at both coarse and fine granularity. Besides, with the widespread use of encryption techniques in the network, deep learning (DL) and deep reinforcement learning (DRL) techniques are used to identify network traffic at a fine granularity. The works of G. Aceto et al. [32] and A. Shahraki et al. [33] surveyed the application of DL and DRL techniques to traffic engineering, respectively. Table 1 summarizes the existing related surveys.

Research scope and contributions
In this work, we conduct a comprehensive survey of fine-grained web traffic identification. First, we describe the scope of the fine-grained web traffic identification we are concerned with, and why the topic was chosen. Second, we provide a comprehensive overview of existing fine-grained web traffic identification. Then, recent research on fine-grained traffic identification is systematically surveyed from three perspectives: wired network, mobile network, and malware traffic. Finally, we summarize the existing challenges and future research perspectives for fine-grained web traffic identification. Our key contributions are as follows:
(ii) A detailed literature review on wired network, mobile network, and malware traffic.
(iii) Based on the systematic survey, we summarize the existing challenges and future perspectives for fine-grained web traffic identification, which may inspire future researchers.
A comprehensive overview of network traffic identification is provided in Section 2. Then, a detailed literature review of recent fine-grained web traffic identification is presented in Section 3 from three aspects: wired network, mobile network, and malware traffic. In Section 4, the evaluation criteria for fine-grained identification are presented. Our insights into the challenges and future perspectives of fine-grained web traffic identification are presented in Section 5. In Section 6, a conclusion is drawn. Due to the large number of acronyms involved, Table 2 lists the abbreviations used in this paper.

Introduction of network traffic identification
In the early days of the Internet, there were few services in the network, and each service was assigned a fixed port number by the Internet Assigned Numbers Authority (IANA). Therefore, most of the traffic at this stage could be identified by port number. With the development and popularization of the Internet, the number of services in the network kept growing. Emerging Internet applications do not necessarily use the service ports recommended by IANA, so identification based on ports alone gradually became ineffective [34]. To improve the accuracy of traffic identification, deep packet inspection gradually became popular. Deep packet inspection identifies a flow by detecting payload characteristics in its packets. Compared with port-number-based methods, it greatly improves identification accuracy [35]. With the development of network technology and the increasing attention paid to network security, more and more Internet content providers use encryption protocols to communicate, which reduces the accuracy of traditional port-number-based and payload-based traffic identification methods [36]. At the same time, research on using machine learning methods for traffic identification and classification has gradually increased [37][38][39][40][41][42][43]. More recently, deep learning algorithms have achieved important breakthroughs in the fields of image, speech, and text, and more and more researchers apply deep learning to traffic classification [15,44].
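The port-based identification described at the start of this subsection can be illustrated with a minimal Python sketch. The port table below is a hypothetical, deliberately tiny excerpt chosen for illustration (the real IANA registry is far larger), and the function name `identify_by_port` is our own.

```python
# Hypothetical excerpt of a port-to-service table (illustration only).
WELL_KNOWN_PORTS = {
    21: "ftp",
    22: "ssh",
    25: "smtp",
    53: "dns",
    80: "http",
    443: "https",
}


def identify_by_port(src_port: int, dst_port: int) -> str:
    """Return the service registered for either port, or 'unknown'."""
    # The destination port is checked first because it is usually the server side.
    for port in (dst_port, src_port):
        service = WELL_KNOWN_PORTS.get(port)
        if service is not None:
            return service
    return "unknown"
```

The sketch also shows why the approach degrades: an application listening on an unregistered port is reported as `unknown`, and every application carried over port 443 is reported as the same `https` category.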

Introduction of web traffic identification
Traditional traffic identification methods stop at assigning web traffic to a relatively coarse-grained category, without identifying the specific applications carried on it. In the modern network environment, the applications carried by web traffic go beyond simple web browsing and include multiple types of applications such as multimedia playback, web mail, and file downloads. Therefore, researchers have begun to divide web traffic in a more fine-grained way using a variety of methods. Following the taxonomy of traditional traffic identification, the related research on web traffic identification can be grouped into two categories: identification based on statistical features and identification based on packet analysis. Drawing on the traditional idea of traffic identification based on statistical characteristics, some researchers regard an HTTP session as a data stream similar to a TCP connection, extract statistical characteristics such as packet length, average arrival interval, and the number of HTTP requests, and use clustering or machine learning algorithms such as C4.5 and SVM to recognize web traffic [38][39][40][41][42]. Web traffic identification based on statistical characteristics does not rely on understanding the interactive content of web applications, so it is more adaptable to the identification of private protocols or encrypted traffic. However, due to the limitations of machine learning technology, web traffic can only be divided into rough categories, such as web video, web mail, file download, botnet, and ordinary web browsing, which makes it difficult to associate traffic with specific applications. Therefore, the current fine-grained identification of web traffic still mainly relies on methods based on packet analysis.
These methods mainly use the Host, URL, User Agent, ContentType, and other fields in the HTTP request and response, as well as the content to be transmitted, to identify web traffic [45][46][47][48]. Since the header fields of HTTP request and response messages are in readable text form, keyword matching or text pattern matching can be used to associate web traffic with known applications or service providers. To cope with the large number of web applications and their frequent changes, some researchers have also proposed methods for automatically extracting application fingerprints, which alleviate the over-reliance of packet-analysis-based identification methods on manual extraction of application features [49,50].
The identification object of web traffic refers to the input form of the content, which can be flow-level, packet-level, host-level, or session-level web traffic. The identification object is chosen according to the requirements of traffic identification. Among them, host-level and session-level objects are the most widely used. The session level mainly focuses on the characteristics of the session and its arrival process, such as the large amount of data returned in response to a video request, or the use of multiple sessions to serve one request. Session-level characteristics include the number of session bytes and the duration of the session. The host level mainly focuses on the connection pattern between hosts, such as all traffic communicating with a host, or all traffic communicating with a certain IP address and port of the host. Host-level characteristics include the connection degree, the number of ports, and so on. The flow level mainly focuses on the characteristics of the flow and its arrival process. An IP flow can be classified as a one-way flow or a two-way flow according to the transmission direction. The packets of a one-way flow all travel in the same direction; a two-way flow contains packets from both directions, and the connection may not end normally, for example because of a flow timeout. Sometimes a bidirectional flow is required to be a complete connection between the two hosts, from the initial SYN packet to the first FIN packet. Flow-level characteristics include flow duration, number of flow bytes, and so on. The packet level mainly focuses on the characteristics and arrival process of data packets. Packet-level features mainly include the packet size distribution and the packet inter-arrival time distribution. The type of web traffic identification refers to the output form of the identification result. The identification type is determined according to the requirements of traffic identification.
Web traffic can be progressively refined by attributes such as application, website, and protocol, finally realizing application identification, website identification, and abnormal traffic identification. We describe five kinds of identification in detail as follows. (1) Application identification identifies the application to which the traffic belongs, such as Google Mail or YouTube. (2) Website identification identifies the name of the website to which the traffic belongs. (3) Abnormal traffic identification identifies malicious traffic. (4) Encrypted and unencrypted traffic identification determines which traffic is encrypted; the rest is unencrypted. (5) Protocol identification identifies the encryption protocol used by encrypted traffic, such as SSL, SSH, or IPSec.
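As an illustration of header-based pattern matching, the following Python sketch parses the header fields of a plain-text HTTP request and matches them against a small fingerprint list. The fingerprints, labels, and function names are hypothetical examples rather than rules taken from the cited works, and the sketch assumes the Host value carries no port suffix.

```python
import re
from typing import Dict, Optional

# Hypothetical fingerprints: (lower-case header name, pattern, label).
FINGERPRINTS = [
    ("host", re.compile(r"(^|\.)youtube\.com$"), "YouTube"),
    ("host", re.compile(r"^mail\.google\.com$"), "Google Mail"),
    ("user-agent", re.compile(r"^curl/"), "curl"),
]


def parse_request_headers(raw_request: str) -> Dict[str, str]:
    """Parse 'Name: value' header lines that follow the request line."""
    headers: Dict[str, str] = {}
    for line in raw_request.split("\r\n")[1:]:
        if not line:  # an empty line ends the header section
            break
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def identify_application(raw_request: str) -> Optional[str]:
    """Return the label of the first matching fingerprint, or None."""
    headers = parse_request_headers(raw_request)
    for field, pattern, label in FINGERPRINTS:
        if pattern.search(headers.get(field, "").lower()):
            return label
    return None
```

A request whose Host field contains only an IP address matches none of the host patterns and is returned as `None`, which is the kind of traffic that keyword matching alone cannot label.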
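To make the flow-level identification object concrete, the sketch below groups packets into two-way flows and computes the flow-level characteristics named above (packet count, byte count, duration). The `Packet` record and the function names are assumptions made for illustration; real captures would also need timeout and SYN/FIN handling, which is omitted here.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Packet(NamedTuple):
    ts: float      # capture timestamp in seconds
    src: str       # source IP address
    sport: int     # source port
    dst: str       # destination IP address
    dport: int     # destination port
    proto: str     # transport protocol, e.g. "tcp"
    length: int    # packet length in bytes


def flow_key(p: Packet):
    """Direction-independent key, so both directions map to one two-way flow."""
    a, b = (p.src, p.sport), (p.dst, p.dport)
    return (min(a, b), max(a, b), p.proto)


def flow_features(packets: List[Packet]) -> Dict[tuple, Dict[str, float]]:
    """Aggregate packets per two-way flow and compute simple flow-level features."""
    flows = defaultdict(list)
    for p in packets:
        flows[flow_key(p)].append(p)
    features = {}
    for key, pkts in flows.items():
        pkts.sort(key=lambda p: p.ts)
        features[key] = {
            "packets": len(pkts),
            "bytes": sum(p.length for p in pkts),
            "duration": pkts[-1].ts - pkts[0].ts,
        }
    return features
```

Using the ordered pair of endpoints as the key gives a one-way flow instead; the sorted pair used here merges both directions.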

Problems of web traffic identification
In recent years, in research that uses packet analysis to identify network traffic, we have found that a large amount of web traffic has HTTP header fields with no clear meaning or with ambiguous semantics. Such web traffic cannot be identified with existing technology. To illustrate this problem intuitively, we compare one type of unknown web traffic with the overall traffic. We find that web traffic with an unknown IP address in the Host field accounts for 20.9% of the total, which is almost the same as the second-ranked business traffic. If we also consider unknown traffic with ambiguous fields such as Host and User Agent, the proportion of web traffic that cannot be processed by existing identification technology is even higher. Although these unknown flows cannot be identified by interpreting messages, they can be inferred and identified from the relationships between web browsing records. These relationships are closely related to the behavior of users clicking on web pages. Therefore, identifying users' click behavior is of great help in identifying unknown traffic. The latest research on user click recognition includes [9,10,51]. Reference [9] proposes the StreamStructure recognition method. This method combines time and file-type characteristics to divide web browsing records into blocks, and then takes the first HTTP request of each block as the user click. Reference [10] proposes the ReSurf method. Compared with [9], this method has two main innovations. First, it observes that the size of HTML documents is usually larger than V bytes. Second, after obtaining user access trajectories, it backtracks to the user's initial URL request in chronological order.
The innovations of [51] can be summarized as follows: (1) unlike [9,10], which separate user clicks from numerous URLs, [51] directly finds the URLs clicked by users from referrers; (2) it counts the number of referrers over all records, because a web page usually contains multiple embedded objects, so a URL clicked by a user should appear in the referrer of multiple records; (3) similar to [9], it uses the file extension of the requested URL to exclude URLs that are not user clicks, such as URLs with the file extensions .js, .css, and .swf; (4) because some web page advertisements are designed with an inline frame, whose objects look very similar to user clicks, the AdBlock library is used to filter out advertisement requests that are not user clicks. Although the above studies have achieved satisfactory results in their respective network environments, they share a common problem: they all use IP addresses to distinguish users. However, it is common for multiple users to share an IP address in a fixed network, so the research results have certain limitations. This article uses data from the mobile Internet to study users' click behavior. In the mobile Internet, each device has an IMEI number. We can separate users' traffic according to the IMEI number and map the flow records to users one-to-one.
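The referrer-counting idea attributed to [51] can be sketched roughly as follows. This is a simplified illustration and not the authors' implementation: the threshold `min_refs` is a hypothetical parameter, the record format (requested URL, referrer) is assumed, and the advertisement-filtering step is omitted.

```python
from collections import Counter
from typing import List, Tuple
from urllib.parse import urlparse

# Extensions named in the text as indicating non-click requests.
NON_CLICK_EXTENSIONS = (".js", ".css", ".swf")


def candidate_clicks(records: List[Tuple[str, str]], min_refs: int = 2) -> List[str]:
    """Return URLs that appear as the referrer of at least `min_refs` records.

    `records` is a list of (requested_url, referrer) pairs; an empty
    referrer string means the request carried no Referer header.
    """
    counts = Counter(ref for _, ref in records if ref)
    clicks = []
    for url, n in counts.items():
        path = urlparse(url).path.lower()
        if n >= min_refs and not path.endswith(NON_CLICK_EXTENSIONS):
            clicks.append(url)
    return sorted(clicks)
```

In this sketch, a page that triggers several embedded-object requests is kept as a click candidate, whereas a stylesheet that is itself the referrer of fonts or images is dropped by the extension filter.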

Introduction of network traffic datasets
With network security drawing more and more attention, more and more network traffic datasets have become publicly available. Table 3 lists some common datasets in the field of network traffic classification and identification. UNIBS is the telecommunication networks group of the University of Brescia. The traces from this group were mainly collected on the edge router of the campus network of the University of Brescia. CIC is the Canadian Institute for Cybersecurity. The datasets from this group mainly focus on intrusion detection. UMass is a trace repository maintained by the Laboratory for Advanced System Software, to which everyone can contribute. CAIDA conducts network research and builds research infrastructure to support large-scale data collection, curation, and data distribution to the scientific research community. WIDE is a traffic data repository maintained by the MAWI Working Group of the WIDE Project.

Fine-grained web traffic identification in wired network
Wired Internet connectivity is a mature service in many countries. According to the area served, wired networks can be divided into residential broadband networks, school networks, and work networks. All these kinds of networks offer rich access, which encourages users to involve the network closely in many aspects of their lives, from checking the weather or breaking news to shopping, banking, and communicating with family and friends. However, these networks differ from each other in use. For example, users of residential broadband connections often have more entertainment needs than users in a work environment, users of school broadband connections often have more study needs than users in other environments, and work broadband connections may be subject to strict acceptable-use policies that regulate access at work, such as prohibitions against accessing certain websites or using certain applications. The identification of web traffic in wired networks is of great significance to network operators' management of the network. At present, there are mainly four types of methods for web traffic identification in wired networks: (1) pattern matching based; (2) statistics based; (3) ML or DL based; (4) graph theory based.

Pattern matching based methods in wired network
Internet content providers usually use ports 80, 8080, and 443 as the access ports for websites. In addition, the HTTP header has unique fingerprint features, which can be used as a basis for traffic identification. Using a pattern matching method, [57] is the first paper to study residential broadband networks. Its authors describe observations from monitoring the network activity of residential DSL customers in an urban area and report a number of surprising results: for example, HTTP traffic, not peer-to-peer traffic, dominates by a significant margin, and more often than not the home user's immediate ISP connectivity contributes more to the round-trip times the user experiences than the WAN portion of the path.

Statistics based methods in wired network
In the early stage of research on fine-grained identification of web traffic, researchers usually described web traffic with the statistical characteristics of complete flows. Since these statistical features describe the complete flow, they can only be applied in offline recognition. Therefore, in recent years, the extraction of early features of traffic has become the focus of research. In a real application scenario, identification is meaningful only if the identifier can extract the characteristics of a flow in its early stage. Bernaille et al. [58] pointed out that the first few data packets are decisive in identifying the type of a flow. Generally speaking, the first few packets of a flow carry the negotiation process between the two communicating parties, and this negotiation process is completely determined by the application itself, which means that the first few packets have the most obvious application-specific characteristics. This discovery provides a technical basis for online real-time identification of web traffic. They then applied semi-supervised algorithms to the early recognition of Internet application traffic [59] and studied the early recognition of encrypted traffic in depth [60]. Este et al. [61] studied simple early characteristics of traffic and found that they contain rich information about application behavior. They applied mutual information and other methods to analyze the RTT, packet size, packet inter-arrival time (IAT), and packet direction of the first several packets of a flow. The analysis results show that packet size is the most effective feature of early traffic. These results provide an experimental basis for the early feature extraction and recognition of traffic. Huang et al.
also studied the early behavior characteristics of Internet applications, and conducted effective identification experiments based on these characteristics [62]. They further studied the conversation and negotiation behavior of different applications in the early stage of a flow, and used an ML model built on these early characteristics for early traffic recognition, achieving good recognition results [63]. Nguyen et al. [64] extended the concept of early recognition. They extracted statistical features from a small sequence of packets taken at any point in a flow, applied C4.5 decision trees and naive Bayes classifiers to online game and IP voice traffic, and obtained a high recognition rate. He Gaofeng et al. [65] proposed an identification method based on TLS fingerprints and message length distribution, and successfully applied it to the online identification of Tor anonymous traffic. Chen Liang et al. [66] extracted features from NetFlow records and applied them to high-speed flow identification. Dong Shi et al. [67] put forward an efficient flow identification model by studying the behavioral characteristics of traffic in terms of ports, message lengths, and flow record preference. These research works are of great significance to the early and rapid identification of web traffic.
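The early-feature idea discussed above, in which only the sizes and directions of the first few packets are used, can be sketched as follows. This is an illustrative simplification, not the procedure of any cited work: the choice of `n`, the signed-size encoding, and the skipping of zero-payload packets are all assumptions.

```python
from typing import List, Tuple


def early_features(packets: List[Tuple[str, int]], client: str, n: int = 5) -> List[int]:
    """Build a fixed-length vector from the first `n` payload-carrying packets.

    `packets` holds (source_ip, payload_length) pairs in arrival order.
    Client-to-server sizes are positive and server-to-client sizes negative,
    so the vector encodes both packet size and packet direction.
    """
    sizes: List[int] = []
    for src, length in packets:
        if length == 0:  # assumed: skip handshake packets and pure ACKs
            continue
        sizes.append(length if src == client else -length)
        if len(sizes) == n:
            break
    # Pad short flows with zeros so every flow yields a vector of length n.
    sizes += [0] * (n - len(sizes))
    return sizes
```

Because the vector is available after only `n` packets, it can be passed to a classifier while the flow is still in progress, which is what makes online identification possible.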

ML or DL based methods in wired network
In recent years, flow identification methods based on ML or DL have attracted more and more attention from researchers. These methods extract a series of independent statistical features of the payload from the network traffic, such as the number of packets, the number of bytes carried by the packets, the duration of the flow, and the average interval between packet arrivals. Researchers then use ML methods to train a recognition model that performs the subsequent traffic recognition. In these methods, network traffic is characterized by a series of traffic statistics features. A recognition model is obtained by training on known application traffic data and can then be used to identify unknown network traffic. From the perspective of data mining, ML can be divided into two classes, supervised learning and unsupervised learning, which correspond to classification and clustering techniques respectively. For supervised learning, a known data set for the target problem, called the training data set, must be provided first. It is used to train a classification model, such as a deep neural network (DNN), a support vector machine (SVM), or a decision tree. Training is generally an iterative process: the parameters of the model are continuously adjusted through stochastic optimization or analytical methods so that the model fits the training data set as closely as possible. After the model is trained, it can be used to identify unknown samples. This process is called testing or actual classification.
The description and extraction of features of traffic samples are the basic problems to be solved when using ML methods for traffic identification. At present, researchers use the statistical characteristics of flows to describe flow samples formally. The effectiveness of this approach rests on two assumptions: (1) the traffic of different applications has certain statistical characteristics at the network level, such as the duration of the flow, the idle time of the flow, the average packet inter-arrival time, and the packet length; (2) the traffic characteristics of each application are unique, so they can be used to distinguish different network applications. In the mid-1990s, Paxson [69] used the statistical characteristics of flows to identify a series of TCP network applications. Then, Dewes et al. [69] analyzed Internet chat systems through a series of statistical characteristics including the duration of the flow, the average packet inter-arrival time, and the packet size. A large number of subsequent studies [70,71] have shown that statistical characteristics are quite effective in network traffic identification. Although the statistical characteristics of flows can in theory also be obfuscated, doing so is very difficult in practice compared with techniques such as payload encryption. Reference [18] is a recent paper that studies network traffic classification using deep convolutional recurrent autoencoder neural networks. The authors find that the traffic classifier obtained by stacking the autoencoder with a fully-connected neural network achieves up to a 28% improvement in average accuracy over state-of-the-art machine learning-based approaches, which is a substantial improvement in the field of traffic classification.
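The train-then-classify workflow of supervised learning can be illustrated with a very small example. Note that the sketch uses a nearest-centroid classifier, which is swapped in here only because it fits in a few lines of standard-library Python; it is not one of the models (DNN, SVM, decision tree) named above. The feature vectors and labels are made-up illustrative values.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def train_centroids(samples: List[Tuple[Sequence[float], str]]) -> Dict[str, List[float]]:
    """Training: compute the mean feature vector (centroid) of each label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, x in enumerate(features):
            sums[label][i] += x
        counts[label] += 1
    return {label: [s / counts[label] for s in vec] for label, vec in sums.items()}


def classify(centroids: Dict[str, List[float]], features: Sequence[float]) -> str:
    """Testing: assign an unknown sample to the label with the closest centroid."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))
```

For example, after training on flows described by (packet count, mean packet length), an unseen flow is labeled by whichever class centroid lies nearest in Euclidean distance. In practice the features would be normalized first, since raw features with large ranges dominate the distance.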

Graph theory based methods in wired network
The traffic dispersion graph (TDG) is a commonly used method to represent network traffic. Each node in the graph is an IP address, and each edge represents a specific interaction between two nodes. When TDGs were first proposed, they were mainly used to solve network security problems, such as intrusion detection [72] and worm propagation [73,74]. In [74], TDGs are applied to the backbone network to study the interactions within the network, with the aim of automatically grouping and analyzing network applications using information about their degree and port distributions. Beyond these uses, TDGs could serve a wider range of functions and could be applied directly to traffic classification.
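A TDG of the kind described above can be built with a few lines of Python. The sketch below assumes that an edge is added whenever at least one flow is observed from a source IP to a destination IP; the edge definition and the function names are illustrative choices, since a TDG may use other interaction criteria.

```python
from typing import Dict, Iterable, Set, Tuple


def build_tdg(flows: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build a directed graph: node = IP address, edge = observed src -> dst flow."""
    graph: Dict[str, Set[str]] = {}
    for src, dst in flows:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())  # make sure pure receivers are nodes too
    return graph


def degrees(graph: Dict[str, Set[str]]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (in_degree, out_degree) for every node of the graph."""
    out_degree = {node: len(neighbors) for node, neighbors in graph.items()}
    in_degree = {node: 0 for node in graph}
    for neighbors in graph.values():
        for dst in neighbors:
            in_degree[dst] += 1
    return in_degree, out_degree
```

Degree distributions computed this way are the type of graph-level information that can separate, for instance, a server contacted by many clients (high in-degree) from a host that contacts many peers (high out-degree).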

Fine-grained web traffic classification in mobile network
In recent years, the rapid growth in the number of smartphone users has led to vigorous growth in the traffic volume of mobile networks. According to a prediction from Cisco, mobile data traffic will grow at a compound annual growth rate of 46% from 2017 to 2022 [75]. Similar to wired networks, mobile networks carry various types of network traffic, including web traffic, P2P traffic, and traffic based on other proprietary protocols, but research shows that web traffic is still the mainstream [12]. In some mobile networks, the ratio of web traffic even exceeds 90% [76]. In addition, new apps on mobile networks generally use HTTP to provide services to the public, further boosting the proportion of web traffic in mobile networks. Traditional methods such as port-based and payload-based methods can only identify web traffic at a coarse granularity. When web traffic accounts for up to 90% of total traffic, treating web traffic as a single type is a disadvantage for network operators' management. Therefore, fine-grained web traffic identification is meaningful for operators' network management, including traffic engineering, billing, service recommendation, and network planning and optimization. There are mainly three types of methods for web traffic identification in mobile networks: (1) pattern matching based; (2) statistics based; (3) ML or DL based.

Pattern matching based methods in mobile network
There are three research directions in fine-grained HTTP traffic classification in mobile networks using pattern matching: (1) classifying HTTP traffic into different applications (such as web browsing, e-mail, and streaming) [8]; (2) associating HTTP traffic with a specific website [13,77,78]; (3) describing and modeling HTTP traffic along the dimensions of operating system and device [79]. The first paper on fine-grained HTTP traffic is [8], which divides HTTP traffic into 14 categories according to different application activities. It was followed by several works that analyze HTTP traffic from different perspectives. The authors of [45] analyze the usage of HTTP-based applications on residential broadband Internet and find that HTTP traffic dominates the downstream traffic. On the basis of traffic similarity, the authors of [46] propose a classification scheme that can classify the various traffic types within a single application. Reference [79] presents a detailed measurement study of the HTTP traffic characteristics of a cellular network from the perspective of operating systems as well as device types. These measurement studies are helpful for network operators in managing their network resources.

Statistics based methods in mobile network
Statistics based methods have been widely used in web traffic classification. The authors of [77] studied websites in a cellular data network and obtained nine different traffic distributions over the day. Reference [80] describes the mobile Internet traffic generated by multiple operating systems. However, this work only analyzes the traffic dynamics and application usage within one day, so it cannot reveal characteristics over a billing cycle, which are essential for billing and network planning. In [81], the authors describe and model Internet traffic dynamics from two aspects: device type and application. This approach distinguishes two user markets, ordinary and commercial consumers, but is limited to a coarse-grained description of these two types of smartphone devices. Reference [82] focuses on understanding how, where, and when applications are used compared with traditional web services. However, the data sets used were collected in 2010, so some conclusions may no longer hold.

ML or DL based methods in mobile network
In 2016, Taylor et al. [83] proposed a classification method based on bursts of flows. For each of the two directions of transmission (obtained by swapping source and destination addresses), the packet size sequence of the flow is collected and its average value is computed. In total, there are 18 statistical features, such as the minimum, maximum, and quantiles. Finally, the support vector regression algorithm and the random forest algorithm are used to achieve a classification accuracy of 99%. In 2019, Shen et al. [84] proposed a recognition method for decentralized applications, which uses kernel functions to fuse features based on the statistical characteristics of two-way flows, followed by further feature selection, and finally achieved a classification accuracy of 92%. The main disadvantage of traffic classification methods based on machine learning is that they require expert experience to extract and filter features. Therefore, these methods are time-consuming and expensive, and are prone to human error. As a result, researchers have gradually turned to deep learning, which can learn features on its own.
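The per-direction packet size statistics used by such feature-engineered methods can be sketched as follows. This is an illustrative subset only: it computes six statistics per direction rather than the 18 features mentioned above, and the function names and the (source, length) packet format are assumptions.

```python
import statistics
from typing import Dict, List, Tuple


def size_statistics(sizes: List[int]) -> Dict[str, float]:
    """Summary statistics of one packet size sequence (zeros if it is empty)."""
    if not sizes:
        return {k: 0.0 for k in ("min", "max", "mean", "q25", "median", "q75")}
    ordered = sorted(sizes)
    if len(ordered) > 1:
        q25, median, q75 = statistics.quantiles(ordered, n=4, method="inclusive")
    else:
        q25 = median = q75 = float(ordered[0])
    return {
        "min": float(ordered[0]),
        "max": float(ordered[-1]),
        "mean": statistics.fmean(ordered),
        "q25": q25,
        "median": median,
        "q75": q75,
    }


def direction_features(packets: List[Tuple[str, int]], client: str) -> Dict[str, float]:
    """Compute the statistics separately for outgoing and incoming packets."""
    outgoing = [length for src, length in packets if src == client]
    incoming = [length for src, length in packets if src != client]
    features: Dict[str, float] = {}
    for name, seq in (("out", outgoing), ("in", incoming)):
        for key, value in size_statistics(seq).items():
            features[f"{name}_{key}"] = value
    return features
```

The resulting dictionary can be flattened into a fixed-order vector and passed to a conventional classifier such as a random forest.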
Deep-learning-based traffic classification methods fall into two categories, distinguished by the classifier input. Methods based on raw byte features take the raw byte content of packets as input. Methods based on sequence features take per-flow sequences, such as packet sizes and packet inter-arrival times, as input.
Deep Packet, proposed by Lotfollahi et al. [85], is representative of the raw-byte approach. It treats each packet as one input sample and uses only the raw bytes of the packet as features, so no expert-designed feature extraction is required. Its classification models are a one-dimensional convolutional neural network (1DCNN) and a stacked autoencoder (SAE), and it achieved a classification accuracy of 98%. Wang et al. [3] used the first 784 bytes of each flow (unidirectional or bidirectional) as model input and compared a 1DCNN with a two-dimensional convolutional neural network (2DCNN). The experiments showed that the 1DCNN performs better, with an accuracy above 90%. Li et al. [86] introduced recurrent neural networks (RNNs) into network traffic classification and designed the byte segment neural network (BSNN), which takes packets directly as model input. In experiments on classifying 5 protocols, the average F1-score of BSNN was about 95.82%. Xie et al. [87] proposed SAM, a flow classification method based on a self-attention mechanism that uses the raw bytes of each packet header as model input. It achieved average F1-scores of 98.62% and 98.93% in protocol recognition and application recognition, respectively.
FS-Net, proposed by Liu et al. [88], is representative of the sequence-feature approach. It uses the packet size sequence of a flow as its sequential feature and adds an auto-encoder reconstruction mechanism, which drives the model to learn features that are both useful for classification and representative of the flow. Its final classification accuracy is as high as 99%. Lopez-Martin et al. [89] built a 20×6 matrix from attributes of the first 20 packets of a flow, including port number, payload length, inter-arrival time, and window size, and fed it to a combined model of a convolutional neural network (CNN) and a long short-term memory (LSTM) recurrent neural network, reaching a final accuracy above 96%. Shapira et al. [90] converted each unidirectional flow into an image according to its packet sizes and packet arrival times and classified the image with a CNN. The final classification accuracy reached 99.7%.
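The two input styles described above can be illustrated with a short sketch. It is not the preprocessing of any cited paper: the 784-byte length follows the byte-based setup mentioned above, while the function names, the zero-padding, the scaling to [0, 1], and the two-column (size, inter-arrival time) layout are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

FLOW_BYTES = 784   # number of leading flow bytes kept (28 x 28)
SEQ_PACKETS = 20   # number of leading packets kept per flow


def flow_to_byte_input(payloads: Sequence[bytes], length: int = FLOW_BYTES) -> List[float]:
    """Raw-byte input: concatenate packet bytes, truncate or zero-pad, scale to [0, 1]."""
    raw = b"".join(payloads)[:length]
    raw = raw + b"\x00" * (length - len(raw))
    return [b / 255.0 for b in raw]


def flow_to_sequence_input(
    packets: Sequence[Tuple[float, int]], n: int = SEQ_PACKETS
) -> List[Tuple[int, float]]:
    """Sequence input: (packet size, inter-arrival time) rows for the first n packets.

    `packets` holds (timestamp_seconds, size_bytes) pairs in arrival order.
    Flows shorter than n packets are padded with (0, 0.0) rows (an assumption).
    """
    rows: List[Tuple[int, float]] = []
    prev_ts = None
    for ts, size in packets[:n]:
        gap = 0.0 if prev_ts is None else ts - prev_ts
        rows.append((size, gap))
        prev_ts = ts
    rows.extend([(0, 0.0)] * (n - len(rows)))
    return rows
```

The first function yields a fixed-length vector suitable for a 1D or reshaped 2D convolutional model. The second yields a fixed-size table of per-packet attributes, which could be extended with further columns such as port number or window size.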

Malware web traffic identification
As with personal computers, the widespread use of mobile devices has attracted the interest of malware developers. Among mobile devices, smartphones are ideal targets for attackers because: (1) they are ubiquitous, so the number of potential targets is large; (2) they hold sensitive information about their owner, such as identity, contacts, and GPS location; (3) they have network capabilities and are usually connected to the Internet. We define malware detection as determining whether a mobile application is malicious by analyzing the network traffic it generates. In mobile networks, malicious traffic detection therefore usually amounts to detecting malicious apps.
In June 2004, the first known smartphone malware came to public attention. Named Cabir, it targeted Symbian OS [94] and propagated through Bluetooth. In fact, between 2004 and 2007, more than 95% of malware targeted Symbian OS [92]. Since then, Android and iOS devices have become increasingly common, and malware targeting these operating systems has emerged as well. In July 2008, the first self-propagating worm for the iPhone was detected, named Ikee [93]. The worm attacked only jailbroken iPhones with an SSH server installed and the default root password unchanged, and it posed little threat to users. Then, in 2009, a variant named Ikee.B was found, which formed the first botnet with clearly malicious behavior. In 2010, the first malware targeting Android was discovered, named FakePlayer [94]. Between 2012 and 2014, more than 90% of the malware detected targeted the Android platform [95]. As of March 31, 2015, nearly 4000 mobile malware families and variants had been identified [96]. Between 2012 and 2014, the number of malware identified per quarter was about 200 [95].
Protecting smart devices from malware attacks is also a hot research topic [97][98][99]. There are two main approaches to identifying malware: static analysis and dynamic analysis. Dynamic analysis executes the sample and observes its behavior in practice, while static analysis examines the software based on its code without executing it. Signature-based static analysis can only detect malware whose signatures are already available, so it is ineffective against polymorphic and metamorphic code. Literature [100] points out that commonly used static analysis detects only 20.2% to 79.6% of malware. Dynamic analysis is more promising because it can use multiple behavioral characteristics to analyze samples. For example, a Trojan horse typically needs to invoke multiple system processes, so literature [101,102] proposed methods to detect Trojan horses through system behavior analysis. In addition, traffic characteristics are very useful for identifying malware that spreads through the network. For instance, literature [103] builds a model based on traffic characteristics to identify fast-flux botnet attacks, and literature [104] uses traffic characteristics to detect a new class of active worms.
XcodeGhost's servers and clients communicate with each other over the Internet. As a result, a large amount of XcodeGhost-related traffic appears in our collected data. This vantage point distinguishes our work from the works in [105,106]. Analyzing the collected data reveals features of XcodeGhost, which may help in studying XcodeGhost or in identifying similar malware.
Literature [107] introduced a malware detection application for the Android environment. The application monitors multiple aspects of the device, such as memory, network, and power, and extracts different features, some of which are related to network traffic, such as the number of packets received. A classifier is then trained on the collected statistical feature samples and used to check whether an installed application is malicious. The article evaluated the model with 40 benign and 4 malicious Android applications and achieved good results.
Besides, with the development of network theory and technology, new ideas have emerged that tolerate network attacks [108] or resist malicious attacks on network terminals from the ground up in network design [109,110]. For example, the authors in [108] proposed transmitting data over multiple paths to avoid network attacks, and reference [109] proposed a smart collaborative balance scheme that dynamically adjusts network functions and can effectively resist malicious attacks from terminals.

Evaluation criteria for web traffic classification
At present, the evaluation of traffic identification and classification mainly relies on accuracy-related indicators, which is a rather narrow basis. To meet ever-increasing traffic analysis requirements, comprehensive indicators of compatibility, robustness, completeness, and directionality are introduced in addition to the accuracy-related indicators. The evaluation indicators of network traffic identification and classification are introduced in detail below.

1) Accuracy
Accuracy reflects the ability of a traffic identification technique to identify network applications. Let N be the number of traffic samples, m the number of application types, and $n_{ij}$ the number of samples of actual type i that are labeled as type j. Four counts are defined for each type i. True positive (TP) is the number of samples of actual type i that are correctly labeled, $TP_i = n_{ii}$. False positive (FP) is the number of samples whose actual type is not i but that are incorrectly identified as type i, $FP_i = \sum_{j \neq i} n_{ji}$. False negative (FN) is the number of samples of actual type i that are misidentified as other types, $FN_i = \sum_{j \neq i} n_{ij}$. True negative (TN) is the number of samples whose actual type is not i and that are labeled as a type other than i, $TN_i = \sum_{j \neq i} \sum_{k \neq i} n_{jk}$.
Built from these counts, the confusion matrix is a clearer way to describe classification results. It shows where the classification model is confused when it makes predictions. The confusion matrix contains TN, FP, FN and TP. For a two-class problem, the content of the confusion matrix is shown in Table 4. According to the above analysis and Table 4, the precision is defined as follows.
$$ precision_i = \frac{TP_i}{TP_i + FP_i} \tag{4.1} $$
The recall rate is defined as follows.
$$ recall_i = \frac{TP_i}{TP_i + FN_i} \tag{4.2} $$
Similarly, the true negative rate (TNR) is the proportion of actual negatives that are correctly predicted as negative. This metric is also called specificity and is defined as follows.
$$ TNR_i = \frac{TN_i}{TN_i + FP_i} \tag{4.3} $$
In mathematics, the geometric mean indicates the central tendency of a set of n numbers by taking the n-th root of their product. In network traffic classification, it is used to balance sensitivity (recall) and specificity (TNR) at the same time. The definition of g-mean is shown in Eq (4.4).
$$ g\text{-}mean_i = \sqrt{recall_i \times TNR_i} \tag{4.4} $$
The precision rate and recall rate reflect how well the identification method recognizes each individual protocol category. Especially when the sample categories are unevenly distributed, recall and precision reveal how well each category is classified. The accuracy rate reflects the overall performance of the identification method. A good algorithm should have a high accuracy rate, precision rate, and recall rate at the same time. The accuracy is defined as follows; in the two-class case it equals (TP + TN) / (TP + TN + FP + FN).
$$ accuracy = \frac{\sum_{i=1}^{m} TP_i}{N} \tag{4.5} $$
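F-Measure is an evaluation index that combines precision and recall. The higher the F-Measure, the better the classification performance of the algorithm on each type. In its commonly used balanced form, it is the harmonic mean of precision and recall.
$$ F\text{-}Measure_i = \frac{2 \times precision_i \times recall_i}{precision_i + recall_i} \tag{4.6} $$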
Besides, top-k accuracy is an important evaluation index that measures the classification accuracy over the k categories with the largest number of samples.
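The accuracy-related indicators in this subsection can all be derived from a confusion matrix. The following is a minimal sketch, assuming the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), TNR = TN/(TN+FP), g-mean as the square root of recall times TNR, and the balanced F-Measure). The function names and the example matrix are hypothetical and not taken from the surveyed works.

```python
from math import sqrt
from typing import Dict, List

Matrix = List[List[int]]  # n[a][b]: samples of actual type a labeled as type b


def _ratio(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def class_counts(n: Matrix, i: int) -> Dict[str, int]:
    """TP, FP, FN and TN for type i, treating every other type as negative."""
    m = len(n)
    total = sum(sum(row) for row in n)
    tp = n[i][i]
    fp = sum(n[j][i] for j in range(m) if j != i)
    fn = sum(n[i][j] for j in range(m) if j != i)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": total - tp - fp - fn}


def class_metrics(n: Matrix, i: int) -> Dict[str, float]:
    """Per-class precision, recall, TNR, g-mean and balanced F-Measure."""
    c = class_counts(n, i)
    precision = _ratio(c["TP"], c["TP"] + c["FP"])
    recall = _ratio(c["TP"], c["TP"] + c["FN"])
    tnr = _ratio(c["TN"], c["TN"] + c["FP"])
    return {
        "precision": precision,
        "recall": recall,
        "tnr": tnr,
        "g_mean": sqrt(recall * tnr),
        "f_measure": _ratio(2 * precision * recall, precision + recall),
    }


def accuracy(n: Matrix) -> float:
    """Overall accuracy: correctly labeled samples over all samples."""
    total = sum(sum(row) for row in n)
    return _ratio(sum(n[i][i] for i in range(len(n))), total)


# Hypothetical two-class confusion matrix, used only for illustration.
confusion = [[8, 2], [1, 9]]
print(class_counts(confusion, 0))
print(class_metrics(confusion, 0))
print(accuracy(confusion))
```

Computing the counts per type in a one-versus-rest manner keeps the same code usable for the two-class case of Table 4 and for the general case with m application types.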

2) Completeness
Completeness reflects the coverage of the identification method. It is the ratio of the number of samples identified as type i to the number of samples of actual type i, which is equivalent to the ratio of the recall rate to the precision rate, so its value may exceed 1. Completeness is defined as follows.
$$ completeness_i = \frac{TP_i + FP_i}{TP_i + FN_i} \tag{4.7} $$

3) Unrecognized rate
The unrecognized rate reflects the ability of the traffic identification tool to identify unknown traffic types. Unrecognized rate refers to the ratio of traffic that does not belong to a known traffic type to the total traffic. $F_{total}$ represents the total number of bytes or flows of traffic, and $F_{known}$ represents the number of bytes or flows of identified traffic.
$$ unrecognized = \frac{F_{total} - F_{known}}{F_{total}} \tag{4.8} $$
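Completeness and the unrecognized rate are simple ratios, and a short sketch makes their ranges concrete. The function names and sample numbers below are hypothetical; completeness is taken as (TP + FP) / (TP + FN), the number of samples identified as a type over the number of samples actually of that type.

```python
def completeness(tp: int, fp: int, fn: int) -> float:
    """Samples identified as type i divided by samples of actual type i."""
    return (tp + fp) / (tp + fn)


def unrecognized_rate(f_total: float, f_known: float) -> float:
    """Share of traffic (in bytes or flows) that was not identified."""
    return (f_total - f_known) / f_total


# Over-identification pushes completeness above 1: 12 samples are labeled
# as type i although only 10 samples actually belong to it.
print(completeness(tp=9, fp=3, fn=1))
print(unrecognized_rate(f_total=1000, f_known=791))
```

A completeness close to 1 does not by itself imply correct identification, since false positives can offset false negatives. This is why it is reported alongside precision and recall.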

4) Robustness
Robustness reflects the ability of a traffic identification technique to maintain high identification performance over a long period. $acc_k$ represents the accuracy rate of period k, $acc_0$ represents the initial accuracy, and r is the number of periods. Robustness is measured as the average accuracy drop over the r periods, so a smaller value indicates a more robust method.
$$ robustness = \frac{\sum_{k=1}^{r} (acc_0 - acc_k)}{r} \tag{4.9} $$

5) Compatibility
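Compatibility reflects the ability of a traffic identification technique to be used in different network environments. $acc_j$ represents the accuracy in network environment j, $\overline{acc}$ represents the average accuracy over all environments, and m here denotes the number of network environments. Compatibility is measured as the mean absolute deviation of the per-environment accuracies from their average, so a smaller value indicates more consistent performance across environments.
$$ compatibility = \frac{\sum_{j=1}^{m} \left| acc_j - \overline{acc} \right|}{m} \tag{4.10} $$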
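Robustness and compatibility can be sketched in the same way. The code below assumes that robustness is the average accuracy drop relative to the initial accuracy, and that compatibility is the mean absolute deviation of per-environment accuracies from their average. The absolute value is an assumption of this sketch, since a plain signed deviation from the mean always sums to zero. All names and numbers are hypothetical.

```python
from typing import Sequence


def robustness(acc_0: float, acc_periods: Sequence[float]) -> float:
    """Average drop in accuracy over later periods relative to acc_0."""
    return sum(acc_0 - acc_k for acc_k in acc_periods) / len(acc_periods)


def compatibility(acc_envs: Sequence[float]) -> float:
    """Mean absolute deviation of per-environment accuracy from the average."""
    mean_acc = sum(acc_envs) / len(acc_envs)
    return sum(abs(acc_j - mean_acc) for acc_j in acc_envs) / len(acc_envs)


# A classifier trained once and re-evaluated in three later periods.
print(robustness(0.95, [0.94, 0.92, 0.90]))
# The same classifier evaluated in three different network environments.
print(compatibility([0.90, 0.80, 0.70]))
```

Under these definitions both indicators are "lower is better": a value near zero means the accuracy neither decays over time nor varies much between environments.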

6) Evaluation index
In addition, some evaluation indicators, such as real-time performance, directionality, and computational complexity, are still difficult to quantify. Real-time performance reflects the ability of the traffic identification method to identify network applications online and quickly. An application can be identified in time by using the characteristics of a few data packets rather than waiting for the end of the entire flow.
Directionality reflects the ability of the traffic identification method to handle different flow transmission directions. An IP flow can be treated as a unidirectional flow or a bidirectional flow, and unidirectional flows can be further divided into upstream and downstream according to the transmission direction. If the first data packet of a flow is lost, the upstream and downstream directions cannot be determined. Directionality can therefore be evaluated on unidirectional flows (upstream, downstream) or on bidirectional flows.
The computational complexity reflects the overhead required by the traffic identification method to accurately identify network applications. Complex identification features consume a lot of storage space and computing power, which seriously affects the traffic analysis of the backbone network. Computational complexity can be embodied in time and space complexity.

Discussion
In summary, there are still many problems to be solved in the field of traffic classification. In the future, scientific research can be carried out from the following aspects.
1) Fine-grained unknown web traffic identification
We found that web traffic with an unknown IP address in the Host field accounted for 20.9% of the total traffic in the backbone. Identifying this unknown traffic remains a major challenge.
2) Fine-grained identification of encrypted traffic
With the increasing demand for fine-grained traffic identification, it is far from enough to identify whether traffic is encrypted. In actual network management scenarios, network operators need to identify the applications or services carried under an encryption protocol or tunneling protocol. To achieve fine-grained recognition, multi-stage progressive recognition and hybrid methods are better solutions: each stage completes a different identification task, or different algorithms are combined to identify different applications.
3) Application recognition under SSL protocol
To ensure the security of communication, an increasing number of network applications use the SSL protocol. SSL is widely used in web browsing, video streaming, social networks, and other services, so the set of applications running over SSL has become increasingly complex. The SSL protocol is highly effective in protecting user data and privacy, but it also raises the difficulty of traffic identification to a new level. How to identify network applications under the SSL protocol has become a challenge for current network management.
4) Encrypted video content information recognition
As video services become more widely used and the proportion of video traffic continues to increase, network operators and video service providers need to know the current quality of experience of video services in order to improve video QoS. YouTube, the most commonly used video website, encrypts more than 90% of its traffic, and more and more video websites are adopting encryption. With encryption, it is difficult to obtain the parameters related to the quality of experience of video services, such as the playback bit rate. Therefore, identifying the bit rate of encrypted video is of great significance for evaluating and improving QoS.
5) Accurate marking of encrypted traffic data sets
In recent years, new algorithms and techniques with good classification performance have been proposed. However, they cannot be compared with each other because the collected network traffic always differs. Most public data sets contain neither payload nor label information, and the payload of encrypted traffic is difficult to label even with DPI tools. Therefore, some researchers have to label traffic with filtering rules based on common port numbers, which leads to inaccurate benchmarks. In addition, fine-grained identification of encrypted traffic requires labeling the different applications running under an encryption protocol, which makes labeling even more difficult. Self-generated data sets mainly obtain labels by monitoring the host kernel or by DPI. Although label information is relatively easy to obtain for self-generated data sets, results obtained by different researchers on their own data sets are not comparable across algorithms. Therefore, it is urgent to build labeled data sets for various traffic classification tasks.
6) Traffic masquerading
Identification based on flow characteristics is the most widely used approach for encrypted traffic identification. Consequently, the corresponding flow pattern disguising techniques, such as flow padding, flow standardization, and flow masking, are constantly being studied. Wright et al. [111] proposed a convex optimization method for real-time modification of data packets, which disguises the packet size distribution of one traffic class as that of another. The transformed traffic can effectively evade traffic classifiers in tasks such as VoIP and web page identification. In the future, traffic masquerading will integrate multiple methods such as padding, standardization, and masking to counter traffic analysis, and its diversity and adaptive capabilities will be greatly enhanced. In addition, anonymous communication, tunneling, and proxy technologies are all different manifestations of traffic masquerading. Anonymous communication prevents tracking by hiding identity information and communication relationships. Tunneling technologies such as L2TP and SSTP re-encapsulate data packets in other protocols, and data compression proxies, which compress data to save traffic, change the statistical characteristics of flows. Therefore, it is necessary to improve current identification methods to cope with these upcoming challenges.
7) New protocols and changes in traffic distribution
Application protocols are continuously improved and optimized, and new versions are sometimes developed specifically to hinder traffic identification, so protocol signatures and behavior characteristics change accordingly. Therefore, the original identification methods need to be updated periodically.
As the public's demands for network security and network performance increase, new protocols with built-in encryption, such as SPDY, HTTP/2, and QUIC, are constantly being introduced to overcome the bottlenecks of existing TCP- and UDP-based protocols and to achieve low-latency, highly reliable, and secure network communication. In the near future, HTTP/2 and QUIC will be widely used, and identifying the applications carried over these protocols poses new challenges.
In addition, DL-based methods have been widely used in encrypted traffic classification and have made great progress [90,112,113]. To reduce the high training cost of general DL models, the authors in [114] proposed incremental learning techniques that add new classes to a model without full retraining. These techniques save a large amount of computation and allow the model to be adjusted automatically as new data arrive. With the rise of the Internet of Things, the identification and classification of IoT traffic is also an important future research direction [115,116]. Another emerging trend in ML/DL-based traffic classifiers is explainable AI. All these emerging techniques will help to overcome the challenges of new protocols and changes in traffic distribution.
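The effect of traffic padding, mentioned above under traffic masquerading, on size-based features can be illustrated with a very simple sketch. It is not the convex optimization method of [111]: it only rounds each packet size up to the next value in a fixed bucket list, so that flows with different original sizes expose the same size sequence. The bucket values and function names are hypothetical.

```python
from typing import List, Sequence

# Hypothetical padding buckets in bytes; a real system would choose these
# to trade bandwidth overhead against how much size information is hidden.
BUCKETS = (128, 256, 512, 1024, 1500)


def pad_to_bucket(size: int, buckets: Sequence[int] = BUCKETS) -> int:
    """Round a packet size up to the smallest bucket that can hold it."""
    for bucket in buckets:
        if size <= bucket:
            return bucket
    return size  # larger than every bucket: left unchanged in this sketch


def pad_flow(sizes: Sequence[int]) -> List[int]:
    """Apply bucket padding to every packet size in a flow."""
    return [pad_to_bucket(s) for s in sizes]


def padding_overhead(sizes: Sequence[int]) -> float:
    """Fraction of extra bytes added by padding, relative to the original."""
    original = sum(sizes)
    return (sum(pad_flow(sizes)) - original) / original


# Two flows with different packet sizes become indistinguishable by size.
print(pad_flow([60, 300, 1400]))
print(pad_flow([100, 500, 1100]))
```

A classifier that relies on packet size sequences would see identical inputs for both flows after padding, which is why identification methods may need features beyond packet sizes to remain effective against masquerading.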

Conclusions
In this paper, we present an up-to-date survey on fine-grained web traffic identification. We first give a comprehensive overview of fine-grained web traffic identification. Then, we review recent research on fine-grained web traffic identification from three aspects: wired network, mobile network, and malware traffic identification. Finally, we summarize the challenges and future perspectives on the basis of our systematic survey. The detailed literature review and in-depth investigations may inspire further efforts to improve fine-grained web traffic identification.