Abstract

With the rapid development of the Internet, especially the mobile Internet, new applications and network attacks have emerged at a high rate in recent years. More and more traffic becomes unknown because protocol specifications for newly emerging applications are lacking. Automatic protocol reverse engineering is a promising solution for understanding this unknown traffic and recovering its protocol specification. One challenge of protocol reverse engineering is to determine the length of protocol keywords and message fields. Existing algorithms select the longest substrings as protocol keywords, which is an empirical way to decide keyword length. In this paper, we propose a novel approach that determines the optimal length of protocol keywords and recovers the message formats of Internet protocols by maximizing the likelihood probability of message segmentation and keyword selection. A hidden semi-Markov model is presented to model the protocol message format. A clustering technique based on the affinity propagation mechanism is introduced to determine the message type. The proposed method is applied to real network traffic, and the results are compared with those of existing algorithms.

1. Introduction

Network protocol specifications, which describe the structure of protocol messages and regulate the behaviors of communication entities on the Internet, play an important role in addressing a number of security and management issues in computing and networking. For example, intrusion detection systems and firewall systems require protocol specifications to perform deep packet inspection. Security experts study the specifications of command & control (C&C) protocols [1] to detect and defend against botnets. Network administrators build application signatures based on protocol specifications to identify protocols and tunnels in monitored network traffic. Fuzz tests [2] make use of protocol specifications to reduce the number of fault-inserted files while still maintaining the maximum test case coverage. Protocol specifications are also powerful tools for enabling interoperation between multiple systems based on incompatible protocols [3–5].

A complete specification comprises both the protocol message format and the protocol state machine. The former reveals the protocol syntax, which governs how the different types of messages exchanged between communication entities are constructed, while the latter formulates the behaviors of protocol entities during the whole process of communication, such as the order in which different types of messages should be sent or received. For open protocols, like HTTP and FTP, the specifications can be obtained from published documents. However, for proprietary protocols used by enterprises or hackers, the specifications are not published for commercial or security reasons. Moreover, more and more new protocols and mobile applications emerge every day due to the rapid development of the mobile Internet and the unprecedented popularity of smart phones [6]; network administrators need to know the specifications of these protocols or applications to monitor the network traffic, yet there is no public documentation about them. For such proprietary protocols and newly emerging mobile applications, protocol reverse engineering is widely regarded as the only available option for recovering the specification.

Traditionally, protocol reverse engineering is performed by manual analysis, which is time-consuming and error-prone. For example, the Samba project has taken over 12 years to manually recover the specification of SMB/CIFS [3]. In the Pidgin project [4], the plug-ins have to be patched whenever the target protocol changes, and the delay between a protocol change and a working patch, caused by reverse engineering, can be months. To address these problems, automatic protocol reverse engineering has been proposed over the last decade and has become a hot topic in the research field of network traffic analysis.

Automatic protocol reverse engineering is the process of recovering protocol message formats and inferring the protocol state machine without access to the specification of the target protocol. Generally, automatic protocol reverse engineering can be divided into the network trace based approach and the binary analysis based approach. The network trace based approach takes a captured network trace as input and reconstructs message formats by identifying basic components, such as message fields or protocol keywords, using techniques from the fields of data mining, bioinformatics, natural language processing, and so on. The binary analysis based approach observes how the executable binary implementing the target protocol uses memory and registers at runtime to process received messages or construct sent messages. The former approach is easy to deploy and relies only on the network trace generated by the target protocol, while the latter is useful in scenarios where the executable binary is available and can be run in a controlled environment.

In this paper, we focus on recovering the message formats from network trace using the network trace based approach. Our goal is to identify the location of message fields and determine the length of protocol keywords. The message format is comprised of message fields. Some fields (called keyword fields) contain the protocol keywords. The protocol keywords are some constants or commands used by network protocol. For example, “GET”, “HTTP”, and “POST” are some protocol keywords used by HTTP protocol.

The first challenge in our research is to determine the length of protocol keywords. Previous works [7–12], which are based on the longest common subsequence (LCS) criterion, select the longest frequent substrings as protocol keywords. For example, if “G”, “E”, “T”, “GE”, “ET”, and “GET” are frequent substrings, “GET” will be chosen as the protocol keyword, since it is the longest substring. However, if the frequency threshold is low enough, “GET abc” (“abc” is a string that follows “GET”) will become a frequent string, so “GET abc” will be chosen as the protocol keyword, while the true keyword “GET” will be dropped. Therefore, it is not rational to simply choose the longest frequent substrings as protocol keywords.
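The threshold sensitivity described above can be reproduced on a toy trace. The following is a minimal sketch, not any of the cited systems: the four messages, the per-message counting rule, and the helper names `frequent_substrings` and `longest_with_prefix` are all assumptions made for illustration.

```python
from collections import Counter

def frequent_substrings(messages, min_count, max_len=12):
    """Return substrings that occur in at least `min_count` messages."""
    counts = Counter()
    for msg in messages:
        # Count each distinct substring once per message.
        seen = {msg[i:i + n]
                for n in range(1, max_len + 1)
                for i in range(len(msg) - n + 1)}
        counts.update(seen)
    return {s for s, c in counts.items() if c >= min_count}

def longest_with_prefix(substrings, prefix):
    """Pick the longest frequent substring that starts with `prefix`."""
    return max((s for s in substrings if s.startswith(prefix)), key=len)

# Hypothetical toy trace: "GET" is the only true keyword.
messages = ["GET abc", "GET abc", "GET xyz", "GET/q"]

strict = longest_with_prefix(frequent_substrings(messages, min_count=4), "GET")
loose = longest_with_prefix(frequent_substrings(messages, min_count=2), "GET")
print(repr(strict), repr(loose))
```

With the strict threshold only “GET” survives, whereas the looser threshold lets “GET abc” become frequent and displace the true keyword under the longest-substring rule.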

The second challenge is to deal with binary protocols. It is easy to define and understand the protocol keywords that bound the message fields in text protocols which restrict their content to printable ASCII characters. However, for binary protocols, fields are predefined by the protocol specifications to represent specific meanings instead of using the protocol keywords as the preambles. Messages containing only fixed-length fields are not difficult to recover. However, the complexity will increase dramatically when the fields are variable in length.

The third challenge is to determine the location relationship of message fields. The relationship between fields ranges from sequential to juxtapositional. For example, in an HTTP request message, the request method field “GET” and the HTTP version field “HTTP/1.1” are in a sequential relation, which means that “GET” must occur somewhere before “HTTP/1.1” and the locations of the two fields cannot be exchanged. Other fields, such as the “Host” field and the “Server” field, are in a juxtapositional relation, which means that their locations can be exchanged with each other.

In this paper, we apply a probabilistic model, the hidden semi-Markov model (HsMM) [13], to address these challenges. On the one hand, the HsMM allows us to find the optimal length of a protocol keyword as the one with maximal likelihood probability. A keyword length chosen by maximum likelihood is better founded than the empirical rule of choosing the longest frequent substrings. On the other hand, the HsMM is a probabilistic directed graph (lattice). Each node in the lattice represents a state that can emit various observations. The states in the same longitude are in a sequential relation, while states in the same latitude are in a juxtapositional relation. Therefore, it is natural to use the HsMM to model the sequential and juxtapositional relations of fields.

The organization of this paper is as follows. In Section 2, related work about protocol reverse engineering is studied. In Section 3, a brief review of the concept and definition about HsMM is illustrated. In Section 4, the proposed method of modeling message format using HsMM is presented in detail. In Section 5, the system architecture is presented and some implementation issues are discussed. In Section 6, the proposed method is evaluated and the experiment results are shown. Finally, a conclusion is made in Section 7.

2. Related Work

Over the past few years, automatic protocol reverse engineering has attracted tremendous interest in both the research and industry communities of computing and networking. A number of works have been published on this hot topic. Beddoe [7] proposes to use algorithms widely used in the field of bioinformatics, that is, sequence alignment algorithms and a phylogeny construction algorithm, to determine the location and size of fields in each individual packet. Beddoe presents this effort in the protocol informatics project and implements the approach in Python to extract the longest common subsequence (LCS) as message fields with constant value. Kreibich and Crowcroft [8] introduce a novel variant of the Jacobson-Vo algorithm [14] to compute the LCSs of input strings and employ a flexible gap-minimising algorithm to improve the efficiency and effectiveness of network traffic alignment. The authors show that their method outperforms the commonly used Smith-Waterman approach on a wide range of network protocols. Both Beddoe [7] and Kreibich and Crowcroft [8] mine the commonalities of messages as the basic components of message formats based on LCS, whereas our approach infers the location and length of message fields based on the maximal likelihood probability.

Cui et al. present Discoverer [15], which recursively clusters and aligns the token patterns of messages to infer protocol message format idioms. Although Discoverer is able to recover the protocol message formats of three selected protocols, that is, HTTP, RPC, and SMB/CIFS, about 10% of the message formats cannot be correctly inferred due to inaccurate parsing. Discoverer first tokenizes the protocol messages and initially clusters messages according to the token patterns. Thus, the lengths of fields are artificially forced to be consistent with the size of tokens, and the boundaries of message fields in text protocols are restricted to the separators (such as space) specified by the authors. Moreover, the relationship between fields in message formats inferred by Discoverer is only sequential. In our approach, we do not make any assumption about the separators and aim to infer the optimal length of fields by maximizing the likelihood probability of message segmentation. Meanwhile, we capture the location relationship of fields, such as sequential and juxtapositional relations, by learning a probabilistic directed lattice graph.

Wang et al. [16] present a framework to infer message formats by improving the Aho-Corasick (AC) algorithm [17] to identify frequent sequences and mining the association rules among the frequent sequences. They evaluate the framework in wireless environment and show that the framework can identify ARP and ICMP packets in high accuracy. However, their framework only searches for association rules of some frequent fields in protocol messages, while the aim of our scheme is to infer the whole format of message by inferring all of the message fields.

Wang et al. propose Biprominer [18] to extract binary protocol message formats based on the statistical nature of message formats. First, Biprominer recursively learns and labels frequent patterns in the messages based on the frequency of blocks (each comprising several bytes). Then, the messages with labeled blocks are converted into a transition probability model. Antunes and Neves [19] build an automaton based on a sequence alignment algorithm for recovering message formats from a network trace. They first extend the partial order alignment algorithm to generate an initial automaton from messages, then apply sequence alignment techniques to find the optimal alignment between the automaton and newly arriving messages, and finally use the alignment results to further extend the automaton. These studies focus on modeling the transition probability of message blocks or finding the acceptable paths of bytes in the automaton, while our work aims to identify message fields with variable length as well as to model the location relationship of fields.

Some works leverage the semantics analysis of message fields to infer message formats. The so-called semantics analysis is to identify the keyword sequences, each of which indicates a specific intention of the protocol message. Krueger et al. [20] present a semantics-aware tool for network payloads analysis to automatically extract semantics-aware components from captured network trace. They map protocol messages to a vector space based on tokens or words and identify communication templates corresponding to the base directions in the vector space. Wang et al. propose ProDecoder [21] to reconstruct the message formats based on semantics-aware approach. ProDecoder first identifies keywords using Latent Dirichlet Allocation (LDA) models taken from natural language processing. Protocol messages are then clustered according to their semantics (different combination of keywords) using the Information Bottleneck clustering algorithm. Finally, messages in each cluster are aligned to find out the common parts among them using well-known sequence alignment algorithms. These methods aim to reveal the semantics characteristics of protocol messages under specific communication motivations, so the message formats are expected to be affected by the user intentions. However, our method captures the general structures of messages of the target protocol.

As an alternative approach to understanding unknown or proprietary protocols, binary analysis based techniques also draw much research attention in the field of network security. For example, Polyglot [22], Tupni [23], AutoFormat [24], Prospex [25], and Dispatcher [26] are all systems based on binary analysis techniques. They are applicable in scenarios where the binary software is available and can be run in a controlled environment. However, binary analysis techniques cannot work when the binary clients apply interference techniques, such as obfuscation, to protect themselves from being detected and reverse-engineered. In this paper, we restrict our research to the scenario in which only the network trace of the target protocol is available. Hence, we do not discuss these binary analysis based techniques but focus on approaches based on network traces.

3. Hidden Semi-Markov Models

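A hidden semi-Markov model (HsMM), as shown in Figure 1, is an extension of the hidden Markov model (HMM) that allows the underlying process to be a semi-Markov chain with a variable duration time for each state [13, 27].

The basic elements of HsMM include the hidden state set $\mathcal{S} = \{1, 2, \ldots, M\}$, the state duration set $\mathcal{D} = \{1, 2, \ldots, D\}$, and the observation set $\mathcal{V} = \{v_1, v_2, \ldots, v_K\}$.

The hidden state of the underlying process at time $t$ is denoted as $S_t$. The symbols $i$ and $j$ are used to represent substantive values of the state variable $S_t$. For simplicity of notation, we denote the following:
(i) $S_{t_1:t_2} = i$ means $S_t = i$ for all $t_1 \le t \le t_2$; however, the previous state $S_{t_1-1}$ and the next state $S_{t_2+1}$ may or may not be $i$.
(ii) $S_{[t_1:t_2]} = i$ means $S_{t_1:t_2} = i$; however, neither $S_{t_1-1}$ nor $S_{t_2+1}$ is $i$.
(iii) $S_{t_1:t_2]} = i$ means $S_{t_1:t_2} = i$ and $S_{t_2+1} \neq i$; however, the previous state $S_{t_1-1}$ may or may not be $i$.
(iv) $S_{[t_1:t_2} = i$ means $S_{t_1:t_2} = i$ and $S_{t_1-1} \neq i$; however, the next state $S_{t_2+1}$ may or may not be $i$.

As shown in Figure 1, the observation sequence $o_{1:T} = o_1 o_2 \cdots o_T$ is the observable process, while the state sequence $S_{1:T}$ and the state transitions $(i_n, d_n) \to (i_{n+1}, d_{n+1})$ are the underlying process that cannot be observed. For each pair $(i_n, d_n)$ in the underlying process, $d_n$ is the time duration of state $i_n$.

Formally, a HsMM can be represented by $\lambda = (A, B, P, \Pi)$, where $A$ is the state transition probability matrix, $B$ is the emission probability matrix, $P$ is the distribution of state durations, and $\Pi$ is the initial distribution of states.

The state transition probability matrix is defined as $A = [a_{ij}]_{M \times M}$, where $a_{ij} = P[S_{[t+1} = j \mid S_{t]} = i]$ is the probability that state $j$ starts at time $t+1$ given that state $i$ ends at time $t$, subject to $\sum_{j \in \mathcal{S}} a_{ij} = 1$ and zero self-transition probabilities $a_{ii} = 0$, for all $i \in \mathcal{S}$.

The emission probability matrix is defined as $B = [b_i(v_k)]_{M \times K}$, where $b_i(v_k) = P[o_t = v_k \mid S_t = i]$ means that $v_k$ is observed at time $t$ in state $i$.

The distribution of the state duration is $P = [p_i(d)]_{M \times D}$, where $p_i(d) = P[S_{t+1:t+d]} = i \mid S_{[t+1} = i]$ is the probability that state $i$, having started at time $t+1$, lasts for exactly $d$ time units.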

The initial distribution of states indicates the probability of the initial state before time ; that is,
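To make the roles of the four parameter groups concrete, the following minimal sketch samples from an explicit-duration HsMM. It is an illustration under stated assumptions rather than part of the proposed method: the function name `sample_hsmm`, the list/dict layout of the parameters, and the choice to draw the first state at time 1 are all hypothetical.

```python
import random

def sample_hsmm(pi, A, B, P, T, rng):
    """Sample (states, observations) of length T from an explicit-duration HsMM.

    pi[i]   : initial state probability
    A[i][j] : state transition probability, with A[i][i] == 0
    B[i]    : dict mapping symbol -> emission probability in state i
    P[i]    : dict mapping duration d >= 1 -> probability in state i
    """
    def draw(dist):
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys])[0]

    states, obs = [], []
    state = draw(dict(enumerate(pi)))
    while len(obs) < T:
        duration = draw(P[state])          # how long the state lasts
        for _ in range(duration):
            if len(obs) == T:              # truncate the last segment at T
                break
            states.append(state)
            obs.append(draw(B[state]))     # one emission per time unit
        state = draw(dict(enumerate(A[state])))  # never a self-transition
    return states, obs

# Hypothetical two-state model: state 0 lasts 3 steps, state 1 lasts 2 steps.
pi = [1.0, 0.0]
A = [[0.0, 1.0], [1.0, 0.0]]
B = [{"K": 1.0}, {"d": 1.0}]
P = [{3: 1.0}, {2: 1.0}]
states, obs = sample_hsmm(pi, A, B, P, T=8, rng=random.Random(0))
print(states, "".join(obs))
```

Each visited state first draws its duration and then emits one observation per time unit, which is exactly what distinguishes the semi-Markov chain from an ordinary HMM with geometric dwell times.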

4. Protocol Modeling

4.1. Modeling Network Protocol Using HsMM

A network protocol is a set of rules for regulating the exchange of messages in the Internet. The specification of a network protocol describes the strict syntactical format of valid messages and defines the strict procedure rules of data exchange. The alphabet of valid messages is the set of all possible values of a single byte; that is, $\Sigma = \{\mathtt{0x00}, \mathtt{0x01}, \ldots, \mathtt{0xFF}\}$.

A string $s$ over $\Sigma$ is defined as a finite sequence of letters in $\Sigma$; that is, $s = c_1 c_2 \cdots c_n$ ($c_k \in \Sigma$, $1 \le k \le n$). The set of all finite strings over alphabet $\Sigma$ is represented as $\Sigma^*$.

The protocol message, denoted as $m$, is defined as the basic data unit exchanged between different communicating entities of an application-layer protocol. A message consists of a set of message fields, including keyword fields and data fields, as shown in Figure 2. The message fields, denoted as $f$, are strings over $\Sigma$; that is, $f \in \Sigma^*$.

The valid messages exchanged by communicating entities are constructed according to the protocol message format. The relationship of field location in the message format is varying from sequential to juxtapositional. For example, according to the HTTP specification, message fields , , and in Figure 2 are of sequential relation; that is, the location of must go after but preceded . However, the relation of fields and is juxtapositional that means the location of and can be exchanged with each other.

In order to model message format using HsMM, protocol message is treated as an observation sequence representing the observable process. Each field is a block of observations associated with a specific hidden state with the length of this field as the corresponding state duration. For example in Figure 3, is the block of observations from to associated with state and duration . In this model, the emission probability matrix implies the relationship between observations and hidden states, while the state transition probability matrix implies the relationship of field location.

Let $O = o_1 o_2 \cdots o_T$ be an observation sequence and let $F$ be the set of frequent strings that occurred in $O$. Given $s_1, s_2 \in F$, we denote that $s_1 \subset s_2$, if $s_1$ is a substring of $s_2$. The string $s_1$ is closed in $F$, if there does not exist a string $s_2 \in F$ to satisfy $s_1 \subset s_2$. The set of closed frequent strings in $F$ is denoted as $F_c$.

Each closed string in is associated with different hidden states; thus, the number of hidden states for closed string in is . Suppose that is associated with state ; then all characters in are observations of state .

Additionally, we define other special states () which are associated with any characters in .

4.2. Parameters Reestimation

In this section, we discuss an iterative procedure for reestimating the parameters of the HsMM, based on the Baum-Welch method [28]. At the beginning, the state transition probability matrix and the initial distribution of states are initialized randomly, while the emission probability matrix and the distribution of state durations are initialized as follows.

For , , if , where is the closed frequent string associated with . Otherwise, , if does not occur in .

For , the emission probability of letter in state is .

For and ,

In the forward-backward procedure, the forward variable is defined as where is the remaining time of the current state .

Initially, .

The inductive solution for when is given as follows:

The backward variable is defined as

Initially, .

The inductive solution for when is given as follows:

We define the probability that the state ends at time , while the state starts at time , given the entire observation sequence , as follows:

The probability that the state ends at time with its duration being , given the entire observation sequence , is defined as

The probability that the state at time is , given the entire observation sequence , is defined as

In order to solve for , we consider the following identities:

Thus, we have a recursive formula for as follows:

In the phase of recursively computing , the initial condition is given as follows:

With these notations, the parameters of can be updated and improved by the following equations:

Note that , if is true. Otherwise , if is not true.

4.3. Inferring Protocol Keywords

Given the reestimated HsMM and an observation sequence , the forward and backward variables can be computed based on forward-backward algorithm. Then, the variable can be computed using (16). In what follows, we can infer the state sequence with maximal likelihood probability based on the Viterbi algorithm [29]. The inference procedure is given as follows:

The iteration proceeds until . Thus, the observation is divided into a sequence of fields with the th field to be . is referred to as the state of . If , is a protocol keyword with the corresponding field as keyword field. If , then is a data string and the corresponding field is a data field.
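The segmentation idea can be sketched with a generic segmental Viterbi recursion for an explicit-duration model. This is not the exact inference procedure above; it is a simplified illustration in the log domain, and the function names, the dict-based parameters, the assumption of at least two states, and the toy two-state model (state 0 for the keyword “GET”, state 1 for data) are all assumptions.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """log(p) that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def segment(obs, pi, A, B, P, max_dur):
    """Return the most likely list of (state, field) segments for `obs`."""
    T, M = len(obs), len(pi)
    # best[t][j]: best log-probability of obs[:t] with a state-j segment ending at t
    best = [[NEG_INF] * M for _ in range(T + 1)]
    back = [[None] * M for _ in range(T + 1)]
    for t in range(1, T + 1):
        for j in range(M):
            emit = 0.0  # log-probability of emitting obs[t-d:t] in state j
            for d in range(1, min(max_dur, t) + 1):
                emit += safe_log(B[j].get(obs[t - d], 0.0))
                if emit == NEG_INF:
                    break
                seg = emit + safe_log(P[j].get(d, 0.0))
                if d == t:   # first segment of the message
                    score, prev = safe_log(pi[j]) + seg, None
                else:        # best predecessor state i != j ending at t-d
                    score, prev = max(
                        (best[t - d][i] + safe_log(A[i][j]) + seg, i)
                        for i in range(M) if i != j)
                if score > best[t][j]:
                    best[t][j], back[t][j] = score, (d, prev)
    j = max(range(M), key=lambda s: best[T][s])
    if best[T][j] == NEG_INF:
        raise ValueError("no valid segmentation under this model")
    fields, t = [], T
    while t > 0:             # trace back segment by segment
        d, prev = back[t][j]
        fields.append((j, obs[t - d:t]))
        t, j = t - d, prev
    return fields[::-1]

# Hypothetical model: state 0 emits the keyword "GET", state 1 emits data.
pi = [0.5, 0.5]
A = [[0.0, 1.0], [1.0, 0.0]]
B = [{"G": 1 / 3, "E": 1 / 3, "T": 1 / 3}, {c: 0.2 for c in "GETab"}]
P = [{3: 1.0}, {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}]
fields = segment("GETabGET", pi, A, B, P, max_dur=4)
print(fields)
```

The field boundaries fall out of the maximization over durations, so no separator or fixed token size has to be assumed.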

4.4. Inferring Message Type

In this section, we present an algorithm to determine the type of protocol messages. The messages which belong to the same type have similar formats with each other. Thus, the type of protocol messages can be determined using clustering method according to the similarities between their message formats.

In this paper, we apply an unsupervised clustering algorithm proposed by Frey and Dueck [30] to solve the problem. The algorithm, based on the affinity propagation mechanism, takes the similarity matrix of data points as input and iteratively selects representative exemplars for each point. Each of the selected exemplars represents a data type, while the type of every other data point is determined by the exemplar it selects. The number of clusters need not be specified beforehand. The similarity metric need not be defined in a continuous space and need not satisfy symmetry or the triangle inequality. Therefore, we can define the similarity in any reasonable way.

Before the further discussion about the message clustering algorithm, we define some basic notations. Suppose the universal set of protocol keywords is denoted as and the set of protocol keywords that occurred in message is denoted as . Given a protocol keyword of message , the cost of encoding in using the keyword set of message using as the code book is defined as

The similarity of one message to another is defined as the negative sum of the costs of encoding all keywords of the former using the keyword set of the latter as the code book.

The affinity propagation algorithm exchanges two kinds of information between data points during the clustering process: responsibility ($r(i,k)$) and availability ($a(i,k)$). The “responsibility” $r(i,k)$, sent from an ordinary data point $i$ to the candidate exemplar point $k$, reflects the accumulated evidence for how well-suited point $k$ is to serve as the exemplar for point $i$, taking into account other potential exemplars for point $i$. The “availability” $a(i,k)$, sent from candidate exemplar point $k$ to point $i$, reflects the accumulated evidence for how appropriate it would be for point $i$ to choose point $k$ as its exemplar, taking into account the support from other points that point $k$ should be an exemplar.

In this paper, we treat each message as a data point, and the responsibility and availability are updated according to the following equations, where $s(i,k)$ is the similarity of point $i$ to point $k$:

$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \{ a(i,k') + s(i,k') \},$$

$$a(i,k) \leftarrow \min \Big\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\} \Big\}.$$

Specially, $a(k,k)$ is updated by

$$a(k,k) \leftarrow \sum_{i' \neq k} \max\{0, r(i',k)\}.$$

The affinity propagation algorithm clusters messages into subclusters, each of which represents a type of messages. The results of message type inference are important for constructing protocol state machine which will be discussed in our future work.
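The message-passing scheme of Frey and Dueck can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in our system: the damping factor, the iteration count, the toy one-dimensional points, the negative squared distance used as similarity, and the preference value of -20 on the diagonal are all assumptions (the keyword encoding cost defined above would be plugged in as `S` instead).

```python
def affinity_propagation(S, damping=0.5, iterations=200):
    """Cluster n points from an n x n similarity matrix S (n >= 2).

    S[k][k] holds the preference of point k to be an exemplar.
    Returns, for every point, the index of its exemplar.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]   # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]   # availabilities  a(i, k)
    for _ in range(iterations):
        for i in range(n):
            for k in range(n):
                rival = max(A[i][kk] + S[i][kk] for kk in range(n) if kk != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        for k in range(n):
            pos = [max(0.0, R[ii][k]) for ii in range(n)]
            support = sum(pos) - pos[k]          # sum over i' != k
            for i in range(n):
                if i == k:
                    new = support                # self-availability a(k, k)
                else:
                    new = min(0.0, R[k][k] + support - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:                            # degenerate case: no exemplar emerged
        return list(range(n))
    return [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
            for i in range(n)]

# Hypothetical data: two well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12]
S = [[-float((a - b) ** 2) for b in points] for a in points]
for k in range(len(points)):
    S[k][k] = -20.0                              # shared preference
labels = affinity_propagation(S)
print(labels)
```

Because the updates only read `S[i][k]` row by row, an asymmetric similarity such as an encoding cost can be used without modification.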

5. System Implementation

In this section, we will illustrate an overview of our system architecture and discuss some implementation issues which have to be addressed when one implements the proposed approach.

5.1. System Overview

A brief view of our system architecture is shown in Figure 4. Training data set is raw traffic captured from real world using a well-known network traffic analysis tool called tshark [31].

Since well-known protocols, such as HTTP, are well studied and described in public documents, almost all popular network traffic analyzers can identify these protocols, so the ground truth for well-known protocols is easy to obtain. As a result, we use some well-known protocols to validate and evaluate our approach in this paper and assume that the training data set is generated by only one protocol.

In the session reconstruction phase, we reconstruct the sessions according to the 5-tuple, that is, transport protocol, source port number, destination port number, source IP address, and destination IP address. For TCP-based protocols, a session starts at the packet with the SYN flag set in the TCP header and finishes when the FIN flag is acknowledged. For UDP-based protocols, a session is defined as the set of packets sharing the same 5-tuple.
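Grouping packets by 5-tuple can be sketched as follows. This is a simplified illustration under stated assumptions: packets are plain dictionaries with hypothetical field names (`proto`, `src_ip`, `src_port`, `dst_ip`, `dst_port`, `ts`), both directions of a conversation are mapped to the same key, and the SYN/FIN handling for TCP is omitted.

```python
from collections import defaultdict

def session_key(pkt):
    """Direction-independent 5-tuple key for a packet."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return (pkt["proto"],) + tuple(sorted([a, b]))

def group_sessions(packets):
    """Group packets into sessions, each ordered by arrival time stamp."""
    sessions = defaultdict(list)
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        sessions[session_key(pkt)].append(pkt)
    return dict(sessions)

# Hypothetical packets: one HTTP exchange and one unrelated DNS query.
packets = [
    {"proto": "TCP", "src_ip": "10.0.0.2", "src_port": 5000,
     "dst_ip": "10.0.0.9", "dst_port": 80, "ts": 1.0},
    {"proto": "UDP", "src_ip": "10.0.0.2", "src_port": 5353,
     "dst_ip": "10.0.0.1", "dst_port": 53, "ts": 1.5},
    {"proto": "TCP", "src_ip": "10.0.0.9", "src_port": 80,
     "dst_ip": "10.0.0.2", "dst_port": 5000, "ts": 2.0},
]
sessions = group_sessions(packets)
print({key: len(pkts) for key, pkts in sessions.items()})
```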

In the message reassembling phase, messages of TCP-based protocols are reassembled from packets according to the TCP sequence number and acknowledgement number while the messages of UDP-based protocols are reassembled according to the arrival time stamp of packets and the transmission direction of packets.

In the HsMM modeling step, an algorithm based on the Baum-Welch method is performed to reestimate the parameters of the HsMM-based protocol model. The reestimated HsMM model produced by this step implies the message format.

In the message segmentation phase, the reestimated HsMM model is applied to determine the optimal length of protocol keywords and divide message into field sequence.

In the step of message type inference, protocol messages are clustered using the affinity propagation mechanism and each cluster represents a type of messages.

5.2. Extracting Closed Frequent Strings

Suppose that $F$ is a frequent string set. If $s \in F$ and there does not exist $s' \in F$ ($s' \neq s$) such that $s$ is a substring of $s'$, then $s$ is a closed frequent string in $F$. In this section, the Apriori algorithm [32], widely used in the data mining field, is introduced and modified to address the problem of mining closed frequent strings, as shown in Algorithm 1.

(1)   Input: observation , frequency threshold
(2)   Output: closed frequent string set
(3)  # Find out the frequent strings
(4)   Initialization: frequent candidate set ,
(5)   while    do
(6)   # Check frequency of strings in
(7)  for    do
(8)    # is the frequency of in
(9)    if    then
(10)    Delete from
(11)     end if
(12)   end for
(13)   # Generate new candidate set
(14)   for    do
(15)   if    then
(16)    Create a new string
(17)    Let ,
(18)    Add to
(19)   end if
(20)end for
(21);
(22)end while
(23)# Find out the closed frequent strings
(24)Initialization:
(25)while    do
(26)for    do
(27)   for    do
(28)    # delete the substrings
(29)    if    then
(30)     Delete from
(31)      Break
(32)    end if
(33)   end for
(34)end for
(35)  Update
(36) 
(37)end while
(38) Update

The frequent string candidate set is initialized as , each element in which represents a one-byte character (line (4)). Note that the length of each element in is . The frequencies of elements in are checked and the ones whose frequencies are less than the frequency threshold would be deleted from (lines (6)~(12)). The candidates of frequent strings with length of are generated in lines (14)~(20), where the notation represents the byte sequence from the first byte to th byte in . If and the first characters of are equal to the last characters of , then the two strings can be combined into a new string by merging their overlap; that is, , and . Lines (24)~(38) aim to find out the closed frequent strings by deleting any strings in if and only if they are the substrings of some elements in .
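The level-wise mining described above can be sketched compactly in Python. This follows the same two phases (grow frequent strings by merging overlapping candidates, then drop every string contained in a longer frequent one), but it is a simplified illustration and not a line-by-line transcription of Algorithm 1; the function names, the overlapping occurrence count, and the toy trace are assumptions.

```python
def count_occurrences(data, s):
    """Count (possibly overlapping) occurrences of s in data."""
    n = len(s)
    return sum(1 for i in range(len(data) - n + 1) if data[i:i + n] == s)

def closed_frequent_strings(data, threshold):
    """Return frequent strings that are not substrings of a longer frequent string."""
    # Level 1: frequent single bytes.
    level = {data[i:i + 1] for i in range(len(data))}
    level = {s for s in level if count_occurrences(data, s) >= threshold}
    frequent = set()
    while level:
        frequent |= level
        k = len(next(iter(level)))        # all strings in a level share length k
        # Merge a and b when the last k-1 bytes of a equal the first k-1 bytes of b.
        candidates = {a + b[-1:] for a in level for b in level
                      if a[1:] == b[:k - 1]}
        level = {s for s in candidates
                 if count_occurrences(data, s) >= threshold}
    # Keep only the closed strings.
    return {s for s in frequent
            if not any(s != t and s in t for t in frequent)}

# Hypothetical trace in which "GET " occurs three times.
closed = closed_frequent_strings(b"GET a GET b GET c", threshold=3)
print(closed)
```

On this toy trace the only closed frequent string is `b"GET "`, trailing space included, which illustrates why closed frequent strings serve only as candidates whose final length is decided by the HsMM.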

5.3. Underflow Problem

The joint probabilities of observation sequence often decay exponentially as the sequence length increases, which leads to a severe underflow problem when the forward-backward algorithms are implemented in a real computer. To the best of our knowledge, there are three approaches to solve this problem.

Firstly, one can implement the forward-backward algorithm in the logarithmic domain to avoid the underflow problem [33].

Secondly, one can also refine the forward-backward algorithm based on the notion of posterior probabilities to make the HsMM robust against the underflow problem. The refined forward-backward algorithms replace the joint probabilities with conditional ones and automatically avoid the underflow problem without increasing the computational complexity. More information about the posterior probabilities and refined HsMM based on conditional joint probabilities can be found in the work by Yu [13].

Thirdly, the forward-backward probabilities are adjusted by multiplying a scaling factor whenever an underflow is likely to occur [27, 34, 35]. In this paper, we tackle the underflow problem of HsMM based on this scaling method. In each , we first compute based on the procedure of (12) and then compute the scaling factor in time , denoted as , as follows: where is the number of states in the HsMM.
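The effect of scaling can be illustrated on a plain HMM forward pass, which is shorter than the HsMM recursion but exposes the same mechanism. This is a sketch under that simplifying assumption; the function name and the uniform two-state model are hypothetical.

```python
import math

def forward_scaled(obs, pi, A, B):
    """Scaled forward pass for a plain HMM; returns log P(obs | model)."""
    M = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(M)]
    log_likelihood = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(M)) * B[j][obs[t]]
                     for j in range(M)]
        c = 1.0 / sum(alpha)              # scaling factor at time t
        alpha = [a * c for a in alpha]    # scaled forward variables sum to 1
        log_likelihood -= math.log(c)     # log P = -sum_t log c_t
    return log_likelihood

# Hypothetical uniform model: P(obs) = 0.5 ** T, which underflows for T = 2000.
pi = [0.5, 0.5]
A = [[0.5, 0.5], [0.5, 0.5]]
B = [[0.5, 0.5], [0.5, 0.5]]
T = 2000
naive = 0.5 ** T                          # underflows to 0.0 in double precision
scaled = forward_scaled([0] * T, pi, A, B)
print(naive, scaled)
```

The unscaled joint probability collapses to zero long before the end of the sequence, while the sum of the logarithms of the scaling factors still recovers the log-likelihood.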

For the term in the backward algorithm, we use the same scaling factors for each time as we used for in the forward algorithm; that is,

As stated by Rabiner [27], the scaling factors will not affect the transition variable , initial state probability distribution , and the observation matrix . However, the procedure for computing is changed as follows:

In order to avoid the underflow problem, we prefer to calculate the logarithmic form of :

6. Evaluation

In this section, we evaluate the proposed approach on data sets captured from the Internet entrance of our department on 23 December 2013. The data set contains network trace generated by six protocols, including two text-based protocols (HTTP and SSDP) and four binary-based protocols (BitTorrent, QQ, DNS, and NetBIOS).

Existing algorithms, namely PI (protocol informatics) and Discoverer, are also applied to analyze the same data set. The PI project has released open-source Python code on its project home page [7], so we run this code on the data set. The Discoverer system is implemented according to the work presented by Cui et al., and the parameters are set as reported in their work [15].

6.1. Protocol Keyword Extraction

Since there is no information about protocol keywords of binary protocols in published protocol specifications, we only evaluate protocol keyword extraction for text-based protocols (i.e., HTTP and SSDP) in this section. We use the metrics of recall and precision to evaluate the quality of keyword extraction. These metrics are defined as follows:
(i) Recall: the ratio of the number of true positives among the inferred keywords to the total number of keywords in the data set.
(ii) Precision: the ratio of the number of true positives among the inferred keywords to the total number of inferred keywords.
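Both metrics reduce to simple set operations. The following sketch uses hypothetical keyword sets, not results from our data set, and the function name `precision_recall` is an assumption.

```python
def precision_recall(inferred, truth):
    """Return (precision, recall) of inferred keywords against the true keywords."""
    inferred, truth = set(inferred), set(truth)
    true_positives = len(inferred & truth)
    precision = true_positives / len(inferred) if inferred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical example: 3 of 4 inferred keywords are correct, 5 true keywords exist.
inferred = {"GET", "POST", "HTTP/1.1", "abc"}
truth = {"GET", "POST", "HTTP/1.1", "Host", "HEAD"}
precision, recall = precision_recall(inferred, truth)
print(precision, recall)
```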

We randomly select 100 connections of each protocol and only consider the first 1460 bytes of each message (long enough to contain the headers of protocol messages). The results of protocol keyword extraction are shown in Table 1, where “Discv” represents the Discoverer system and “PI” represents the PI project. The column “true keyword” records the true number of protocol keywords that occurred in the trace, while the column “inferred keyword” records the number of inferred keywords. Compared with Discoverer and the PI project, the HsMM-based method achieves more true positives and higher precision and recall. We found that Discoverer infers too many keywords, while the PI project infers too few.

In fact, our approach infers far more strings than there are true keywords. Most of the additional strings, such as frequently used parameters, occur frequently and are indispensable in the protocol messages. We therefore also treat them as protocol keywords; they play an important role in inferring message formats and analyzing the protocol state machine.

We also found that the proposed HsMM-based approach extracts not only frequent keywords but also some keywords that occur with low frequency.

6.2. Protocol Message Format Inference

Figure 5 illustrates the results produced by PI, which infers message formats as the longest common substrings of protocol messages. As shown in Figure 5, PI infers only a few protocol keywords (such as “GET”) and fields, so it does not appear to be effective at generating message formats.

Tables 2–4 present the HTTP results for Discoverer, PI, and HsMM in a similar form for ease of comparison. Discoverer infers message formats from token sequences and determines the attribute of each token, such as constant or variable. It infers far more protocol keywords (such as “HTTP/1.1” and “Host:”) than PI. However, it also infers as keywords some frequent strings (e.g., “ocspd” and “x86_64”) that are not protocol keywords.

The proposed approach embeds the message formats into an HsMM-based protocol model. For each protocol, we train an HsMM by iteratively re-estimating the model parameters, including the initial state probabilities, the state transition probability matrix, and the observation probability matrix. Using the HsMM, we determine the optimal lengths of protocol keywords and infer the optimal segmentation of each protocol message by maximizing the likelihood probability.

The parameters of a trained HsMM-based HTTP model are shown in Figure 6. In this model, the number of states is assigned to . As shown in Figure 6(c), the observations are mainly distributed in the area of for each state, while the probabilities of observations located in are much smaller. We can also find that the duration of each state is mainly distributed between and , which means that the lengths of most fields in the message vary from to .

When the HsMM is used to analyze an observation sequence (such as protocol message or network flow), a lattice, as shown in Figure 7, is constructed to compute the forward variable based on the model parameters. The state at each time may be in one of states in , while each state may emit multiple observations (characters varying from to ) with different probabilities, so HsMM could reveal the characteristics of both sequential fields and juxtapositional fields and such lattice implies the message formats. When forward variable computing is finished, the Viterbi algorithm can be applied to infer an optimal path which leads to a message field sequence with maximal likelihood probability. An example of inferring message fields based on our approach is illustrated in Figure 8. In this illustration, protocol keywords are labeled with states less than , while other fields are labeled with states between and . Some - pairs are found in the message, such as the -fields labeled with state (in green) and -fields labeled with state (in light blue). IP address and some number sequence are labeled with state . We also found that the carriage return line feed (i.e., “0D0A” in hex) is labeled with the state of .
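The idea of recovering a maximum-likelihood field sequence can be illustrated with a generic explicit-duration Viterbi recursion in the log domain. This is a simplified sketch under stated assumptions, not the formulation used in this paper: the function `hsmm_viterbi`, the two states "K" (keyword) and "V" (value), and all probability values below are hypothetical toy choices.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def hsmm_viterbi(obs, pi, trans, dur, emit):
    """Split obs into (state, field) pairs with maximal likelihood.

    pi[j]: initial probability of state j
    trans[i][j]: probability of moving from state i to state j
    dur[j][d]: probability that state j lasts for d observations
    emit[j][o]: probability that state j emits observation o
    """
    states = list(pi)
    T = len(obs)
    # best[t][j]: best log-probability of segmenting obs[:t] so that the
    # last field ends exactly at t and is labeled with state j.
    best = [dict.fromkeys(states, NEG_INF) for _ in range(T + 1)]
    back = [dict.fromkeys(states) for _ in range(T + 1)]
    for t in range(1, T + 1):
        for j in states:
            emit_lp = 0.0
            for d in range(1, min(t, max(dur[j])) + 1):
                # Extend the candidate field one observation to the left.
                emit_lp += log(emit[j].get(obs[t - d], 0.0))
                if emit_lp == NEG_INF:
                    break  # longer fields ending at t are impossible too
                start = t - d
                if start == 0:
                    prev_lp, prev_state = log(pi[j]), None
                else:
                    prev_state = max(
                        states,
                        key=lambda i: best[start][i] + log(trans[i].get(j, 0.0)),
                    )
                    prev_lp = best[start][prev_state] + log(
                        trans[prev_state].get(j, 0.0)
                    )
                score = prev_lp + log(dur[j].get(d, 0.0)) + emit_lp
                if score > best[t][j]:
                    best[t][j] = score
                    back[t][j] = (start, prev_state)
    last = max(states, key=lambda j: best[T][j])
    if best[T][last] == NEG_INF:
        return []  # no segmentation has non-zero probability
    fields, t, j = [], T, last
    while t > 0:  # follow back-pointers from the end of the message
        start, prev_state = back[t][j]
        fields.append((j, obs[start:t]))
        t, j = start, prev_state
    return fields[::-1]


# Toy model (assumption): "K" emits a 3-character keyword, "V" a short value.
pi = {"K": 1.0, "V": 0.0}
trans = {"K": {"V": 1.0}, "V": {"K": 1.0}}
dur = {"K": {3: 1.0}, "V": {1: 0.5, 2: 0.5}}
emit = {"K": {"G": 1 / 3, "E": 1 / 3, "T": 1 / 3}, "V": {"a": 0.5, "b": 0.5}}
fields = hsmm_viterbi("GETab", pi, trans, dur, emit)
print(fields)
```

Because each state carries its own duration distribution, the field boundaries and the field labels are chosen jointly rather than by picking the longest common substring.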

6.3. Message Type Inference

The results of message type inference are important for the future work of constructing the protocol state machine. In our experiment, messages of one true type may be clustered into several inferred types. We can nevertheless treat them as several different types, since each inferred type represents a cluster of messages that share the same characteristics and have a similar message format.

In order to compute the accuracy of message type inference, we label each cluster with the true type that dominates it. The accuracy of message type inference is shown in Table 5.
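The labeling rule above can be sketched as follows. This is a minimal illustration, assuming accuracy is the fraction of messages whose true type equals the dominant type of their cluster; the function name `cluster_accuracy` and the example labels are hypothetical and are not the data behind Table 5.

```python
from collections import Counter


def cluster_accuracy(cluster_ids, true_types):
    """Label each cluster with its dominant true type and score the result."""
    members = {}
    for cluster, true_type in zip(cluster_ids, true_types):
        members.setdefault(cluster, []).append(true_type)
    # Messages that agree with the dominant type of their cluster are correct.
    correct = sum(
        Counter(types).most_common(1)[0][1] for types in members.values()
    )
    return correct / len(true_types)


# Hypothetical example: the true type "GET" is split across clusters 0 and 2.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_types = ["GET", "GET", "POST", "POST", "POST", "GET", "GET", "GET"]
accuracy = cluster_accuracy(cluster_ids, true_types)
print(accuracy)
```

Splitting one true type into several clusters does not lower this score; only messages that land in a cluster dominated by another type count as errors.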

6.4. Traffic Identification

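The proposed technique can be used for network traffic identification in the application of network management or network monitoring. Suppose $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_K\}$ is the set of learned HsMM-based protocol models, where $\lambda_k$ is the HsMM-based model of protocol $k$. For each session $s$, the class of $s$, denoted as $c(s)$, can be inferred as follows:

$$c(s) = \arg\max_{1 \le k \le K} P(s \mid \lambda_k),$$

where $P(s \mid \lambda_k)$ is the likelihood of $s$ given $\lambda_k$. $P(s \mid \lambda_k)$ can be computed by (30).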
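The decision rule amounts to scoring a session under every learned model and choosing the best-scoring protocol. The sketch below shows only that selection step. As a loudly labeled assumption, it replaces the HsMM likelihood of (30) with a simple per-byte unigram stand-in, and the model names "text-like" and "binary-like" are hypothetical.

```python
import math


def unigram_log_likelihood(session, model, floor=1e-6):
    """Stand-in scorer (assumption): independent per-byte log-likelihood.

    The paper scores sessions with the HsMM likelihood instead.
    """
    return sum(math.log(model.get(byte, floor)) for byte in session)


def identify(session, models, log_likelihood=unigram_log_likelihood):
    """Return the name of the model under which the session is most likely."""
    scores = {
        name: log_likelihood(session, model) for name, model in models.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy models: printable ASCII only versus all byte values.
models = {
    "text-like": {b: 1 / 95 for b in range(32, 127)},
    "binary-like": {b: 1 / 256 for b in range(256)},
}
label_text = identify(b"GET / HTTP/1.1", models)
label_bin = identify(bytes([0, 1, 2, 255]), models)
print(label_text, label_bin)
```

Any scorer with the same signature can be passed as `log_likelihood`, so a trained HsMM forward computation could be plugged in without changing `identify`.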

This scenario is related to previous work by Ma et al. [36] on traffic identification based on application-layer payload. Ma et al. build Markov models from the application-layer payload and apply them to identify network traffic. We also implement the Markov model as described in [36] and compare its results with ours, as shown in Figure 9. The results show that the proposed method outperforms the Markov-based method in traffic identification.

7. Conclusion

In this paper, protocol keywords and message fields are inferred with a hidden semi-Markov model by maximizing the likelihood probability of message segmentation. The segmentation of messages reveals some semantic information about the fields, such as keyword, IP address, and - pair. The proposed technique is applied to network traffic identification, where it outperforms an existing algorithm.

The proposed HsMM-based protocol message format can be applied to intrusion detection or anomaly detection. One can use the HsMM-based message format of normal traffic to calculate the average likelihood probability of newly arriving traffic and check whether it deviates from the normal level. Our method is also applicable to traffic identification, fuzz testing, vulnerability discovery, and so on.
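One way such a deviation check might look is sketched below. The paper does not specify a threshold rule, so the k-standard-deviation test, the function names, and the score values are all assumptions made for illustration; the scores stand for average log-likelihoods of sessions under a model of normal traffic.

```python
import statistics


def fit_baseline(normal_scores):
    """Summarize average log-likelihood scores observed on normal traffic."""
    return statistics.mean(normal_scores), statistics.stdev(normal_scores)


def is_anomalous(score, baseline, k=3.0):
    """Flag a score more than k standard deviations from the normal mean.

    The k-sigma rule is an assumption, not a rule given in the paper.
    """
    mean, std = baseline
    return abs(score - mean) > k * std


# Hypothetical average log-likelihoods of sessions known to be normal.
baseline = fit_baseline([-2.1, -2.0, -1.9, -2.2, -2.0])
normal_flag = is_anomalous(-2.05, baseline)
attack_flag = is_anomalous(-6.0, baseline)
print(normal_flag, attack_flag)
```

The threshold `k` trades false alarms against missed detections and would need to be tuned on real traffic.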

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This work was supported by The National Natural Science Foundation of China (61571141); Guangdong Natural Science Foundation (2014A030313637); Guangdong Provincial Department of Education Innovation Project (2014KTSCX149); The Excellent Young Teachers in Universities in Guangdong (YQ2015105); Guangdong Provincial Application-Oriented Technical Research and Development Special Fund Project (2015B010131017).