EDSM-Based Binary Protocol State Machine Reversing

: Internet communication protocols define the behavior rules of network components when they communicate with each other. With the continuous development of network technologies, many private or unknown network protocols are emerging in endlessly various network environments. Herein, relevant protocol specifications become difficult or unavailable to translate in many situations such as network security management and intrusion detection. Although protocol reverse engineering is being investigated in recent years to perform reverse analysis on the specifications of unknown protocols, most existing methods have proven to be time-consuming with limited efficiency, especially when applied on unknown protocol state machines. This paper proposes a state merging algorithm based on EDSM (Evidence-Driven State Merging) to infer the transition rules of unknown protocols in form of state machines with high efficiency. Compared with another classical state machine inferring method based on Exbar algorithm, the experiment results demonstrate that our proposed method could run faster, especially when deal-ing with massive training data sets. In addition, this method can also make the state machines have higher similarities with the reference state machines constructed from public specifications.

In the past years, traditional methods have proven to be usually time-consuming and to contain many errors. Open-source project Samba spent 12 years on manually mining Microsoft's SMB (Server Message Block) protocol specification and implementing cross-platform file and print sharing mechanisms. Afterwards, many automatic protocol reverses analyzing methods were proposed to improve the PRE's efficiency and accuracy, just like the aforementioned research work [18][19][20][21][22][23][24]. When reconstructing the state machine of an unknown protocol communication process, however, there has still been many problems with computation efficiency including state explosion and time cost [25,26]. This paper proposes a state merging algorithm based on EDSM [27] to automatically reconstruct the state machine of unknown binary protocols. Firstly, in order to transform protocol message exchange sessions from the form of message sequences into message type sequences, training samples pre-processing is carried out. Secondly, we use a heuristic state labeling algorithm to assign different labels to the protocol transition states. Finally, a state merging algorithm based on EDSM is proposed to complete the state machine reverse process to achieve the final state merging result as a general minimal DFA. This paper is organized as follows. We give formalized description of the protocol state machine and other definitions in Section 2, and explain the detailed theoretical design of this method in Section 3. In Section 4, several binary protocols are tested by our algorithm. The results demonstrate that the inferred protocol state machines have superior similarities with the reference state machines constructed by public specifications and have better processing capabilities aimed at large amounts of training data.

Problem Definition
Protocols regulate the transmission processes among network entities by defining message syntax and semantics as well as their exchange orders. A network protocol consists of three elements: (1) Syntax, (2) Semantic, and (3) Timing. Syntax refers to the message format and encoding. Semantic refers to the meaning of each field in a message. Timing rules the constraints and requirements of communication sequences and state transitions between protocol entities. Specifically, network entities generate responding messages and send them to each other in process of communication according to the semantic information that exists in message format specification and protocol state machines.
Bhargavan et al. [28] formalized the problem of message interaction between network entities as a problem of language identification. Inspired by their work, we give a formalized description of the problem of protocol specification mining, as shown below.

Definition 1:
A Protocol State Machine can be represented as a "six-tuple" FSM (Finite State Machine), as shown in Eq. (1): where Q = {q 0 , q 1 , . . . , q k } is a finite set of states; I = {i 1 , i 2 , . . . , i m } is an input set of message formats; O = {o 1 , o 2 , . . . , o n } is an output set of message formats; δ : Q × I → Q is the state transition function; λ : Q × I → O is the output function; q 0 is the initial state.
In particular, there are two differences between the FSM in Definition 1 and the classical automatic mechanism. Firstly, for the FSM in Definition 1, all states are acceptable without rejection states. Secondly, any prefix of a valid message sequence is acceptable, and any prefix of an invalid message sequence is acceptable except the last message.

Definition 2:
The communication between protocol entity D 1 and protocol entity D 2 is composed of a series of message sequences T 1 , . . . , T n . Each message sequence T i = {p 1 p 2 . . . p |T i | } is composed of a certain number of messages, where p i refers to the i th message, |T i | refers to the number of messages in T i .
According to Definition 2, we divide the PRE-process into two steps based on the scope of protocol specification: (1) message format and semantic mining, and (2) protocol state machine reverse. Besides, the first step is the basic of the second step. In this paper, we mainly focus on protocol state machine reverse.

The Proposed Protocol State Machine Reverse Method
In last section, we give the formalized description of the protocol state machines and explain the relationship between the message format, semantic mining and protocol state machine reverse.
In this section, we introduce the theoretical method design of our proposal. In Section 3.1, we describe how to pre-process the initial message of binary protocols in detail. In Section 3.2, the state labeling algorithm is described to assign different labels on states. Finally, the EDSMbased state merging algorithm is introduced to rebuild final protocol state machines of the target unknown protocol.

Pre-Processing of Messages
As the initial messages cannot be used as elements of the input set in Definition 1, we preprocess training samples to find state relevant fields to transform protocol sessions from the form of message sequences into the form of message type sequences. The main idea is using multiple sequence alignments to identify the variable length fields in messages, and then extracting state relevant fields by analyzing statistical characteristics of each field after removing variable length fields.

Identification of Variable Length Fields
Binary protocol messages are formed by a series of bytes, which usually consist of message headers and message bodies. The message header includes the message type, length and other fields. The length of a message header is usually fixed. However, in some complex binary protocols, there may be variable length fields, such as parameter fields. Therefore, we use multiple sequences alignment [29] to identify these variable length fields.
Multiple sequence alignment is essentially an NP-complete problem, which is always realized by heuristic algorithms. Here, we use a progressive multiple sequence alignment algorithm to solve the multiple message field segmentation problem. To implement a multiple sequence alignment, we create a phylogenetic tree to guide this process.
Once the phylogenetic tree is completed, a progressive multiple sequences algorithm can be achieved under the guidance of this phylogenetic tree. In fact, the basic operation of the progressive multiple sequence algorithm is global sequence alignment. In this paper, we apply the Needleman-Wunsch algorithm [30] to complete the global sequence alignment. The score matrix of messages A and B in the Needleman-Wunsch algorithm can be built based on the following equation.
In order to ensure that the fixed fields do not introduce gaps, and variable length fields could be aligned by adding gaps, we present a new substitution matrix for Needleman-Wunsch algorithm: initialize a match score of 2, mismatch score of −1, gap penalty of −3. The gap penalty is a special treatment in the algorithm running process: continuous gaps are counted only once, and gap penalty plus −1 for the next time.
The advantage of our method is that once gaps are introduced between fixed fields, the remaining fields will not be aligned to result in serious gap penalties. In addition, variable length fields may need to be introduced more than one gap to be aligned. In order to avoid excessive gap penalties, we only count continuous gaps once.

Extraction of State Relevant Fields
There are various fields in binary message formats with different constraints added by protocol specification, which leads to different statistical characteristics of fields. Thus, we compute and analyze their statistical features, and then find out the features we are interested in. Based on the "Variance of the Distribution of the Variances" [31] of fields in each message format, relevant features are filtered for subsequent analysis. In this paper, we make the following assumptions to find the fields related to the state.

Assumption 1. Binary protocols have the following properties:
(1) The specific logic underlies in different traffic flows of a certain protocol to ensure the stable running of transmissions; (2) State relevant fields in messages assign the common logic of protocols; (3) The value of state relevant fields is usually limited and not excessive; (4) The value distributions of state relevant fields are similar in each session; (5) The length of a state relevant field is 1 byte which can represent 256 message types.
We define the byte as the basic field unit. Based on these assumptions, we can find that there exists a certain change pattern of each byte field in the flow of a protocol. Relevant byte fields in different packets are presented in Fig. 1. The distribution of one field in different flows is similar.
What we are interested in is the low degree of variability in this statistic.

EDSM-Based Protocol State Machine Reverse
We determine state relevant fields by preprocessing. At the end of the preprocessing, each session is denoted as a sequence S i = (t 1 , . . . , t n ), and t 1 , t 2 , . . . , t n represents the message type set. In this process of state machine inference, we are aiming to achieve an acceptor machine which can recognize the target protocol in its valid sessions by analyzing the sequence of message types.

State Labeling Algorithm
Here, we use an Augmented Prefix Tree Acceptor (APTA [32]) T to build the initial state machine. In the training set, all states are assigned by "accept". Now, we cannot leverage the existing state merging algorithm to merge pairs of states directly, which would result in an overgeneralized DFA with only one single state. Therefore, we introduce and optimize a state labeling algorithm proposed in [24] to assign different labels to the states of T. In the following, we first introduce the state labeling algorithm, and then optimize it to complete state labeling.
There is a convention in network protocols that a sequence of messages must be sent before the server can execute certain actions. So, a regular expression shown in Eq. (3) is used to represent the prerequisite of a certain message.
where r, a 1 , . . . , a j are message types.
A prerequisite means the server in a state that can accept message of type m must receive another type r firstly, and follows with (a 1 | . . . |a j ) * , optionally.
The limitation of an algorithm for calculating r is that more than one value of r may be obtained after the calculation, but not each value is reasonable. For example, in SMB, the "OPEN" operation must be performed before the "WRITE" operation, and the "TREE CONN" operation must be performed before the "OPEN" operation. Therefore, we can conclude that the "OPEN" operation is likely to rely on the "TREE CONN" caused by the dependent transition.
Therefore, for each value r i , we test whether r i is the last appeared after other values in each protocol session. If such a value is found, we consider it as the final value r that we required. And if we cannot find one, the value of r is null which means no message type is required before the server can accept the message type M.
Once all prerequisites are computed, the state q of T will be labelled within the set of message types, which can serve as an input. A serious problem of the state labeling algorithm is that many message types have the same prerequisites, which results in many states having the same label, but not all of them actually should be merged. For example, in SMB, the state created by "OPEN"' operation should not be merged with the state created by the "CREATEDIR" operation. The reason is that the "WRITE" and "READ" operations have relied on the "OPEN" operation, but no operation relies on the "CREATEDIR" operation. So, we label the state "s" by a tuple: where A is an acceptable message type of s, R is a set of message types that require s, and O is the last two elements of the path from the root to s.
The expanded state labeling algorithm is shown as follows.

EDSM-Based State Merging Algorithm
An essential operation of protocol state machine reverse is the similar states merging. According to the state tree (APTA [32]) labelled by heuristics, now we can go one step further to infer an optimal DFA by merging similar states. In the field of grammar inference, obtaining the smallest DFA consistent with the labelled training set has been proved to be an NP-complete problem by Gold, and there are many exact or approximate algorithms to solve this problem.
In fact, the complexity of protocol state machine inference is always higher than general grammar inference, because the amount of protocol messages is larger and the protocol logic is more complex. Therefore, in this paper, we adopt an approximate algorithm, EDSM, to deal with the state merging of unknown protocol APTAs. The performances of EDSM [33] and Exbar are compared and analyzed in Section 4.3.
The EDSM algorithm is achieved in the red-blue frame, which is a directed graph with the following properties: (1) All nodes in the graph are labelled with red, blue or unmarked; (2) The initial root node is marked with red, and its children are marked with blue; (3) Each red node's non-red children are marked with blue; (4) Each unmarked blue node is the root of a tree; EDSM is based on a greedy strategy. It will calculate all scores of red and blue node pairs using state labels. If there exists a blue node that cannot be merged with any other red nodes, we promote it to be red. The red node and blue node with the highest score will be merged. In the process of protocol state machine reversal, we use the Breadth-First Search (BFS) strategy to modify the EDSM algorithm in order to complete the state merging of states in same depth with same behavior. The modified EDSM algorithm is shown in Algorithm 2.

Algorithm 2: EDSM-BFS
Input: red nodes of T blue_node = , target_depth = 0, merges = ϕ, promotable_blue = ϕ Traverse T with breadth first search if current node n is blue node target_depth = n.depth break Traverse T with breadth first search if current node n is blue node and n.depth <= target_depth + 1 add n to blue_node merges = sort_by_score (filter (successful, (map (try merging and then undo, cross produce (red nodes, blue nodes)))) promotable_blue = filter (appears in no merge, blue nodes) if promotable_blue! = EDSM-BFS (cons (min-depth (promotable_blue), red nodes))) else if merges != try merging (first (merges)) EDSM-BFS (red nodes) There are two places reflecting the idea of breadth-first search. The first is the selection of blue nodes to generate the candidate state merging set; and the second is the selection of promotable blue nodes. The advantage of using breadth-first search is that it can search similar states in different paths preferentially to avoid generating too many branches in the final state machine.
In order to obtain an efficient and reliable result, the focus of our algorithm is to design a reasonable scoring mechanism. In this paper, we use the state labeling method discussed in Section 3.1 to solve this problem. The scoring algorithm is shown in Algorithm 3.

Experiment Results and Analysis
We test our implementation of the proposed method on a number of stateful binary protocols (including transport layer protocol TCP (Transmission Control Protocol) and two application layer protocols SMB and DHCP (Dynamic Host Configuration Protocol)). Since the completeness and validity of the training set have a crucial impact on the quality of the state machine, and the inferred FSM cannot identify the packet not included in the training set, we try to collect as many complete protocol sessions as possible.

State Machine Inference
In this section, we apply our method to one transport layer protocol (TCP) and two application layer protocols (SMB, DHCP). It creates state machines ranged from 4 to 12 states for each protocol.

TCP.
It is an important transport layer protocol which is object oriented, reliable and based on stream of bytes. In our experiments, we connect our terminal with a router, and run Wireshark (a well-known network packet analysis software) to sniff TCP network traffic. Then, the collected traces are fed to the implemented system, and the obtained state machine is shown in Fig. 2. The three-way handshake of TCP related to State 2 is obviously visible as well as the four waves of TCP leading to State 5. The network data transmit between server and client in State 2. Besides, we can see the "RST" and "RST-ACK" packets in State 4 which are always associated with abnormal connections in the real world.

SMB.
As an instance of a relatively elaborate, stateful, binary protocol, we chose SMB to test this method. In the experiment, version 4.1.14 of the Samba software suite has been adopted to tracked the SMB daemon when this client is used to look through directories, carry out typical operations like reading, writing, and deleting directories and files. In this way, a set of 445 recorded sessions is collected. Fig. 4 shows the protocol state machine inferred from this SMB data set.
In Fig. 3, we can see the obvious login sequence leading to State 3. When the "DFS" option is enabled, the client first attaches the "IPC$" share to achieve a "DFS" referral of the requested share. Otherwise, the client immediately enters the requested share of State 6, where most file system operations (including opening, reading, writing, or closing) are available.  Figure 3: Result of SMB state machine DHCP. Dynamic host configuration protocol is an important network protocol in the local area network which mainly automatically assigns IP addresses for Internet service providers. In our experiment, we have configured a DHCP server in our local area network and traced the DHCP traffic. 168 DHCP sessions are collected. Fig. 4 shows the DHCP result state machine.
There are two login sequences (0 −> 1 −> 2 and 0 −> 2) leading to State 2 in DHCP state machine. Path 0 −> 1 −> 2 happens when the client logins the server for the first time, and path 0 −> 2 occurs when the client reboots.

Quality of Protocol State Machine
To test the performance of this system, we use the soundness and completeness to evaluate the quality of the state machines inferred from our implementation.

State Machine Completeness
Generally, we consider a state machine is complete enough if it accepts all valid sessions, and we use the recall rate to measure the completeness of that state machine. There are two methods to obtain test samples: by net trace which reflects the overall completeness of the state machine, or by constructing a reference state machine using protocol specifications which reflects the structural completeness of the state machine. Tabs. 1 and 2 show the results of each method where "IM-" refers to the relative inferred state machines and "RM-" the reference state machines.  From Tabs. 1 and 2, we can see that the recalls of both overall completeness and structural completeness are high enough closing to 100% which demonstrate that our method is useful to accept valid protocol sessions. Besides, overall completeness is a little higher than structural completeness. It is understandable that the test samples produced by the reference state machines are more complete than those by net-traces, and the test samples we used to construct the state machine are collected by net-traces.

State Machine Soundness
We consider a state machine is sound enough if it rejects all invalid protocol sessions, and we use accuracy to measure the soundness of the state machine. The result is shown in Tab. 3, where we can conclude that the state machine we inferred are not over-generalized, which means they can reject invalid sessions.

Comparative Evaluation
With merely positive examples, there exist other approaches to implement the automation inferring mission. One popular approach is the Exbar algorithm, which is an exact algorithm to infer the minimal consistent DFA. To compare the performance of the proposed method with the Exbar algorithm, we calculate the precision and recall at different numbers of SMB protocol training samples for both two algorithms. The results are shown in Figs. 5 and 6. 1) About the impact of the number of training samples. The number of training samples has a relatively large influence on the recall of these two algorithms, and the recall is rising along with the increase of training samples. Exbar performs slightly better than EDSM in recall. By using an improved state labeling algorithm in our system, the accuracy of the two algorithms reaches 100% at the number of different training samples, which avoids merging invalid state pairs. So, the state machine we inferred is not over generalized, and it is unlikely to generate protocol sessions that cannot be accepted by the reference state machine.
2) About the data processing capacity. When the number of training samples exceeds 50, Exbar will not run. EDSM is an approximate algorithm that can ensure inferring state machine in polynomial time. However, Exbar is an exact algorithm using backtracking search strategy, and tries to find an optimal result with the least states in an almost exhaustive manner. Each time Exbar is called recursively, it only tries to merge a pair of states or promote a blue node to red, and the depth of the search will continue to increase. Once a search fails, Exbar algorithm will return to the last position and choose another search direction, which may easily fall into infinite backtracking and cannot exit. Fig. 7 depicts the search path of Exbar and EDSM when inferring state machines.  In order to compare the time consumption of the two algorithms more directly, we record their time consumption and search times under different numbers of SMB training samples (see Tab. 4). It can be seen from the results that the running time of the two algorithms grows with the increase of the number of training samples. But for each test set, EDSM takes less time than Exbar. When the sample number in the test set is equal to or greater than 100, Exbar would fail to give a result while EDSM just takes a little longer time to display the results. The reason for this phenomenon can be explained by the difference in computational complexity between Exbar and EDSM. As an approximation algorithm, the time complexity of the EDSM algorithm is much lower than that of Exbar, which makes it more effective when the data size increases. Therefore, the EDSM algorithm has a better performance in the face of large data and higher real-time requirements.

Conclusion
This paper proposes a valid approach to refer the state machine of unknown binary protocols from network traces, especially when the training samples are large. Firstly, we improve the substitution matrix of multiple sequence alignments to identify variable length fields and remove them. Then, we extract state relevant fields by analyzing their statistical characteristics. To infer a minimal DFA consistent with training samples, we optimize a state labeling algorithm and apply an optimized EDSM algorithm to complete the final state merge.
In order to validate the method implemented in this paper, we test our system on three typical binary protocols: TCP, SMB and DHCP. The experimental results show that the state machine we inferred is reliable in terms of both completeness and soundness. Compared with the Exbar algorithm, the experiment results show that when Exbar totally fails, our system has better performances in processing a large number of training samples. In some application environments, such as instruction detection, malicious traffic recognition and other network protection mechanisms, the ability to handle with big data provided by EDSM algorithm will be more practical.
In the future, the method for selecting an appropriate algorithm to accomplish the state machine reconstructing needs to be studied in depth. As this article proves, EDSM performs better on large data, but has a slightly lower recall rate than Exbar. We will continue working on this topic and find an automatic mechanism to utilize the proper algorithm to obtain better protocol reverse results. Besides, combined with the present method, the Markov Model could also be considered to deal with the packet missing problem in the future.