Automatic determination of types number of mixed binary protocols

In the absence of prior knowledge, it is a challenge to determine the number of protocol types in a set of completely unknown frames. These frames may mix multiple protocols from the data link layer to the application layer. In this study, the authors combine the spectral clustering algorithm with a method for determining the number of protocol frame types, and furthermore design a refinement clustering of different hierarchical protocols based on the eigenvectors of the Laplace matrix. They use three clustering validity indices, namely the Calinski–Harabasz index, the Davies–Bouldin index and the Silhouette index, to quantify the clustering effect in order to calculate the number of protocol types. Extensive experiments on several open datasets obtain relatively satisfying results without prior knowledge and demonstrate the advantages of their methods.

• Protocol frames consist of binary data and rarely have recognisable text characters.
• Protocols are completely unknown from the data link layer to the application layer, such as those in satellite communications.
• The protocol is mixed in multiple layers. Fig. 1 illustrates an example of internet protocol where there are two kinds of protocols in layer_2 and five in layer_3.
As on the internet, the mixture of unknown binary protocols in different layers brings many difficulties to protocol reverse engineering (PRE). To achieve a reverse analysis of unknown protocols, we first need to separate the mixed protocols into independent types. Without prior knowledge, only relative clustering methods can be used to classify the mixed binary protocol frames. This part of the work faces several difficulties:
• Binary protocol frames are mixed in different protocol layers. How to distinguish them effectively has not been researched.
• It is hard to find an effective way to determine the clustering quality so as to obtain good results at each layer.
• Binary protocol frames are difficult to express in Euclidean space, so we cannot obtain the centroids of clusters. Without these centroids we can hardly calculate a clustering validity index (CVI). Generally, CVIs are used to evaluate the clustering effectiveness. They directly estimate the quality of a partition by measuring the compactness and separation of the clusters.
In this study, we propose a novel method to cluster the mixed binary protocol frames based on the spectral clustering algorithm. The main contributions are as follows:
• Affinity matrix construction is a crucial step in clustering. We use the Smith-Waterman algorithm to get the pairwise similarity values and construct the affinity matrix of binary frames.
• We use three CVIs to guide the clustering process. Some researchers have compared nearly 30 indices by different properties and characteristics [11][12][13]. We select the three best-performing indices, which are, respectively, the Calinski-Harabasz index, the Davies-Bouldin index and the Silhouette index. A method of calculating these CVIs from the affinity matrix is given, together with an intuitive explanation.
• The optimal values of the CVI are used to distinguish protocol frames in the different layers.

Related work
Recent research on the classification of unknown protocols falls into two main categories. One is based on semi-supervised learning and the other is based on unsupervised learning.
The semi-supervised method can be used to analyse some unknown protocol data. Based on Erman et al.'s study of flow correlation [14,15], multiple studies used the three-tuple {destination IP, destination Port, protocol} in training data to label unknown traffic. The main difference among these studies is that the labelled data was calculated by different clustering algorithms. Zhang et al. [16] used stream correlation and binary search to optimise parameters, thereby improving the robustness of the classifier. Lin et al. [17] used the K-means algorithm to classify labelled data. They checked whether the labelled data was contained in the classification result to determine whether there were unknown protocols. Wang et al. [18] applied three constrained variants of the K-means algorithm to merge background information by performing hard or soft constraint satisfaction and distance metric learning during clustering. Ran et al. [19] used an alternative approach to clustering unknown streams by dynamically adding cluster centres and iterating semi-supervised K-means to select optimal system parameters. However, all of these semi-supervised methods require labelled training data, and their implicit premise is that the unknown traffic and the training data share at least the same protocol at the data link layer, network layer, or even transport layer.
In the absence of prior knowledge and training data, some studies used unsupervised learning methods, such as clustering, for the classification of unknown protocol data. Clustering suffers from two main problems: model selection and appropriate grouping [20]. The determination of the cluster number is the key to the grouping method and directly affects the quality of clustering. Classical clustering methods include expectation maximisation-based [21][22][23], K-means algorithm-based [24,25], and DBSCAN algorithm-based [26] methods. These methods aggregate clusters according to the similarity of traffic statistics and then establish the mapping relationship between clusters and classes. Some new methods have emerged in recent years. Wei et al. [27] proposed the optimised affinity propagation algorithm, which focused on calculating the similarity between samples. Although they improved the Gaussian kernel function to reduce the influence of the scale parameter σ, this algorithm was essentially the same as density-based statistics. By setting the threshold in advance, samples with similarity smaller than the threshold would not be divided into the same cluster. Wang et al. [28] proposed an automatic unknown application protocol classification method by extracting the characteristics of unknown protocols and using an extended ROCK algorithm. Yu et al. [29] used service-based statistical features for cluster analysis and used the payload analysis tool (nDPI) to identify unknown protocols. There are other methods designed for malicious network traffic detection and analysis, such as the UNIDS proposed by Pedro et al. [30], which integrated subspace-based clustering, density-based clustering, and evidence accumulation. UNIDS can be used to identify unknown malware.
On the one hand, these unsupervised learning algorithms need to abstract the network traffic as vectors in a multidimensional space built from important features, such as IP addresses, ports, session time and so on, which introduces a certain amount of prior knowledge in practical applications. This also means that only Ethernet-based protocols can be classified, and the unknown part of the protocol is mainly concentrated in the application layer. For protocols that are completely unknown from the link layer to the application layer, such as satellite communications, private custom wireless communications etc., the above methods cannot determine whether there is a five-tuple as in Ethernet traffic. Since these protocols are completely unknown, it is difficult to construct feature vectors for calculation without prior knowledge. On the other hand, and most importantly, these unsupervised methods must set the number of clusters large enough to obtain high-purity clusters, and until now there has been no in-depth study on the determination of the number of clusters.
For the calculation of the number of clusters, Wen et al. used GMM [31] to obtain the clusters by continuously approximating the distribution of the original data. The number of GMM components was the number of clusters. However, the GMM tended to overfit easily; as a result, data belonging to the same cluster would be split across different clusters. Some studies used graph-based clustering methods. They focused on how to construct effective affinity matrices, such as constructing affinity matrices through representations of low-rank spaces [32,33], and calculating affinity matrices by strengthening global constraints [34]. These methods were used to calculate clusters. However, the affinity matrix solving processes were inherently time consuming and there was no clear indicator for solving the number of clusters.

Problem statement
Mixed binary communication protocol data is unstructured and the protocol frames have different lengths, so the data is difficult to represent in a unified multi-dimensional space. It is also hard to determine whether the data is convex in such a space or what its density distribution is. Spectral clustering, however, can handle non-convex data. In this study, we apply the spectral clustering method to cluster and separate the unknown mixed protocol data. The method constructs the Laplace matrix and solves for its eigenvectors to complete the clustering of the protocols. Then, three kinds of CVIs are used to evaluate the final clustering results. According to the definitions of the spectral clustering method and the CVIs, the mathematical model is established as Eq. (1), where f(k) is the computational function of the CVI, including f_CH(k), f_DB(k) and f_Sil(k) in Section 4.2. Unlike f_CH(k) and f_Sil(k), f_DB(k) has to be minimised in Eq. (1); to unify the form, we add a minus sign before f_DB(k). L is the Laplace matrix, L = D − S, where S is the affinity matrix and D is the degree matrix, which is easily obtained from S. h_1, h_2, …, h_k are the first k eigenvectors of the matrix L_sym = D^{−1/2} L D^{−1/2}. This model is a second-order constrained optimisation model. If f(k) could be expressed in terms of h_i, the second-order optimisation problem would be transformed into a first-order one and could be solved by the gradient descent method. Unfortunately, the value of f(k) is determined by the row vectors of the matrix V, which is built from indicator vectors such as {h_1, …, h_k}, rather than by its column vectors. Moreover, the row vectors must be reclassified to determine the value of f(k). Therefore, the problem is very difficult to solve directly.
In this study, we split the calculation process into two phases when using the model: first, the values of h_i are calculated; then f(k) is calculated from h_i.

Mixed unknown binary protocols classification method
In this section, we present a detailed solution to the mathematical model, which we have established. We propose an algorithm to compute the number of protocol types.

Protocol similarity measure and distance calculation
From Section 3, we can infer that the affinity matrix S is the key to calculating the indicator vectors {h_1, …, h_k}. However, S is troublesome to construct, because its elements are the similarities between binary communication protocol frames. These frames are unstructured data, and the length of each frame is usually different. Thus, they are difficult to express in a unified multidimensional space when the protocol is unknown. We cannot directly use a general calculation method (such as Euclidean distance, cosine distance, and so on) to calculate the distance between two frames. It is also difficult to extract and use frequent item sets and to characterise their features, because there are few printable characters in binary protocols. Also, the probability of occurrence of each frequent term in a frame is different. If feature vectors are used to represent the binary frames, the frames will be high-dimensional and sparse, which obscures the similarity between frames and makes clustering difficult. Therefore, we introduce the Smith-Waterman [35] algorithm to measure the similarity between unknown protocols. The Smith-Waterman algorithm is described as follows: Let A = a_1 a_2 … a_n and B = b_1 b_2 … b_m be the sequences to be aligned, where n and m are the lengths of A and B, respectively. Construct a scoring matrix F and initialise its first row and first column. The size of the scoring matrix is (n + 1) × (m + 1) and F_{k0} = F_{0l} = 0 (0 ≤ k ≤ n and 0 ≤ l ≤ m). Fill the scoring matrix using F(i, j) as shown in (2), where σ(a, b) is the similarity score of the elements that constitute the two sequences. Finally, starting at the highest score in the scoring matrix F and ending at a matrix cell that has a score of 0, trace back recursively along the source of each score to generate the best local alignment.
Thus the affinity matrix S ∈ ℝ^{n × n} of the protocol frames can be obtained with the Smith-Waterman algorithm.
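The scoring step can be illustrated with a minimal Python sketch. It fills the Smith-Waterman scoring matrix row by row and keeps only the best local score, without the traceback. The scoring values (`match`, `mismatch`, `gap`) and the normalisation by the larger self-alignment score are assumptions made for illustration; the text does not specify them.

```python
def smith_waterman_score(a: bytes, b: bytes,
                         match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local-alignment score between two byte sequences (no traceback)."""
    m = len(b)
    prev = [0] * (m + 1)          # row i-1 of the scoring matrix F
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (m + 1)      # row i; column 0 stays 0
        for j in range(1, m + 1):
            sigma = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,
                          prev[j - 1] + sigma,   # align a_i with b_j
                          prev[j] + gap,         # gap in b
                          curr[j - 1] + gap)     # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


def affinity_matrix(frames):
    """Symmetric matrix of scores, normalised to [0, 1] (assumed normalisation)."""
    n = len(frames)
    self_scores = [smith_waterman_score(f, f) for f in frames]
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            raw = smith_waterman_score(frames[i], frames[j])
            denom = max(self_scores[i], self_scores[j])
            S[i][j] = S[j][i] = raw / denom if denom else 0.0
    return S
```

For example, `affinity_matrix([b"\x01\x02\x03\x04", b"\x01\x02\x03\x05", b"\xff\xfe"])` gives a similarity of 0.75 between the first two frames and 0 between either of them and the third. Each pairwise score costs time proportional to the product of the two frame lengths, so this step dominates the run time for large frame sets.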
In order to quantify the clustering effect, most CVI calculations need the data centroid, but it is difficult to calculate the centroid of protocol frames when they are not expressed as points in a multidimensional space. Therefore, the different indices in f(k) are calculated from a distance matrix and a dot product matrix instead. Spectral clustering uses an affinity matrix, which is typically obtained by converting distance into similarity with a radial basis function. This conversion requires parameters to be set in advance and affects the clustering results. In this study, the Smith-Waterman algorithm is used to calculate the similarity of frames and generate the affinity matrix. Since the dot product can also express a similarity measure, the affinity matrix is regarded directly as a dot product matrix.
Let X = {x_1, x_2, …, x_n}, where each x_i represents a protocol frame. To facilitate the conversion between the distance matrix and the affinity matrix, let x_i be a vector in a multidimensional space whose length is fixed to the constant value 1/2. Let s_ij = x_i^T x_j; the similarity measure can then be described as in (3) and (4), where s_ij is the dot product of x_i and x_j, and the affinity matrix is formed by the s_ij.
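The conversion between a dot product matrix and a distance matrix rests on the identity ‖x_i − x_j‖² = s_ii + s_jj − 2 s_ij. The following Python sketch applies that general identity; it is not a reproduction of the paper's own equations, whose constants depend on the fixed vector length chosen above. The function name `gram_to_distance` is hypothetical.

```python
import math


def gram_to_distance(S):
    """Distance matrix with d_ij = sqrt(s_ii + s_jj - 2 * s_ij).

    Negative values caused by rounding, or by an affinity matrix that is
    not exactly a dot product matrix, are clipped to zero (an assumption).
    """
    n = len(S)
    return [[math.sqrt(max(0.0, S[i][i] + S[j][j] - 2.0 * S[i][j]))
             for j in range(n)]
            for i in range(n)]
```

When every diagonal element s_ii has the same value, as it does when all vectors have the same length, the squared distance is simply a constant minus 2 s_ij, so larger similarity always means smaller distance.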

Clustering validity indices
Almost all CVIs are calculated from intra-class and inter-class distances together with a number of other factors, such as square error, statistical attributes etc. We determine the number of clusters by the computational function f(k) of the CVI. However, the three CVIs mentioned in Section 1 need the cluster centroids. For the special data in this study, where the centroids are difficult to obtain, we use the dot product to calculate the CVIs and give an intuitive explanation of the three CVIs.

Calinski-Harabasz index
The first CVI is the Calinski-Harabasz index, whose expression is shown in (5), where S_B is the inter-class scatter matrix and S_W is the intra-class scatter matrix. They can be described as in (6) and (7), where Σ is the covariance matrix of all samples as shown in (8). The affinity matrix is a symmetric matrix of size n × n and can be drawn as a square. Suppose the protocol frames have been clustered into k classes and n_1, n_2, …, n_k are the numbers of samples in each class. If the rows and columns of the affinity matrix are reordered after clustering so that frames of the same type are adjacent, the matrix can be displayed graphically as shown in Fig. 2. The pseudo area of a block in the figure is the sum of the affinity matrix elements covered by it, as shown in (11). For example, the pseudo area of the block in the top left corner of the figure is the sum of the affinity elements of the first class, which has n_1 frames. By the same principle, a line segment in the figure represents the sum of the elements covered by it, such as line i, where s_ij is the similarity measure (that is, the dot product) between the ith and jth frames. The first term in (9) is the sum of the diagonal terms (Fig. 2a), and the second term is the weighted sum of the cluster areas (Fig. 2b), where the weights are the reciprocals of the cluster sizes, 1/n_1, 1/n_2, …, 1/n_k. The first term in (10) is the second term in (9), and the second term is the sum of all elements in the affinity matrix with a weight of 1/n (Fig. 2c).
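The block-sum description above can be sketched in Python as follows. The sketch uses the standard Calinski-Harabasz form, [tr(S_B)/(k − 1)] / [tr(S_W)/(n − k)], and computes both traces from the dot product matrix alone, so no centroid is needed. The function name and the handling of the degenerate case tr(S_W) = 0 are assumptions.

```python
def calinski_harabasz_from_gram(S, labels):
    """Calinski-Harabasz index computed only from a dot product matrix S."""
    n = len(S)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= k < n")

    diag_sum = sum(S[i][i] for i in range(n))              # sum of diagonal terms
    total_area = sum(sum(row) for row in S)                # area of the whole square
    weighted_blocks = sum(                                 # cluster areas, weight 1/n_c
        sum(S[i][j] for i in members for j in members) / len(members)
        for members in clusters.values()
    )

    tr_sw = diag_sum - weighted_blocks                     # intra-class scatter
    tr_sb = weighted_blocks - total_area / n               # inter-class scatter
    if tr_sw <= 0:
        return float("inf")                                # perfectly compact clusters
    return (tr_sb / (k - 1)) / (tr_sw / (n - k))
```

As a check, the one-dimensional points 0, 1, 10, 11 with S_ij = x_i x_j and labels [0, 0, 1, 1] give tr(S_W) = 1 and tr(S_B) = 100, hence an index of 200, which matches the value obtained from the centroids directly.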

Davies-Bouldin index
The second CVI is the Davies-Bouldin index, whose expression is shown in (12), where S_i represents the intra-class dispersion when q = 2 is specified. According to the calculation rule of (4), we can obtain (13) and (14), where A_i and A_l are the centroids of the ith and lth classes, respectively, the sum of s_rs over r, s = 1, …, n_i is the area of the ith cluster indicated in Fig. 2b, and the sum of s_yz over frames y in the ith cluster and z in the lth cluster is the area between the two clusters indicated in Fig. 2d.
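A minimal Python sketch of this calculation follows, assuming the standard Davies-Bouldin definition with q = 2: the dispersion of a cluster is the root-mean-square distance to its centroid, and the separation is the distance between centroids. Both are expressed through block means of the dot product matrix, so the centroids A_i never have to be formed. The function name is hypothetical.

```python
import math


def davies_bouldin_from_gram(S, labels):
    """Davies-Bouldin index (lower is better) from a dot product matrix S."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    groups = list(clusters.values())
    k = len(groups)
    if k < 2:
        raise ValueError("the index needs at least two clusters")

    def block_mean(rows, cols):
        # Mean of the block of S covered by rows x cols (area / size).
        return sum(S[r][c] for r in rows for c in cols) / (len(rows) * len(cols))

    centroid_sq = [block_mean(g, g) for g in groups]       # ||A_i||^2
    scatter = [
        math.sqrt(max(0.0, sum(S[r][r] for r in g) / len(g) - centroid_sq[i]))
        for i, g in enumerate(groups)
    ]

    total = 0.0
    for i in range(k):
        worst = 0.0
        for l in range(k):
            if l == i:
                continue
            sep_sq = centroid_sq[i] + centroid_sq[l] - 2.0 * block_mean(groups[i], groups[l])
            sep = math.sqrt(max(0.0, sep_sq))              # ||A_i - A_l||
            ratio = (scatter[i] + scatter[l]) / sep if sep > 0 else math.inf
            worst = max(worst, ratio)
        total += worst
    return total / k
```

For the points 0, 1, 10, 11 with S_ij = x_i x_j and labels [0, 0, 1, 1], each cluster has dispersion 0.5 and the centroids are 10 apart, so the index is 0.1. The double loop over cluster pairs is the reason this index costs more than the other two.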

Silhouette index
The third CVI is the Silhouette index, whose expression is shown in (15). Again using the calculation rule of (4), we can obtain a_i and b_i as in (16) and (17). Let us consider Fig. 2 as a distance matrix. The red part of the ith row indicates the distances between frame i and the other frames in the same cluster; a_i is the mean of the elements of the red part. The black parts indicate the distances between frame i and the frames in other clusters; b_i is the minimum, over the other clusters, of the mean distance to that cluster.
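The description of a_i and b_i can be sketched in Python as follows. The sketch assumes the standard Silhouette definition, the mean over all frames of (b_i − a_i) / max(a_i, b_i), and takes a distance matrix as input. Treating a frame that is alone in its cluster as contributing 0 is a common convention and an assumption here.

```python
def silhouette_from_distance(D, labels):
    """Mean silhouette value computed from a distance matrix D."""
    n = len(D)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    total = 0.0
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue                                       # singleton contributes 0
        # a_i: mean distance to the other frames of the same cluster
        a_i = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        # b_i: smallest mean distance to the frames of any other cluster
        b_i = min(
            sum(D[i][j] for j in members) / len(members)
            for lab, members in clusters.items() if lab != labels[i]
        )
        denom = max(a_i, b_i)
        total += (b_i - a_i) / denom if denom > 0 else 0.0
    return total / n
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below indicate overlapping clusters or frames assigned to the wrong cluster.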

Multi-layer refinement classification
The majority of communication protocols are designed in multiple layers, such as the five-layer protocol of the internet. In this study, protocol frames were obtained from the link layer, so each frame contains several layers of the protocol, which makes classification very difficult and calls for a hierarchical classification. The purpose of classification is to supply PRE with data of the same type. Since there is no prior knowledge, it is impossible to obtain data that matches the protocol specification exactly. We therefore design a refinement classification that yields high-quality data clusters and thus reduces the input error in PRE.
The data spans the link layer to the application layer, and may even include a physical layer for data synchronisation. Usually, the higher layers contain the most protocol types and the lower layers the fewest. For example, in the internet protocol suite (Fig. 1), there is one kind of protocol in layer_1 but five in layer_3. However, determining the number of classes is a hard problem: because layer_1 and layer_2 are present, some frames are likely to be placed in the same class even though their layer_3 protocols differ. We propose a multi-layer optimal clustering algorithm for this problem.
Step 1: Given a set of binary frames X = {x_1, x_2, …, x_n}, we compute the pairwise similarity values of the binary frames as described in Section 4.1 to obtain the affinity matrix S. The element in the ith row and jth column of S is denoted by s_ij.
Step 2: After the affinity matrix is obtained in Step 1, we cluster the protocol frames based on the spectral clustering algorithm. First, L_sym is computed by the normalised transformation of the matrix S. Then the first k eigenvectors of L_sym are solved for and assembled as the columns of a new matrix V. Finally, the K-means algorithm is used to cluster the row vectors of V, which gives the clustering result of one iteration.
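Step 2 can be sketched with NumPy as follows. `spectral_embedding` forms L_sym = D^{−1/2}(D − S)D^{−1/2} and returns the eigenvectors of the k smallest eigenvalues as the columns of V. The `kmeans` helper is a plain Lloyd iteration with a deterministic farthest-point start; it is a stand-in chosen to keep the example self-contained, not the specific K-means variant used in the experiments. The rows of V are not normalised, since the text does not mention that step.

```python
import numpy as np


def spectral_embedding(S, k):
    """Return the (n, k) matrix V whose rows are clustered by K-means."""
    S = np.asarray(S, dtype=float)
    degree = S.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))    # guard against zero degree
    L = np.diag(degree) - S
    L_sym = inv_sqrt[:, None] * L * inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(L_sym)                # eigenvalues in ascending order
    return eigenvectors[:, :k]


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with farthest-point initialisation (illustrative)."""
    points = np.asarray(points, dtype=float)
    centres = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(points[int(np.argmax(dist))])
    centres = np.array(centres)

    labels = None
    for _ in range(iterations):
        dist = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                          # assignments stopped changing
        labels = new_labels
        for c in range(k):
            if np.any(labels == c):
                centres[c] = points[labels == c].mean(axis=0)
    return labels
```

A call such as `kmeans(spectral_embedding(S, k), k)` then yields one label per frame for a given k. Repeating it for each candidate k provides the partitions on which the CVIs are evaluated.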
Step 3: In order to determine k automatically, we use the CVIs described in Section 4.2 to help determine the number of clusters in the spectral clustering process. The three CVIs are weighted (denoted by f_mul(⋅)) to determine the best CVI value and reduce the one-sidedness of a single index. If the clustering of the feature vectors is continued over increasing k, the CVI will inevitably have multiple local optima, which mark the clusterings of different layers. This is how the goal of multi-layer classification is achieved. Since CVI values are discrete, we borrow the concepts of stagnation point and inflection point from continuous functions to judge the CVI and determine k as in (18). The different clustering layers are determined by the queue K, where k ∈ K.
≤ λ and f_mul(k) ≥ f_mul(k − 1). (18)
Therefore, we designed a multi-layer optimal clustering method as shown in Algorithm 1 (see Fig. 3), where α, β, and γ are the weights of the three CVIs, λ is a balance parameter used for determining the inflection point, and c is the maximum possible value of k.
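The selection rule can be illustrated with a short Python sketch. Two points are assumptions rather than the method as published. First, the left-hand side of condition (18) is taken to be f_mul(k + 1) − f_mul(k), so that k is accepted when the combined index does not fall at k and gains at most λ at the next k. Second, `combine_cvis` uses a plain weighted sum with a minus sign on the Davies-Bouldin term; any rescaling of the three indices before weighting is not specified in the text.

```python
def combine_cvis(ch, db, sil, alpha=0.5, beta=0.3, gamma=0.2):
    """f_mul for one k: weighted sum, Davies-Bouldin negated (lower is better)."""
    return alpha * ch - beta * db + gamma * sil


def select_layer_ks(f_mul, lam=0.2):
    """Return the queue K of cluster numbers that satisfy the assumed rule.

    f_mul maps each candidate k (consecutive integers up to c) to its
    combined CVI value.
    """
    ks = sorted(f_mul)
    queue = []
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        not_worse_than_before = f_mul[k] >= f_mul[prev_k]
        stagnates_after = f_mul[next_k] - f_mul[k] <= lam  # assumed left side of (18)
        if not_worse_than_before and stagnates_after:
            queue.append(k)
    return queue
```

With the hypothetical values {2: 0.1, 3: 0.5, 4: 0.9, 5: 0.6, 6: 0.7, 7: 1.5, 8: 1.4} and λ = 0, the queue is [4, 7]: the first entry would give the cluster number of the first clustering and the later entry a finer layer.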
The main computational process of Algorithm 1 (Fig. 3) is divided into four parts: the calculation of the affinity matrix, the eigenvectors of the symmetric matrix, the clustering of the row vectors, and the three CVIs, among which the runtime complexity of the Davies-Bouldin index is higher than that of the other two. The runtime complexity of Algorithm 1 (Fig. 3) can therefore be obtained from these four processes as shown in Table 1. Here, n is the number of frames, l is the average length of frames, m is the rank of the affinity matrix, d is the average number of non-zero elements in a row of the matrix, k is the number of clusters and s is the number of iterations. The complexity of Algorithm 1 (Fig. 3) is thus O((l^2 + k^2 + d)n^2 + ksn). This complexity is relatively high and is dominated by the similarity calculation. However, this study is aimed at the binary data frames of unknown protocols, for which the similarity calculation must be accurate, so the costly Smith-Waterman algorithm is acceptable. Reducing this cost is the subject of our follow-up research.

Dataset
In order to verify the validity of our algorithm, we obtained real-world data from pcapr [36], an open web site that provides packets for various protocols. We used six datasets for testing, as shown in Table 2. Each dataset contains several protocols. We use Wireshark, an open-source tool for capturing and analysing protocol data, to parse and view our data as a benchmark. In the testing process, we treated all data as completely unknown. The experimental results are compared with the benchmark provided by Wireshark to verify the validity of our algorithm. This study mainly focuses on the classification of various protocol frames and does not infer the type name of the protocol, so the test results are only compared with the classification results indicated by Wireshark.

First clustering
To determine the parameters, we set c = 14. This value is not obtained from prior knowledge. However, it can be estimated from the size of the protocol packets, because a small packet should not contain a large number of protocol types. In order to calculate f_mul(k), we set α = 0.5, β = 0.3, γ = 0.2 and λ ≤ 0.2. For the datasets in Table 2, Figs. 4-6 show how the first cluster number is determined. The results show that the correctness of the first determined cluster number is 100%. In Fig. 4, the CVI, f_mul(k), increases or decreases monotonically before the best clustering result is reached, and the first cluster number is determined at the global optimum. In Fig. 5, the number of clusters is determined at the first local optimum. Both of these cases correspond to λ ≤ 0. In Fig. 6, the CVI values of adjacent cluster numbers are compared, and the first cluster number is determined at an inflection point with λ > 0. In every case, the first cluster number can be obtained accurately from the CVI values of the different cluster numbers.
After determining the number of clusters for the first time, it is important to verify the quality of the clusters in detail. We validate our algorithm by the ratios of accuracy and recall, which are used to evaluate most classification methods. In our algorithm, the frames identified as the same type of protocol by Wireshark should also be placed in the same cluster. Fig. 7 compares the results of our algorithm with the benchmark provided by Wireshark. We find that the ratios of accuracy and recall are above 93% for most data after the first clustering of our algorithm. Some clustering results are inconsistent with the benchmark, such as those for DarkComet. The 26th and 28th frames are HTTP frames, but they are clustered together with the TCP frames. HTTP is an application protocol carried on top of TCP, and HTTP frames can easily be clustered into the set of TCP frames because the similarity among frames is calculated from the link layer to the application layer. The data of the preceding data link, network and transport layers therefore affects the result.

Second clustering
Owing to the layered design of protocol families, our algorithm determines the types of the low-layer protocols in the first clustering. In further clustering, we can obtain more detailed protocol types at higher layers. The process of the two clusterings is illustrated in Fig. 8. Suppose that the first clustering divides the data into three types, A, B and C. The data within class A differs in similarity, so the second clustering process divides A into A_1 and A_2. The differences among the data in B and in C, however, are relatively small, so these classes remain unchanged. We show the results of the second clustering for Binnettravler and DarkComet in Table 3. We can see that the frames are subdivided into more high-layer protocol types by the second clustering. Our datasets are mixed with a variety of frames, whose lengths and payloads are not the same. In the second clustering, some frames are further subdivided, and other frames keep the same type. Because the similarity values of the latter are relatively large, they keep the same type from the bottom layer to the upper layers of the protocol. No new erroneous clusters were introduced, so the protocol clustering maintains its hierarchical characteristics. We also find that the numbers of protocol types after the second clustering are correct (seven, as shown in Fig. 5a, and 12, as shown in Fig. 6a), which is consistent with Table 3.

Conclusion
This study aims to solve the problem of multi-layer clustering of mixed binary protocol frames. We use the spectral clustering method combined with CVIs to establish a second-order optimisation model of the problem. Although this model is difficult to convert to a first-order optimisation model, this study gives a feasible method of calculating it. We then give a concrete way of computing the CVIs for protocol data in the absence of centroids, together with an interpretation that facilitates the calculation. Based on that, a clustering algorithm for mixed binary protocol frames is proposed, which achieves multi-layer clustering of frames without any prior knowledge. The experimental results show that our algorithm works well. Through multi-layer clustering, high-quality and highly similar data sets can be provided for format inference, which is another important task of PRE.