Scalable and Fast Algorithm for Constructing Phylogenetic Trees With Application to IoT Malware Clustering

With the development of IoT devices, there is a rapid increase in new types of IoT malware and variants, causing social problems. The malware’s phylogenetic tree has been used in many studies for malware clustering or better understanding of malware evolution. However, when dealing with a large-scale malware set, conventional methods for constructing a phylogenetic tree is very time-consuming or even cannot be done in a realistic time. To solve this problem, we propose a high-speed, scalable phylogenetic tree construction algorithm with a clustering algorithm to cluster it. The proposed method involves the following steps: (1) Calculating the similarity of the specimen pairs using the normalized compression distance. (2) Creating a phylogenetic tree containing all specimens, instead of calculating the similarity of all pairs of a specimen, our algorithm only calculates a small part of the similarity matrix. (3) Dividing the phylogenetic tree into clusters by applying the minimum description length criterion. In addition, we propose a new online processing algorithm to add new malware specimens into the existing phylogenetic tree sequentially. Our goal is to reduce the computational cost of constructing the phylogenetic tree and improve the clustering accuracy of our previous research. We evaluated our method’s clustering accuracy and scalability with 65,494 IoT malware specimens. The results showed that our algorithm reduced the computation by 97.52% compared with the conventional method. Our clustering algorithm achieved accuracies of 95.5% and 99.3% for clustering family name and architecture name, respectively.


I. INTRODUCTION
IoT malware has seen rapid proliferation in recent years, and a large amount of malware is newly spread every day. Honeypots and anti-malware platforms (such as VirusTotal 1 ) collect large amounts of malware samples to keep track of popular malware trends. By analyzing these malware samples, it is possible to know the threat of malware to cyberspace.
The associate editor coordinating the review of this manuscript and approving it for publication was Easter Selvan Suviseshamuthu . 1 https://www.virustotal.com To analyze the massive amount of malware specimens, efficient malware analysis methods are required.
Clustering is a compelling method to analyze these large-scale malware sets efficiently. In this paper, we focus on constructing a phylogenetic tree for malware clustering. We have a particular interest in malware's phylogenetic tree because it can not only cluster malware but also investigate the evolutionary relationships of malware specimens. However, conventional methods for constructing a phylogenetic tree are very time-consuming when facing a large malware set. Therefore, Our goal is to achieve automatic clustering from a large-scale malware specimen set by constructing a phylogenetic tree. Our method is outlined as follows: we first calculate the distances between malware specimens and then use these distances to create a phylogenetic tree. Finally, we divide the phylogenetic tree into clusters.
We use the normalized compression distance (NCD) to measure the similarity (distance) between execution-type malware binaries. NCD does not require domain knowledge and can be applied to many file formats. NCD measures the distance based on the similarity of two malware's binary sequences. The higher the similarity between the two binary sequences, the smaller the NCD [6], [8]. Specimens of the same IoT malware family are often created based on the same source code with minor modifications [2]. Therefore, these specimens have high binary similarities and can be expected to be divided into close clusters.
A typical method for creating a phylogenetic tree is the neighbor-joining method [26]. The neighbor-joining method takes a distance matrix as an input and outputs a phylogenetic tree with tree distance approximating the input distance matrix. However, the neighbor-joining method requires calculating all pairs of distances, which means (N 2 +N )/2 times of compression attempts in our case. Here, N is the number of data samples. The computation time of the NCD matrix is problematic when N is large. Therefore, our previous work proposed a scalable method for clustering malware by constructing a phylogenetic tree [12]. 2 Our previous method consisted of a fast phylogenetic tree construction algorithm and a clustering algorithm. By calculating only a tiny part of the distance matrix, we achieved good scalability while maintaining the accuracy of clustering. Instead of creating large phylogenetic trees all at once, our basic idea was to create small phylogenetic trees and combine them into one big phylogenetic tree. Using the abovementioned algorithm, we can create a phylogenetic tree with far fewer NCD computations than (N 2 + N )/2 times. Clustering was achieved by appropriately dividing the phylogenetic tree into subtrees. We applied our method and constructed a phylogenetic tree of 4,109 IoT malware specimens in [12].
But when we applied our previous method to a much larger, multi-architecture malware set, the clustering accuracy dropped significantly. Hence, in this paper, we applied a new clustering algorithm: the minimum description length (MDL) criterion as the division criterion [25]. Based on the information criterion, the MDL Criterion decides whether to divide a cluster (subtree) into smaller clusters. We not only improved the clustering accuracy but also used a much larger dataset containing 65,494 malware specimens to evaluate our method's scalability. Furthermore, considering that new specimens are added to the dataset daily or weekly, reconstructing the phylogenetic tree every time new specimens are collected is time-consuming and resource-intensive. Therefore, we also proposed an online processing algorithm that directly adds 2 This paper was partly presented at the International Conference on Neural Information Processing (ICONIP), 2019 [12]. The clustering algorithm and experiments were updated. new specimens to the phylogenetic tree without fully reconstructing it.
Our contributions are threefold: • We present a scalable and fast algorithm to construct phylogenetic trees, which significantly reduces the computational cost.
• We present a scalable clustering algorithm applying the MDL Criterion. 95.5% and 99.3% accuracies of clustering family name and architecture name were achieved, respectively.
• We present an online processing algorithm that successively adds new specimens into the phylogenetic tree while maintaining the clustering accuracy.
The rest of this paper is organized as follows. The second section introduces the related studies. The third section introduces existing methods we applied in our study. The fourth section presents our newly proposed algorithm, i.e., clustering algorithm and online processing algorithm. In the fifth section, we discuss the application of our algorithm with 65,494 specimens of IoT malware. In particular, we show how much we reduce the computational cost and the extent to which we improve the clustering accuracy. Finally, the last section discusses the limitations and directions for future research.

II. RELATED WORK A. BINARY-LEVEL STATIC ANALYSIS
We applied NCD for feature extraction that directly calculates the similarities between malware's byte sequences. Details about NCD will be introduced in subsection III-A. Other binary-level feature extraction methods include N-grams [16], [23], [35], entropy-based [11], [22], and image-based techniques [5], [7], [20], which convert a file's entire sequence of bytes into a picture, where each byte represents the grayscale of a pixel.
Compared to N-gram, N-gram lacks long-term dependence, because it only considers the N-1 bytes before the current byte. Moreover, in practice, N is usually ranged from 2 to 8 due to the computation complexity [16]. On the contrary, NCD's feature extraction is based on the compression algorithm. In our method, we chose Lempel-Ziv-Markov chain algorithm (LZMA) as the compression algorithm. Markov chain algorithm has much longer-term dependence than N-gram so that Markov chain algorithm can match more same patterns in the bytes sequences. Furthermore, N-gram can only use N-byte words as the dictionary, but LZMA has a more flexible dictionary.
In the entropy-based feature extraction, information in the bytes sequences will lose while converting the byte sequences into entropy. Different byte sequences can result in the same entropy value.
The image-based method converts byte sequences to an image. Each byte represents the grey scale [0-255] of a pixel. The image can later be used as input for Neural Networks or other methods. The advantage of this method is its robustness [1]. Our method and N-gram match exactly the same pattern between byte sequences, while Neural Networks can deal with minor alterations.

B. PHYLOGENETIC TREE FOR MALWARE CLUSTERING
Many studies have applied the phylogenetic tree to cluster malware samples and help the analyst to better understand the evolutionary relationships between malware. But almost all of these studies analyzed small malware sets. Because their methods require the computation of the whole distance matrix, the O(N 2 ) complexity makes it impossible to apply these methods to a large malware set. A summary table of related study constructing a phylogenetic tree is shown in Table 1.
Karim et al. [17] proposed a method to calculate the distance between all malware pairs using n-perms and created a phylogenetic tree. Wehner [33] also used NCD to calculate the distance between 790 malware samples. They did not mention their method for constructing the phylogenetic tree, but they noted that they calculated all pairs of malware's distance. In [14], Hsiao et al. extracted malware features based on dynamic analysis, and the unweighted pair group method with arithmetic mean was used to construct their phylogenetic tree. The size of the dataset was 1,200. Vinod et al. [32] used IDA, a binary code analysis tool for reverse engineering, to transfer executable malware to assembly code and calculated the similarity based on it. They used the neighbor-joining method to construct the phylogenetic tree. Their experiment contained 1,200 malware specimens. Bailey et al. [3] proposed a representative automatic clustering method based on the dynamic analysis of malware. Using the dynamic analysis log of 3,700 Windows malware samples as input, the NCD matrix between the operation log data of all the samples was calculated, and then a phylogenetic tree was created using the shortest distance method. In addition, they proposed a clustering criterion called Inconsistency Coefficient and performed hierarchical clustering based on it. We also used it in previous studies. However, there was a problem in that the clustering accuracy dropped significantly for a large-scale, multi-architecture phylogenetic tree. Therefore, in this study, we improved the clustering method and performed clustering using the MDL Criterion. All these studies calculated all pairs of malware distance, because any O(N 2 ) algorithm does not scale well, they are not suitable for constructing large-scale phylogenetic trees.
To the best of our knowledge, [9] is the only related study on the construction of large-scale phylogenetic trees. Cozzi et al. separately constructed phylogenetic trees for different architectures, and the largest one contains 36 thousand malware samples. They used hierarchical navigable small world graphs (HNSW) to reduce the distance computation significantly. Compared to their study, our phylogenetic tree almost doubles their size.

C. LARGE-SCALE MALWARE CLUSTERING
Although there is only one related study about large-scale phylogenetic tree construction, there exist several approaches to reduce the computational cost of large-scale malware clustering. A summary table of large-scale malware clustering is shown in Table 2.
Bayer et al. [4] proposed a large-scale clustering method based on dynamic analysis. They used the malware dynamic analysis log as input for clustering and performed clustering by calculating approximately 2% of the distance matrix without calculating the distance between all data pairs using the locality sensitive hash (LSH). However, like Bailey et al.'s method, dynamic analysis requires several minutes to run each malware specimen. By contrast, our static analysis method does not require the malware to be executed.
Oliver et al. [21] proposed a clustering method based on a static analysis that uses trend locality sensitive hashing (TLSH) to transfer every malware binary into fixed-length hash digests. They then clustered all digests into clusters using a clustering method called HAC-T within O(N log N ) time. Torabi et al. [31] proposed a strings-based method to analyse and cluster 49,272 IoT malware. They extracted useful strings and calculated its similarities using Jaccard and overlap similarity coefficients. Moreover, they applied ClusterONE [34] algorithm to perform clustering analysis. Rieck et al. [24] and Hu et al. [15] took a different approach: a prototype-based clustering algorithm that reduces runtime complexity by performing clustering only on representative samples (prototypes). The remaining malware specimens are associated with their closest prototype in the feature space. Dam et al. [10] also took a very different approach: they use IDA pro to extract system call graph for each malware and transfer graphs into vectors applying LSTM.

III. PRELIMINARIES
In this section, we introduce two existing methods we applied in our study.

A. NORMALIZED COMPRESSION DISTANCE
In this study, we used NCD to measure the similarity (i.e., distance) between IoT malware binaries. NCD is an information-theoretic measure of the similarity between two objects [19]. The similarity is calculated based on the compression rate of the objects. NCD measures the similarity of objects regardless of their format or structure, e.g., documents, pictures, programs, music, etc.
For two objects x and y, NCD is defined as follows: where C(x) is the length of x that is compressed by compression program C, and xy is the concatenation of objects x and y. When compressing xy, the compression program will first compress x and then use the information of x to compress y. Thus, the more similar x and y are, the higher is the compression rate of xy. This means that the value of C(xy) will be similar to that of C(x) or C(y) and result in the NCD value close to zero.

B. NEIGHBOR-JOINING METHOD
The neighbor-joining method [26] is an algorithm for creating a phylogenetic tree T using a distance matrix d defined on a finite set L as input. Every leaf of the phylogenetic tree represents an object of L. The tree distance between two nodes of T is defined by the total branch length of the path between the two nodes. The tree distance is an approximation of the input distance matrix d.
In the neighbor-joining method, two nodes i and j that minimize d ′ ij , defined below, join a newly created node p.
where d ij is the (i, j) entry of the given distance matrix d.
Nodes i and j are deleted from the set L, and node p is added branch length of{u, p} is defined as: branch length of {v, p} is defined as: instead. This step is repeated until all the nodes are linked. The pseudocode of the neighbor-joining method is shown in Algorithm 1, where D is the set of branch lengths.

IV. PROPOSED METHOD
In this section, we introduce the fast algorithm for constructing a phylogenetic tree, the new clustering algorithm applying the MDL Criterion and the online processing algorithm.

A. FAST ALGORITHM FOR CONSTRUCTING A PHYLOGENETIC TREE
The neighbor-joining method requires a complete distance matrix to construct a phylogenetic tree. The O(N 2 ) computational cost is a problem when the dataset is large. Therefore, we proposed a fast and scalable algorithm that only needs to calculate a small part of the distance matrix to construct a phylogenetic tree.
The algorithm is outlined as follows. The schematic diagram and the flow chart are shown in Fig. 1 and Fig. 2, respectively. First, the algorithm randomly selects a seeds set S ⊂ L (|S| = k, |L| = N ) with k (k ≪ N ). Then, it calculates the distance matrix between S and L, and, using the distance matrix over S × S, creates a phylogenetic tree T using the neighbor-joining method. For each element z in L \ S, using the distance matrix over (L \ S) × S, it links it with the leaf e of T , which is nearest to the element z.
To increase the approximation accuracy of the tree distance between set L \ S, it recalculates the tree distance recursively for each Z (e). If |Z (e)| < h, it calculates the distance matrix over Z (e) ∪ {e} and creates a phylogenetic tree T Z (e) with it. It then combines T Z (e) and T . Here, h is a predefined threshold. e is included in the recomputation to know which part of T Z (e) corresponds to T . If |Z (e)| > h, then it recursively uses this algorithm for Z (e) ∪ e. The pseudocode of the phylogenetic tree construction algorithm is shown in Algorithm 2, where ∂T denotes a set of all leaves of T .  Instead of calculating all pairs of distances, only the colored parts in Fig. 1 are calculated in Algorithm 2. As a randomized algorithm, Algorithm 2's computational cost is O(N 2 ) in the worst case. But in usual cases, the computational cost is reduced from O(N 2 ) to O(N log N ). Details of the computational cost are presented in the appendix.

B. CLUSTERING ALGORITHM
Since the burden on the analyst cannot be reduced by simply constructing the phylogenetic tree, it is necessary to divide the phylogenetic tree into appropriate clusters. In our previous study, we achieved a good clustering accuracy by using the clustering method based on the Inconsistency Coefficient [3], but when it was applied to a larger, multi-architecture dataset, performance decreased considerably. To solve this problem, we changed the clustering method to MDL Criterion [25].
The MDL Criterion is a model selection criterion based on information theory [25]. MDL Criterion is defined as below: (1)

Algorithm 2 Fast Algorithm for Constructing a Phylogenetic Tree
Require: finite dataset L, size k of seeds set, threshold h Ensure: phylogenetic tree T 1: choose a certain seeds set S ⊂ L with |S| = k 2: calculate the distances d ij for (i, j) ∈ S × L 3: create a phylogenetic tree T for S by the Neighbor-joining method using d ij 4: for e ∈ ∂T do 5: Z (e) = ∅ 6: end for 7: for z ∈ L\S do 8: Z (e) = Z (e) ∪ {z} where e is nearest to z 9: end for 10: for e ∈ ∂T do 11: if |Z (e)| > h then 12: recursively use Algorithm 2 for Z (e) ∪ {e} 13: end if 14: if 1 < |Z (e)| < h then 15: calculate d ij for (i, j) ∈ (Z (e) ∪ {e}) 2 and create a phylogenetic tree T Z (e) with it. 16: replace the corresponding parts of T with T Z (e) 17: end if 18:

end for
where L is the likelihood function, θ is the maximum likelihood estimate, m j is the number of dimensions of θ, n is the number of data samples. The description length includes the description length of the model and the description length of the data when the model is given. When using a complicated model, the data description length can be short, but the description length of the model becomes long. When using a simple model, the description length of the model is short, but the description length of the data becomes long. An appropriate model should be selected by balancing both and minimizing the total description length.
The outline of the clustering algorithm is shown below. First, given one cluster C, it is assumed that the data x contained in the cluster follows a normal distribution, and the description length (DL) of the cluster is calculated by the following.
Here, L is the likelihood function, θ is the maximum likelihood estimate, n is the number of data samples in the cluster, and m j is the number of dimensions of the parameter.
In our model, this equation cannot be directly calculated because it is originally designed for Euclidean space. We must approximate it as follows: x i −x is calculated using the tree distance between the malware i and the central malwarex. The central malware is defined as the malware that has the smallest sum of squares of the column elements in the Algorithm 3 Algorithm for Clustering a Phylogenetic Tree Require: Phylogenetic tree T Ensure: Clusters Tc 1: Function Cluster(C) 2: find the central node of C. 3: divide the cluster C into C 1 , C 2 , C 3 at the central node. 4: if DL(C 1 , C 2 , C 3 ) > DL(C) then 5: Tc = Tc ∪ C 6: end if 7: if DL(C 1 , C 2 , C 3 ) < DL(C) then 8: Cluster(C 1 ) 9: Cluster(C 2 ) 10: Cluster(C 3 ) 11: end if 12: EndFunction 13: Tc = ∅ 14: Tc = Cluster(T ) 15: for T i ∈ Tc do 16: if |T i | < 100 then 17: Tn= T i 's nearest cluster 18: C= merge T i and Tn 19: while count < 10 do 20: if DL(T i , Tn) > DL(C) then 21: Tc= Tc − T i − Tn Tn=T i 's next nearest cluster 26: count++1 27: end if 28: end while 29: end if 30: end for cluster's tree distance matrix. m j becomes a free parameter that determines the fineness of the division of the phylogenetic tree. Note that different information criteria like Akaike information criterion (AIC) or Bayesian information criterion (BIC), or MDL Criterion are only different at the weight of m j , so since we regard m j as a free parameter and use parameter tuning to choose the best m j , apply any information criterion is actually the same.
The phylogenetic tree is a 3-regular graph (which means every node has three neighbors); therefore, when dividing a cluster (subtree), it will be divided into three subtrees, viz., C 1 , C 2 , and C 3 . We divide a cluster at its central node, which is defined as the node which has the smallest sum of the squares of the column elements in the tree distance matrix. The description length of the divided cluster is calculated by: + 3m j log n + n(log 2π + 1). Here, x k is the malware sample contained in the cluster C k , and n k is its number.
where e is the central malware of cluster T i 4: end for 5: calculate the distances d ij for (i, j) ∈ S × L 6: for each malware l ∈ L do 7: insert l into nearest cluster 8: end for 9: for each T i ∈ Tc do 10: re-construct the phylogenetic tree of T i 11: end for 12: join all clusters into new phylogenetic tree T ′ If DL(C 1 , C 2 , C 3 ) > DL(C), the cluster before the division is considered better, and the division is rejected.
If DL(C 1 , C 2 , C 3 ) < DL(C), the cluster after division is considered to be better, and the cluster will be divided into three clusters. The determination of division will continue for each small cluster.
While dividing the phylogenetic tree, many small clusters are formed. After completing the division, we merge these small clusters into their respective close clusters. The distance between clusters is defined by the tree distance between their central nodes. The merging process was conducted targeting the clusters containing one hundred or fewer specimens with other clusters in the order of closer distance. The merging is only done when the description length DL(C) of the cluster after merging is smaller than the description length DL(C 1 , C 2 ) of the clusters before merging. The pseudocode of the algorithm is shown in Algorithm 3.

C. ONLINE PROCESSING ALGORITHM
Since new malware specimens are collected by honeypots every day, reconstructing the phylogenetic tree each time new specimens are added to the dataset will be time-consuming. Therefore, we propose an online processing algorithm that adds new specimens to the existing phylogenetic tree.
After creating a phylogenetic tree and clustering it, we calculate the distance between center specimens of every cluster and all new specimens. The center specimen of a cluster is the specimen that has the minimum sum of the square of the distance to other specimens in the cluster. We assign each new specimen to its nearest cluster. To determine where the new specimens should be inserted, we re-create the phylogenetic tree of each cluster using the neighbor-joining method. Note that the distance matrix of the original cluster has been calculated; therefore, the new distance matrix only requires an extremely small amount of NCD computations. Finally,  all clusters are joined back to a phylogenetic tree. The pseudocode of the algorithm is shown in Algorithm 4.
The evaluation of the proposed algorithm is introduced in the next section.

V. EXPERIMENT AND RESULTS
In this section, we evaluate our algorithm with 64,494 IoT malware. In particular, it shows 1) how well our fast algorithm reduces the computational cost compared with the neighbor-joining method, 2) how much the MDL Criterion improves the clustering accuracy compared with Inconsistency Coefficient, and 3) how well our online processing algorithm reduces the computation time while maintaining the clustering accuracy.

A. DATASET
We collected 65,494 Linux malware specimens (mostly IoT malware) from VirusTotal, from November 2018 to February 2019. The breakdown of the dataset is shown in Table 3. The malware family name was decided by AVClass [28]. AVClass is implemented as a Python tool to label malware samples using VirusTotal JSON reports as input.

B. EXPERIMENTAL SETUP
In this subsection, we introduce the detailed setup and evaluation metrics of our experiments.

1) SETUP
For performance comparison, we constructed phylogenetic trees with both the our fast algorithm and the neighbor-joining method. We then clustered them with both the MDL Criterion and the Inconsistency Coefficient, which is the method we applied previously. To evaluate the clustering performance, we used 10-fold cross-validation. After clustering 90% of the specimens, we assigned the remaining 10% to their nearest cluster to calculate the accuracy.
Moreover, to investigate the impact of dataset size on computational cost reduction and clustering accuracy, the experiment included two parts: First, we evaluated our algorithm with a similar number of specimens as in our previous study: 4,000 specimens, which are randomly selected from the malware set. Second, we evaluated our algorithm with 65,494 specimens.
In the online processing experiment, we compared it to batch processing. The results show that we saved the time of reconstructing the phylogenetic tree when new specimens are added to the dataset while maintaining the same level of clustering accuracy.

2) PARAMETER TUNING
The size k of the Seed set S was set to 1% of the number of specimens. The recursive computation threshold h was set to 5000. The clustering parameter m j was chosen from 4, 5, 7, 10, 15, 20, 25, 30.

3) IMPLEMENTATION
We implemented our algorithm using R. The compression program xz command (version 5.1.0 alpha) of Linux was used to compress malware binaries for computing NCDs. xz adopted the Lempel-Ziv-Markov chain algorithm (LZMA) [27] for compression.
As a benchmark for our algorithm, we used a conventional scheme: it first compressed every pair of 65,494 malware binaries to compute the NCDs between them, and then constructed the phylogenetic tree of the malware with the neighbor-joining method described in subsection III-B. Because of the large size of the dataset, using R's library to run the neighbor-joining method was impossible; therefore, we used the library available for the textttJulia programming language as an alternative [29].
The experiment was conducted on a 2.6 GHz Intel-Xenon-Gold-6126 CPU. In our algorithm and the conventional scheme, the compression of NCD was computed in parallel with 80 threads.

1) RCR:
To measure how well the compression attempts are reduced by our fast algorithm, we define the rate of compression-attempt reduction RCR as follows: where N is the number of specimens, which equals 65,494. The RCR decreases as the compression attempts are further reduced by our algorithm. VOLUME 11, 2023  2) clustering accuracy: The clustering accuracy was evaluated by a 10-fold cross-validation. Specifically, the malware set was divided into ten parts, and clusters were created using 90% of the specimens. Each of the remaining 10% test specimens was assigned to their nearest cluster. If the family name of the cluster and the family name of the test specimen is the same, then the test specimen is correctly clustered. The architecture name accuracy is calculated in the same way. The distance between a specimen and cluster is defined by the NCD between the specimen and cluster's center specimen. The cluster's family name is decided by the majority specimen's family name in that cluster.

1) SMALL-SCALE EXPERIMENT WITH 4,000 SPECIMENS
This experiment is designed to investigate the impact of dataset size on RCR and clustering accuracy, which will be introduced in next subsubsection. In the experiment using 4,000 specimens, the RCR was 94.01%. The clustering accuracy is shown in Fig. 3. The red line represents our algorithm, and the green line represents the neighborjoining method. After constructing the phylogenetic trees, they were both divided into clusters based on the MDL Criterion. In Fig. 3(a) and Fig. 3(b), the horizontal axis is the parameter m j introduced in subsection IV-B that determines how finely the phylogenetic tree is divided. As m j becomes smaller, the phylogenetic tree is divided into smaller clusters, whose number increases exponentially, and the clustering accuracy also increases simultaneously. Our algorithm achieved a slightly higher family name clustering accuracy than the neighbor-joining method, especially when m j is large.
For the same m j , a slightly smaller number of clusters were created using the neighbor-joining method. Because the larger number of clusters, the more choices for test specimens, and the higher clustering accuracy, we changed the horizontal axis variable to the number of clusters in Fig. 4(c) and Fig. 4(d) to show the difference better. However, our algorithm still achieves a slightly higher accuracy. Fig. 3(c) and Fig. 3(d) show the accuracy achieved by our algorithm in clustering family name and architecture name, respectively. For comparison, we used different methods to construct and cluster the phylogenetic tree. The combinations of different methods are: • Our fast algorithm and MDL Criterion, represented by the red line.
• Our fast algorithm and the Inconsistency Coefficient, represented by the blue line. • The neighbor-joining method and MDL Criterion, represented by the green line.
• The neighbor-joining method and the Inconsistency Coefficient, represented by the purple line. Our fast algorithm combined with the MDL Criterion achieved the best clustering accuracy among the four methods. The results show that our proposed algorithm successfully maintains the clustering accuracies of the neighbor-joining method while significantly reducing the computational cost. Moreover, the clustering algorithm applying the MDL Criterion successfully improved the clustering accuracies than our previous work. The Inconsistency Coefficient is a clustering criterion designed by Bailey et al. [3] based on experience. We believe our method achieved a better result because we have a better theoretical basis. 8248 VOLUME 11, 2023 FIGURE 5. Phylogenetic tree constructed using our algorithm.

2) LARGE-SCALE EXPERIMENT WITH 65,494 SPECIMENS
The phylogenetic tree constructed using 65,494 specimens is shown in Fig. 5, where every malware is colored according to its family name. The red points represent Bashlite, the green points represent Mirai, and other colors represent other families. Our fast algorithm reduced the number of compression attempts by 97.52 % compared with the neighbor-joining method. This result shows that the larger the dataset, the higher the RCR achieved. This is because our fast algorithm reduces the computational cost from O(N 2 ) to O (N log N ), and N / log N increases with N . To calculate the input NCD matrix of the neighborjoining method, (65, 494 2 + 65, 494)/2 pairs of malware specimen are compressed in 80 threads with a 2.6 GHz Intel-Xenon-Gold-6126 CPU, which took 110 days. As for the neighbor-joining algorithm, we used the PhyloNetworks [29], a library of Julia, to construct the phylogenetic tree. The neighbor-joining algorithm was executed in 1 thread and took ten days. On the contrary, our method calculated only 2.48% of the NCD matrix. Including the time for tree construction, it only took three days.
It can be seen from Fig. 4(c) and Fig. 4(d) that our algorithm achieved a slightly higher clustering accuracy than the neighbor-joining method. In addition, it achieved the best accuracy among the four methods. Comparing the red and the blue lines, our clustering algorithm based on MDL Criterion significantly improved the family name clustering accuracy and architecture clustering accuracy. For instance, when m j = 7, the phylogenetic tree was divided into 1808 clusters. The family name accuracy of the MDL Criterion was 94.1%, and the architecture name accuracy was 99.1%. Compared with the Inconsistency Coefficient with the same number of clusters, the family name accuracy was 89.6%, and the architecture name accuracy was 96.1%, which is an improvement of 4.5 percentage points of family name accuracy and 3.0 percentage points of architecture name accuracy. This result also shows that the larger the dataset, the higher the clustering accuracy achieved. We believe this is because the size of the seed set is set to 1% of the size of the dataset. The larger dataset can result in a smaller variation of the seed set, which means the randomly picked seed set can better represent the whole dataset.
We randomly pick one case in the 10-fold cross-validation experiment and show the confusion matrix of the test malware set in Table. 4. Although we used an unbalanced dataset, the proportion of the Bashlite and Mirai family groups is over 86%, but our algorithm can cluster minor families successfully. Table. 4 shows that we achieve high accuracy for most minor families.

3) EXPERIMENT OF ONLINE PROCESSING ALGORITHM
For the purpose of simulating the continuous addition of new specimens to the dataset, we used 54% specimens to construct the phylogenetic tree and added 9% specimens into the phylogenetic tree using our proposed algorithm at one time. After online processing for four times, which adds 36% specimens into the phylogenetic tree, we used the last 10% specimens to test the clustering accuracy. Fig. 6(a) shows that our online processing algorithm achieved close accuracy compared to batch processing, the clustering accuracy was only reduced by about 1%. Fig. 6(b) shows that our online processing algorithm not only saved the time of re-construct a new phylogenetic tree but also reduced the computational cost of the NCD matrix. When m j = 7, the malware family name accuracy of batch processing was 94.1%, the accuracy of the online processing was 93.2%, and the computation rate of the NCD matrix was reduced from 2.48% to 1.66%. In Fig. 6(b), batch processing is a straight line because its computational cost does not rely on clustering, in another word, the parameter m j will not affect its computational cost. The computation rate is determined when the phylogenetic tree is constructed.
Our online processing algorithm saves the time of reconstructing the phylogenetic every time some new specimens are added to the dataset while maintaining the clustering accuracy.
We also investigated if the size of the phylogenetic tree or the properties of the specimens to be inserted will influence the clustering accuracy. We designed two different kind of experiments.
In experiment 1, we investigated how the size of the phylogenetic tree affects the online processing algorithm's accuracy. The specimens for constructing the phylogenetic tree and the specimens to be inserted into the phylogenetic using the online processing algorithm are both randomly selected from our 65,494 specimens dataset. The size of the phylogenetic tree and the size of the specimens to be inserted to the phylogenetic tree at one time are set up as follows: • 1) 5000 and 1000, the experiment result is shown in Fig. 8(a).
• 2) 10000 and 2000, the experiment result is shown in Fig. 8(b). VOLUME 11, 2023 FIGURE 6. Results of online processing algorithm compared with batch processing.
• 3) 20000 and 4000, the experiment result is shown in Fig. 8(c). The larger size of the phylogenetic tree is, the higher clustering accuracy is achieved. This is because the larger size of the phylogenetic tree contains more clusters, which provides a higher probability for the test malware to find the nearer cluster. But the clustering accuracy did not fall as the online processing going on. We believed this is because that the online processing malware set has the similar properties with the malware in the phylogenetic tree.
In the experiment 2, we investigated how the properties of the new specimens affects the online processing algorithm's accuracy. Fig. 7 shows the average distance among specimens in different periods. We can see from  Fig. 9: if the properties of the online processing specimens changed from the original specimens, the clustering accuracy of the family name was decreased to under 70%. The accuracies rise as the online processing goes on because the more specimens are inserted into the phylogenetic tree, the more similar the specimens in the phylogenetic tree and the specimens to be inserted are. After inserting enough specimens, the specimens in the phylogenetic tree and the specimens to be inserted will have similar properties, which lead to the accuracies being steady, like in experiment 1. The purple and blue lines show the clustering accuracies when recreating the phylogenetic tree after adding new specimens to the dataset (batch processing). Therefore, when the properties of the online processing specimens change from the original specimens, a recreation of the phylogenetic tree to maintain the clustering accuracy is needed.

A. PERFORMANCE GUARANTEE
Our phylogenetic tree construction algorithm is a randomized algorithm; therefore, there is no performance guarantee. However, we measured the standard deviation of the clustering accuracy of the 10-fold cross-validation in Fig. 4. As shown in the figure, the standard deviations of the clustering accuracy are small. For instance, the standard deviation is 0.33% when m j = 7, and the accuracy of the malware family name is 94.07%. Therefore, it is reasonable to believe that the randomness does not significantly affect the clustering result.

B. LIMITATION
Since the similarity of malware specimens is measured by NCD, it depends on the binary similarity between the two specimens. Thus, encrypted or packed malware whose binary has been significantly changed is difficult to cluster. Although unpack tools can solve this problem to some extent, we currently cannot deal with those packed malware that cannot be unpacked.

C. NCD's ADVANTAGES
We applied NCD for feature extraction that directly calculates the similarities between malware's binary code. Compared to other Sota feature extraction methods, our method does not need the dynamic analysis environment or the time to analyze each malware. Sota feature extraction method using dynamic analysis is effective but it takes time to collect a large amount of data because the malware must be executed at least once before the logs can be collected. Moreover, static analysis using the disassembler tool IDA pro is also a Sota feature extraction method. However, it also takes a long time to disassemble all the malware. Another advantage of the NCD approach is its flexibility. It does not require any domain knowledge and can apply to many other data formats, such as text, video, and music, without requiring re-designed.

D. PHYLOGENETIC TREE's ADVANTAGES
Our method created the largest phylogenetic tree and achieved a high clustering accuracy at about 94% to 95%.   But if only regarding the clustering scalability and regardless of the phylogenetic tree construction, many studies have achieved better scalability. Compared to these studies, the advantage of constructing a malware's phylogenetic tree is that clustering using a phylogenetic tree can provide more information about the malware's evolution path, which is connected to our future work.

E. FUTURE WORKS
For our future works, we are attempting to build a hybrid platform that integrates various cyber threat analysis methods [30]. As part of this project, we are working on the challenge of static analysis of malware.
Kawasoe et al. [18] have proposed investigating malware specimen functions by generating Function Call Sequence Graphs (FCSG). Our future work aims to combine the method of Kawasoe et al. and the clustering results of this study to capture the functionality transition and evolution of malware variants.
As another future work, we are using information theory to choose the seed smartly in the recursion part of the phylogenetic tree construction to reach a better computation reduction rate.
Moreover, we are also using active learning to develop a new algorithm for constructing a phylogenetic tree [13]. The active learning algorithm allows us to smartly select the most informative part of the distance matrix to calculate. We expect this method would achieve better scalability.

VII. CONCLUSION
In this paper, we proposed a scalable, efficient, and automatic method for large-scale malware clustering. We presented a fast algorithm for constructing the phylogenetic tree and an algorithm applying the MDL Criterion to clustering it. The computational cost of the fast algorithm is O (N log N ), which significantly reduced the computational cost than the conventional method. Furthermore, we proposed an online processing algorithm that can reduce the computation cost in actual operation by skipping the phylogenetic tree reconstruction while maintaining the clustering accuracy. We evaluate our algorithms' scalability and clustering accuracy using a large-scale IoT malware set.
Our experiments using 65,494 IoT malware specimens show that our fast algorithm reduced the computational cost by 97.52 % compared to the neighbor-joining method. By improving the clustering algorithm using MDL Criterion, our clustering algorithm achieved significantly higher accuracy than our previous method. In the best case, the family name clustering accuracy was 95.5% and the architecture name accuracy was 99.3%. Furthermore, the online processing algorithm reduced 33% of the computational cost than the batch processing algorithm while maintaining a clustering accuracy of 94%. Our experiment results show that our method successfully reduces a significant amount of computational cost. The O(N log N ) complexity makes our algorithm scalable for large-scale malware sets. To the best of our knowledge, our work constructed the largest malware phylogenetic tree. Moreover, it was previously impossible to construct a malware phylogenetic tree at our size in a realistic time. Now, we can construct and cluster it in real-time.

APPENDIX. COMPUTATION COST
In this appendix section, we will show the detailed explanation of how our algorithm reduces the computation cost. As a randomize algorithm, In the worst case, the time complexity of our algorithm is O(N 2 ) and the computation reduction rate is 0%. But in most case, the time complexity of our algorithm is O(N log N ) and the computation reduction rate is over 95%.
Let N be the size of dataset L, k be the size of seeds set S and h be the threshold. The worst case is that all data in set L\S are assigned to a single subset, for example Z (e 0 ), and other subsets Z (e 1 ) to Z (e k−1 ) remain empty. And if this happens every time we recursively applying our algorithm for subset Z (e 0 ), our algorithm will finally calculate the whole distance matrix.
However, in usual case, these k data randomly picked from the dataset can represent the whole dataset to some extent. for the convenience of computation, we assume that data in set L\S are splited into Z (e 0 ) to Z (e k−1 ) equally. N data are recursive divided into k subsets until the size of the subset is smaller than h, so the recursive calculations will totally be log N h k −1 times. The computation cost of seed set is represented by the red parts in Fig.10, and it is calculated as: 1 2 k 2 (1+k +k 2 +, . . . , +k log The computation cost for assigning the L\S is represented by the green parts, and it is calculated as: (N log N ).
Finally, the computation cost of neighbor-joining method is represented by the blue parts, and it is calculated as: