Malware detection based on semi-supervised learning with malware visualization

Abstract: The traditional signature-based detection method requires detailed manual analysis to extract the signatures of malicious samples and a large amount of manual labeling to maintain the signature library, which brings great time and resource costs and makes it difficult to keep up with the rapid generation and mutation of malware. Methods based on traditional machine learning often require a lot of time and resources for sample labeling, which leaves a large stock of unlabeled samples that cannot be used directly. In view of these issues, this paper proposes an effective malware classification framework based on malware visualization and semi-supervised learning. The framework mainly includes three parts: malware visualization, feature extraction, and the classification algorithm. Firstly, binary files are processed directly through visualization, without disassembly, decompression, or decryption. Then the global and local features of the grayscale image are extracted, and the extracted visual features are fused by a dedicated feature fusion method to eliminate the incompatibility between different feature variables. Finally, an improved collaborative learning algorithm is proposed to continuously train and optimize the classifier by introducing the features of inexpensive unlabeled samples. The proposed framework was evaluated on two extensively studied benchmark datasets, i.e., Malimg and Microsoft. The results show that, compared with traditional machine learning algorithms, the improved collaborative learning algorithm can not only reduce the cost of sample labeling but also continuously improve model performance through the input of unlabeled samples, thereby achieving higher classification accuracy.


Introduction
Malware has become a major threat to network security [1]. Traditional signature-based methods extract binary signatures from malware to build a huge feature library, which provides comprehensive information about malicious samples but requires much time and effort [2]. Meanwhile, the enormous number of malware variants also brings great challenges to signature-based detection methods.
In recent years, many new malware detection methods have been proposed, ranging from multi-signature methods to static analysis, dynamic detection, and heuristic detection. However, anti-detection technology has also improved constantly: malware changes its features through object-code obfuscation, code refactoring, and so on. According to reports from Symantec and McAfee, approximately 69 new instances of malware are generated per minute, and more than 50% of them are variants of existing malware [3]. Traditional feature extraction methods cannot afford the huge cost of manually labeling new samples. These new variants usually have the same malicious intentions and characteristics as the original malware [4,5]. Such a group of malware samples with similar attacking patterns is called a malware family. Recognizing malware families relies on quickly analyzing the behaviors and functions of malware.
In the face of these new challenges, some researchers have explored malware features using machine learning technologies [4][5][6]. Nataraj et al. proposed categorizing malware families by visualizing malware [7]. The method not only reveals the visual similarity between different samples of the same family but also copes with common obfuscation techniques. Accordingly, neural networks have also been utilized to analyze visualized malware and have achieved promising results [8,9]. However, due to the complexity of neural networks, the huge cost of the training process makes it hard to catch up with the rapid growth of malware variants [10][11][12]. Besides, most effective neural networks require supervised learning, which needs human experts and special tools to abstract the features and labels of new samples, an extremely expensive and inefficient process [13].
Regarding these problems, we propose a malware detection model based on malware visualization and collaborative learning. In the model, the malware binaries are first visualized as grayscale images, and then the malware family features are automatically extracted from the images using image feature extractors such as the LBP extractor. Finally, the features are sent to a cooperative learning model with multiple classifiers to recognize malware. Furthermore, noise learning theory is incorporated into the training process to exclude noise in the unlabeled samples and ensure that the model's misclassification rate can be continuously reduced.
The contributions in this paper are summarized as follows: • We propose a malware classification framework that integrates malware visualization, automatic feature extraction, and collaborative learning. The framework directly processes malware binaries without disassembly, decompression, or decryption.
• The framework continuously improves the classification ability of the model through the introduction of unlabeled samples, which alleviates the lack of labeled malicious samples in real-world scenarios.
• Comparative experiments were conducted on two widely studied imbalanced benchmark datasets, Malimg and Microsoft. Experimental results show that the proposed framework achieves excellent classification performance, with accuracies of 0.98 and 0.94, respectively. Compared with state-of-the-art methods, our method is more resistant to the effects of data imbalance.

Development of malware detection
With the development of machine learning and visualization technology, researchers have begun to draw on the visualization ideas from the fields of computer forensics and network monitoring to visualize and classify malware [7]. First, the raw binary data of the malware are converted into grayscale images, in which each byte is represented as one grayscale pixel. The byte sequence is then arranged into a 2D array, and feature extraction is performed through texture analysis, converting malware detection into an image classification task. In this way, not only can the characteristic information of the software be visualized, but the detection efficiency is also improved compared with traditional methods [14][15][16]. Furthermore, in contrast to traditional static analysis methods, malware visualization is better suited to malicious samples protected by obfuscation technology [17].
Recently, many visualization-based malware detection methods have been proposed [18,19]. However, these methods still have some shortcomings, including the lack of labeled samples available in real applications and the great gap between feature extraction algorithms and visual feedback [20]. Therefore, we apply a semi-supervised learning algorithm to alleviate the issue of insufficient labeled samples and continuously improve the classification performance by exploiting unlabeled samples together with noise learning theory.

Semi-supervised learning algorithm
Supervised algorithms have achieved promising performance in malware detection; however, they rely on plenty of labeled samples for training, which is difficult to satisfy in real applications [20,21]. On the other hand, unsupervised learning can employ unlabeled samples for training but often yields lower accuracy [22]. Compared with these two types of algorithms, a semi-supervised learning algorithm needs only a small number of labeled samples in the training stage and can continuously enhance detection performance through the use of a large number of unlabeled samples [23,24]. Consequently, semi-supervised learning is more suitable for malware detection applications.
The semi-supervised co-training algorithm [25] assumes two redundant and conditionally independent feature views of the data. At the initial training stage, some labeled samples are submitted to two base classifiers in the different feature views. After initial training, unlabeled samples with high label confidence are selected, and these "pseudo-label" samples are put into the updating set for further training. Through this process of "learning from each other and making progress together", the classifiers are iteratively updated in each training round until their performance stabilizes. However, the conditional independence of the two feature views is difficult to satisfy. S. Goldman and Y. Zhou proposed improving the classifiers by collaborative learning [26]. Although this method removed the requirement of redundant feature views, it still restricts the types of base classifiers, and the repeated ten-fold cross-validation in the updating process incurs an overwhelming cost. In response, Zhi-Hua Zhou et al. proposed the Tri-training algorithm, which neither requires sufficient redundant views nor restricts the type of classifiers [27]. By using three collaborative classifiers, the algorithm easily handles labeling-confidence estimation and the predictive classification of unknown samples.
In this paper, the idea of noise learning theory [28] is utilized to improve collaborative learning based on the original Tri-training algorithm. In each iteration, part of the "pseudo-label" samples is extracted for error-rate calculation and threshold evaluation to reduce the labeling error rate of the "pseudo-labels". For a detailed description of the model, see Section 3.

Methodology
The work of this paper focuses on the detection of malware. Figure 1 shows the architecture of our malware classification model, which mainly consists of three major components: malware visualization, feature extraction, and the tri-training classification algorithm.

Malware visualization
As shown in the first stage of Figure 1, malware visualization transforms the binary code into an image carrying characteristic information [7,14]. For a given binary file, every 8 bits are read as one unsigned integer, and the resulting values are reorganized into a two-dimensional matrix. Each value in the matrix is interpreted as a gray value of the generated image in the range [0, 255], where 0 and 255 represent black and white respectively. Figure 2 shows the process of visualizing a binary malware file as a grayscale image. The images converted from malware have a fixed width and varying heights. The image width must be chosen carefully according to file size, lest the generated images become extremely tall or wide, which degrades the performance of feature extraction [7]. Table 1 gives some recommended image widths for malicious samples.
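As an illustrative sketch of this conversion (standard library only; the function name is ours, and the 256-pixel default width is just one of the recommended values, appropriate for files in the corresponding size range of Table 1):

```python
def binary_to_grayscale(data: bytes, width: int = 256):
    """Read a malware binary as unsigned 8-bit integers and reshape them into
    a 2-D grayscale matrix: one byte = one pixel in [0, 255].
    `width` should follow the file-size-based recommendations; 256 is only an
    illustrative default, not the rule for every file size."""
    rows = len(data) // width                      # height varies with file size
    return [list(data[r * width:(r + 1) * width])  # trailing partial row dropped
            for r in range(rows)]

# Example: a 1 KiB "file" becomes a 4 x 256 grayscale image.
img = binary_to_grayscale(bytes(range(256)) * 4, width=256)
```

The resulting matrix can be saved as a grayscale image or passed directly to the feature extractors of the next stage.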

Feature extraction
Taking code obfuscation technologies such as fragment encryption and instruction substitution into consideration, the texture features of the generated grayscale image may be disturbed and deformed. Hence, to ensure that the image features used in training are stable and robust, both the local texture features and the global features of the images are extracted and fused through the canonical correlation analysis (CCA) method [29].

Feature selection
Before feature extraction, the SIFT [30], HOG [31], and LBP [32] features were analyzed and compared for the local feature extraction of the malicious samples' gray images [33]. SIFT features are invariant to rotation, brightness, and scale changes, but demand enormous computation. Since gamma correction is performed in the image grayscale calculation, the HOG feature can reduce the negative impact of local shadows and illumination changes. Nevertheless, HOG extraction has some defects, such as a lengthy generation process and sensitivity to noise, which lead to high costs. Thus, a more stable and efficient feature extraction method is required.
Many malicious samples are derived from a few classic malware programs. Owing to the structural similarity within a malware family, the pixels of the converted gray images keep similar proportions globally or in local areas [7,20]. For LBP feature extraction, the code depends only on the relative sizes of the central pixel and its neighborhood, so it remains unchanged even if the overall gray level of the neighborhood shifts [34,35]. Furthermore, the LBP descriptor adapts well to image rotation. Owing to these properties, the LBP method has high adaptability and robustness for the detection of different groups of malicious samples.
Therefore, LBP is utilized to extract the local features. It employs a circular descriptor window: for a neighborhood of $P$ pixels $\{g_0, g_1, \ldots, g_{P-1}\}$ sampled on a circle of radius $R$ around a central pixel with gray value $g_c$, the LBP encoding is

$$LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p, \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \tag{1}$$

where $g_c$ and $g_p$ represent the gray values of the central pixel and the circular neighborhood pixels respectively, and $R$ is the radius of the neighborhood. For such a circular LBP operator, the relative positions of the center pixel and its neighbors change with image rotation, resulting in different LBP values. Consequently, we adopt the uniform rotation-invariant LBP operator, which adapts to image rotation and resists the noise caused by the large number of patterns generated in circular neighborhoods of different sizes:

$$LBP_{P,R}^{riu2} = \begin{cases} \sum_{p=0}^{P-1} s(g_p - g_c), & U(LBP_{P,R}) \le 2 \\ P + 1, & \text{otherwise} \end{cases} \tag{2}$$

where $U(LBP_{P,R})$ represents the uniformity measure, i.e., the number of transitions between binary bits 0/1 around the circular pattern:

$$U(LBP_{P,R}) = |s(g_{P-1} - g_c) - s(g_0 - g_c)| + \sum_{p=1}^{P-1} |s(g_p - g_c) - s(g_{p-1} - g_c)| \tag{3}$$

The LBP feature offers stability and noise resistance for regional description, but it focuses on local features and lacks a global description of the image. Therefore, after extracting the LBP features of the malicious sample image, a global feature of the image is also extracted, with noise reduced through Gaussian blurring. The image is divided into grids of average size $16 \times 16$, and the global feature consists of the mean and variance of the pixel intensity in each grid, where the two-dimensional Gaussian function shown in (4) is employed to calculate the weight of each pixel:

$$G(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \tag{4}$$

where $\sigma^2$ is the variance of the Gaussian distribution.
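A minimal per-pixel sketch of the uniform rotation-invariant encoding of Eqs. (1)-(3) (pure Python; `neighbors` is assumed to already hold the $P$ circularly sampled gray values $g_p$, with interpolation omitted for brevity):

```python
def lbp_riu2(neighbors, center):
    """Uniform rotation-invariant LBP code LBP^{riu2}_{P,R} for one pixel.
    neighbors: gray values g_0..g_{P-1} sampled on the circle of radius R;
    center: gray value g_c of the central pixel."""
    s = [1 if g >= center else 0 for g in neighbors]   # s(g_p - g_c)
    P = len(s)
    # U: number of 0/1 transitions around the circle (wrap-around included)
    U = sum(s[p] != s[p - 1] for p in range(P))
    # uniform patterns (U <= 2) map to their count of 1-bits; all others to P+1
    return sum(s) if U <= 2 else P + 1
```

Because the code depends only on the signs of $g_p - g_c$, a simultaneous shift of all gray levels leaves it unchanged, which is the robustness property discussed above.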

Feature fusion
To enhance the correlation between features and images, canonical correlation analysis (CCA) is applied to fuse the image features after the global and local features are extracted. CCA is an effective multi-data processing method [36] that can mine the potential associations between two sets of variables to obtain more representative data. The general idea of CCA is to find, through matrix computation, the linear combinations of the two groups of variables with the maximum correlation coefficient, thereby establishing the relationship between the two groups [37]. The CCA fusion of the two feature sets $X$ (local) and $Y$ (global) produces a structure containing the null-hypothesis ($H_0$) statistics and related information, and the dimensionality of the fused feature takes the minimum of the ranks of the two feature matrices: $d = \min(\mathrm{rank}(X), \mathrm{rank}(Y))$. The relevant statistics of the fusion are shown in Table 2.
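A compact sketch of CCA via whitening and SVD (NumPy; the function name is ours, `d` is the fused dimensionality bounded by $\min(\mathrm{rank}(X), \mathrm{rank}(Y))$, and summing the two projected views is one common fusion choice assumed here for illustration):

```python
import numpy as np

def cca_fuse(X, Y, d):
    """Canonical correlation analysis of two feature views X (local) and
    Y (global), followed by a simple sum fusion of the projected views.
    Returns the fused (n, d) features and the top-d canonical correlations."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    # thin SVDs whiten each view: columns of Ux/Uy are orthonormal bases
    Ux, Sx, Vxt = np.linalg.svd(Xc, full_matrices=False)
    Uy, Sy, Vyt = np.linalg.svd(Yc, full_matrices=False)
    # canonical correlations = singular values of the whitened cross-product
    U, corrs, Vt = np.linalg.svd(Ux.T @ Uy)
    Wx = Vxt.T @ np.diag(1.0 / Sx) @ U[:, :d]      # projection for view X
    Wy = Vyt.T @ np.diag(1.0 / Sy) @ Vt.T[:, :d]   # projection for view Y
    return Xc @ Wx + Yc @ Wy, corrs[:d]
```

Since the canonical correlations are scale-invariant, the whitening factors cancel, and the projected dimensionality `d` is what yields the reduced training cost reported later.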
Tri-training classification algorithm
In tri-training, each classifier $h_i$ is updated with unlabeled examples that the other two classifiers $h_j$ and $h_k$ agree on. If the prediction of $h_j$ and $h_k$ on an example is correct, $h_i$ will get a valid new example for further training; otherwise, $h_i$ will get a noise example with an incorrect label.
According to the noise learning theory [28] of Angluin and Laird, for a training sequence containing $m$ samples, a classifier whose error is bounded by $\varepsilon$ can be learned if the sample size satisfies

$$m \ge \frac{2}{\varepsilon^2 (1 - 2\eta)^2} \ln\frac{2N}{\delta} \tag{8}$$

where $\varepsilon$ is the upper limit of the classifier error, $\eta$ is the classification error (noise) rate of the training set, $N$ is the number of hypotheses, and $\delta$ is the confidence parameter. The PAC (probably approximately correct) judgment for the true classification $h^*$ is then

$$\Pr[d(h_i, h^*) \ge \varepsilon] \le \delta \tag{9}$$

where $d(h_i, h^*)$ represents the sum of the probabilities of the elements of the symmetric difference between the hypothesis (classifier) $h_i$ and the true classification $h^*$. Given the confidence parameter $\delta$ and the error upper limit $\varepsilon$, formula (8) can be rearranged as

$$\varepsilon^2 \ge \frac{2}{m (1 - 2\eta)^2} \ln\frac{2N}{\delta} \tag{10}$$

That is, for the given $\delta$ and $\varepsilon$, we need to guarantee that

$$m \varepsilon^2 (1 - 2\eta)^2 \ge 2 \ln\frac{2N}{\delta} \tag{11}$$

To simplify the calculation, let $c = 2 \ln\frac{2N}{\delta}$, the coefficient that makes equation (11) hold with equality; then we obtain

$$\varepsilon^2 = \frac{c}{m (1 - 2\eta)^2} \tag{12}$$

It can be seen from formula (12) that the square of the error upper limit $\varepsilon$ is inversely proportional to $m(1 - 2\eta)^2$. Samples temporarily labeled by two of the classifiers in each round are called pseudo-labeled samples. Since the number of unlabeled examples selected in each round of tri-training is not fixed, let $L_t$ be the pseudo-labeled sample set of the $t$-th round and $\eta_t$ the noise rate of that round's training set; the sample size of the round is then $m_t = |L| + |L_t|$, where $L$ is the original labeled set. Compared with the previous round, if the training result of this round is to improve, condition (13) must be satisfied:

$$m_t (1 - 2\eta_t)^2 > m_{t-1} (1 - 2\eta_{t-1})^2 \tag{13}$$

That is, updating the classifier with the $|L_t|$ newly labeled samples introduced in this round can continue to improve the classification performance, because the error upper limit of the $t$-th round is then lower than that of the $(t-1)$-th round; otherwise, the newly labeled samples of this round are abandoned and sampling restarts from the unlabeled set $U$.
For equation (13), the noise rate of the round-$t$ training set is $\eta_t = \frac{\eta_L |L| + e_t |L_t|}{|L| + |L_t|}$, where $\eta_L$ and $e_t$ represent the error rates of the classifier on the labeled sample set and on the pseudo-labeled sample set during the $t$-th round of training, respectively. Substituting this expression into equation (13) to ensure that the training process continuously reduces the error upper limit $\varepsilon$, we get

$$(|L| + |L_t|)\left(1 - 2\,\frac{\eta_L |L| + e_t |L_t|}{|L| + |L_t|}\right)^2 > (|L| + |L_{t-1}|)\left(1 - 2\,\frac{\eta_L |L| + e_{t-1} |L_{t-1}|}{|L| + |L_{t-1}|}\right)^2 \tag{14}$$

In most cases, the classification error rate of the model on the labeled samples satisfies $\eta_L \ll e_t$ and can thus be ignored, so equation (14) can be simplified as

$$0 < \frac{e_t}{e_{t-1}} < \frac{|L_{t-1}|}{|L_t|} < 1 \tag{15}$$
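The acceptance test of Eq. (15) reduces to a few comparisons; a hedged sketch (the function name and the explicit guard against degenerate rounds are ours):

```python
def accept_pseudo_labels(e_t, n_t, e_prev, n_prev):
    """Return True if the round-t pseudo-labeled set (error rate e_t, size n_t)
    satisfies Eq. (15) against the previous round (e_prev, n_prev):
        0 < e_t / e_prev < n_prev / n_t < 1,
    i.e. the error rate shrinks faster than the pseudo-labeled set grows,
    so the bound of Eq. (13) keeps falling."""
    if e_t <= 0 or e_prev <= 0 or n_t <= 0 or n_prev <= 0:
        return False  # Eq. (15) is a strict chain; degenerate rounds give no guarantee
    return e_t / e_prev < n_prev / n_t < 1.0
```

Equivalently, the chain requires both $e_t |L_t| < e_{t-1} |L_{t-1}|$ and $|L_t| > |L_{t-1}|$.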
Therefore, in the tri-training of the classifiers $h_i$, $h_j$, $h_k$ ($i, j, k \in \{1, 2, 3\}$, $j, k \ne i$), for two adjacent rounds of training, when the size and error rate of the newly labeled samples satisfy equation (15), the new samples given the same label by $h_j$ and $h_k$ are submitted to $h_i$ for updating; otherwise, the newly labeled samples of this round are discarded, and unlabeled samples are re-selected from $U$ for training. Since the classification error rate $e_t$ of the "pseudo-labeled" samples cannot be calculated directly, this paper adopts the idea of ten-fold cross-validation: in each round of iterative training, $1/10$ of the labeled samples are randomly selected from $L$ as a test set to estimate $e_t$, and the remaining samples in $L$ are combined with $U$ for training. The tri-training procedure with noise judgment is listed in Table 3.
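A simplified skeleton of the training loop described above (duck-typed classifiers with `fit`/`predict`; a toy nearest-centroid learner stands in for the paper's RF/KNN/LR trio, and a disagreement-based error estimate replaces the ten-fold cross-validation for brevity — all of these are assumptions of this sketch, not the paper's exact procedure):

```python
from collections import Counter

class CentroidClf:
    """Toy 1-D nearest-centroid stand-in for the paper's RF/KNN/LR classifiers."""
    def fit(self, X, y):
        sums, counts = {}, {}
        for x, lab in zip(X, y):
            sums[lab] = sums.get(lab, 0.0) + x
            counts[lab] = counts.get(lab, 0) + 1
        self.cent = {lab: sums[lab] / counts[lab] for lab in sums}
    def predict(self, X):
        return [min(self.cent, key=lambda lab: abs(x - self.cent[lab])) for x in X]

def tri_train(clfs, L_X, L_y, U_X, rounds=5):
    """Tri-training sketch: h_i is retrained on pseudo-labels that h_j and h_k
    agree on, accepted only when the Eq. (15)-style error condition holds."""
    for c in clfs:
        c.fit(L_X, L_y)
    prev = [(0.5, 0)] * len(clfs)          # (error estimate, pseudo-set size)
    for _ in range(rounds):
        updated = False
        for i, ci in enumerate(clfs):
            cj, ck = (c for k2, c in enumerate(clfs) if k2 != i)
            # pseudo-label the unlabeled pool where the other two agree
            pseudo = [(x, yj) for x, yj, yk in
                      zip(U_X, cj.predict(U_X), ck.predict(U_X)) if yj == yk]
            # estimate e_t as the error of the agreeing joint prediction on L
            agree = [(yj, y) for yj, yk, y in
                     zip(cj.predict(L_X), ck.predict(L_X), L_y) if yj == yk]
            e_t = sum(yj != y for yj, y in agree) / max(len(agree), 1)
            e_prev, n_prev = prev[i]
            if pseudo and e_t < e_prev and \
                    (n_prev == 0 or e_t * len(pseudo) < e_prev * n_prev):
                px, py = zip(*pseudo)
                ci.fit(list(L_X) + list(px), list(L_y) + list(py))
                prev[i] = (e_t, len(pseudo))
                updated = True
        if not updated:                     # performance has stabilized
            break
    return clfs

def vote(clfs, X):
    """Majority vote of the three trained classifiers."""
    return [Counter(p).most_common(1)[0][0]
            for p in zip(*(c.predict(X) for c in clfs))]
```

In practice the three classifiers should be diverse (here they would be identical), which is why the paper initializes them with differentiated sample characteristics.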

Dataset
To evaluate the performance of the proposed method, we carried out a group of comparisons based on the Malimg dataset [7] and the Microsoft Malware Classification Challenge dataset [38] (referred to as the Microsoft dataset). The descriptions of the Microsoft and Malimg datasets are shown in Table 4 and Table 5, respectively. Tables 4 and 5 show that the distributions of malicious samples in the Malimg and Microsoft datasets are imbalanced, with different proportions across sample groups. For example, in the Malimg dataset, the two Allaple families among the 25 families account for 48.6% of the total, while the remaining 23 families account for only 51.4%; in the Microsoft dataset, the Simda family among the 9 families accounts for only 0.4% of the total. For traditional supervised learning algorithms, such an imbalanced distribution of malicious samples often leads to overfitting and poor classification performance.

Feature fusion methods analysis
Firstly, the LBP features and the average grid gray intensity of the malicious sample images in the Malimg dataset are extracted. Then CCA fusion is carried out on these two features. Finally, the fusion results are submitted to the co-learning model for classifier training. The classification results are shown in Figure 3. It can be seen that, as the proportion of training data increases, the classification results with CCA are better, which is attributed to the CCA method's suppression of the repellency between the two variables. By calculating the correlation coefficient of the two one-dimensional projections obtained by linear transformation, the CCA method maximizes the correlation between the two views, thereby obtaining more discriminative data characteristics. Moreover, CCA fusion reduces the dimensionality of the data, which greatly saves the training cost of co-learning. The time comparison between traditional serial fusion and CCA fusion for different sample sizes (in seconds) is shown in Table 6, from which we can see that CCA fusion dramatically reduces the time cost.

Algorithm performance analysis
Five alternative classifiers were considered for collaborative learning. To ensure a sufficient degree of divergence among the cooperating classifiers, we eventually selected random forest, KNN, and LR as the three candidate classifiers $h_1$, $h_2$, and $h_3$ for tri-training. In the experiment, the three classifiers are first initialized with differentiated sample characteristics. Then they are trained periodically by the cooperative learning algorithm. After training, different types of malicious samples are predicted on the test set and the results are compared with traditional machine learning classifiers.
We conducted an experimental analysis of the collaborative learning algorithm on the Malimg dataset, as shown in Figure 4. In the experiment, $h_1$, $h_2$, and $h_3$ denote the selected single classical classifiers, and T denotes the cooperative learning algorithm satisfying the noise learning theory. As shown in Figure 4, due to the small sample size at the beginning, the classification performance of all training algorithms is poor. With continuous learning from unlabeled samples, each classifier steadily improves, while the classification performance of collaborative learning rises more markedly; when the number of unlabeled samples reaches 3000, the accuracy of the fused classification exceeds 95%, and it keeps rising until the number reaches 3400. The results demonstrate that the collaborative learning algorithm can continuously improve classification accuracy through the continuous utilization of unlabeled samples. However, for a dataset with imbalanced samples, the overall accuracy alone cannot comprehensively reflect the advantages of the proposed model. Consequently, in addition to the overall accuracy, the Precision, Recall, and F1-score of the model are also calculated for further evaluation. The precision $P_i$, recall $R_i$, and F1-score $F_i$ of each class are calculated first, and then the F1 scores of all classes are averaged with class-support weights to obtain the weighted-average F1 score. The calculation formulas are shown in Equations (16)-(19):

$$P_i = \frac{TP_i}{TP_i + FP_i} \tag{16}$$

$$R_i = \frac{TP_i}{TP_i + FN_i} \tag{17}$$

$$F_i = \frac{2 P_i R_i}{P_i + R_i} \tag{18}$$

$$F1_{weighted} = \sum_i \frac{TP_i + FN_i}{\sum_j (TP_j + FN_j)}\, F_i \tag{19}$$

where $TP_i$ is the number of samples correctly classified into the $i$-th category, $FP_i$ is the number of samples misclassified into the $i$-th category, and $FN_i$ indicates how many samples belonging to the $i$-th class are misclassified into other classes.
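The per-class metrics and their support-weighted averages can be computed as follows (the `per_class` dict layout is our assumption for this sketch):

```python
def weighted_prf(per_class):
    """per_class: {label: (tp, fp, fn)} with TP/FP/FN as in the per-class
    precision, recall, and F1 definitions. Returns support-weighted average
    precision, recall, and F1, each class weighted by its support tp + fn."""
    total = sum(tp + fn for tp, fp, fn in per_class.values())
    P = R = F = 0.0
    for tp, fp, fn in per_class.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        w = (tp + fn) / total            # class support weight
        P, R, F = P + w * p, R + w * r, F + w * f
    return P, R, F
```

Support weighting keeps rare families such as Simda from being drowned out by the dominant Allaple families, which is why the weighted F1 is reported alongside accuracy.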
The evaluation results on Malimg and Microsoft are shown in Table 7 and Table 8. The F1-score on the Malimg dataset reaches 97.23%, while the accuracy on the Microsoft dataset reaches 94.09%. Therefore, compared with other classic classification algorithms, our method not only reduces the cost of labeled samples but also achieves better detection accuracy.

Conclusions and future work
We propose a new malware classification model based on malware visualization and co-training of classifiers, and show that combining the malware visualization method with tri-training provides a better discriminative pattern of malware families. In this framework, the malware is first transformed into grayscale images by visualization; then a fusion method based on CCA is utilized to fuse the local and global features extracted from the gray images, reducing time cost and improving feature relevance; finally, three base classifiers are collaboratively trained under the tri-training scheme. In each round of collaborative learning, the newly labeled samples are filtered by noise learning theory, which ensures continuous improvement of the overall performance of the co-learning results and, by incorporating unlabeled data into the training process, alleviates the difficulty of obtaining labeled samples in practical applications.
The advantages of our method are manifold. First, the experimental results show that the proposed method achieves good classification performance of 0.98 and 0.94 on the Malimg and Microsoft datasets, respectively. Second, our approach is more resistant to data imbalance. Third, the tri-training algorithm improves the classification ability of the model by introducing a large number of cheap unlabeled samples and reduces the noise impact caused by the lack of labeled samples. Although the accuracy of the collaborative learning algorithm improves with iterative training, the iterations increase the time overhead. In future work, the iterative updating efficiency of the collaborative learning algorithm needs to be further improved, for example by introducing more complex models better suited to image classification.