Histogram Entropy Representation and Prototype based Machine Learning Approach for Malware Family Classification

The number of malware has steadily increased as malware spread and evasion techniques have advanced. Machine learning has contributed to making malware analysis more efficient by detecting various behavioral and evasion patterns. However, when analyzing large-scale malware datasets, malware analysis through learning models has both high temporal and spatial complexity. In order to address these problems, this work proposes a low-dimensional feature using histogram entropy and a prototype selection algorithm using hyperrectangles. The low-dimensional feature forms an L × 256 map according to the preselected parameter L. The prototype selection algorithm divides the input space into overlapping subspaces where each subspace is decided by its hyperrectangle that becomes a prototype in the same class. A set cover optimization algorithm is employed to select a small number of prototypes that construct a new training dataset. A set of prototypes selected by the prototype selection algorithm is used to classify malware families. The experiment compares the performance of machine learning models for the histogram entropy feature using both the BIG 2015 dataset and the collected dataset. The integrated approach is evaluated using learning algorithms, such as Decision Tree, Random Forest, XGBoost, and CNN. The experimental results indicate that learning models perform competitively when compared to the entire dataset, while the proposed selection approach benefits from smaller datasets and lower time complexity.


I. INTRODUCTION
Malware is software that is installed silently on computers, servers, clients, and networks to perform actions that users do not expect. Computers connected to a network are more likely to spread malware, which poses a significant threat to the advancement of information and communication technologies. Recently discovered malware spreads by combining its own evasion techniques with advanced vulnerability analysis. Detecting and responding to new or modified malware is critical to the advancement of information technology, and ongoing research and improvement efforts are required.
Malware with evasion technology can be detected through continuous monitoring, but it takes an inordinate amount of time and effort to execute and analyze dubious executables. Furthermore, it is difficult to define rules for malicious behavior and to lower high false positive rates. Malware detection technologies based on machine learning have been investigated in order to address these drawbacks [1], [2], [3].
Machine learning for malware detection explores classification rules based on feature vectors or employs similarity based metrics for classifiers. In general, classification prediction is accomplished through a learning process that discovers hidden pattern rules. Such detection methods can also distinguish intrinsic but hidden patterns between benign files and malware. However, in order to ensure a robust analysis using machine learning, a sufficient number of training examples must be collected. For malware detection systems, feature extraction methods such as opcode (operation code) [4], [5], function call graph [6], [7], string signature [8], entropy [9] and byte n-gram [10], [11] have been studied. Low-dimensional features have been studied because features with a high-dimensional representation need a huge amount of training time.
Malware developers are constantly creating new malware or employing variant technologies to avoid detection by antivirus software. As a result, the number of malware families grows year after year, which increases the time complexity and space complexity of analyzing malware variants. A large amount of training data is likely to contain redundant and noisy data, which increases the complexity of the learning model and requires a large amount of training time [12], [13]. These issues have been addressed by selecting a small set of prototypes that can replace the original dataset [13], [14]. Classification based on a set of prototypes can therefore achieve performance comparable to that of the entire dataset while eliminating inessential data and reducing time complexity [15], [16].
Prototype selection methods employ similarity metrics and class labels from the dataset [17], [14]. The similarity among instances is measured using Euclidean distance, Manhattan distance, Mahalanobis distance, and so on [18]. A prototype represents a subset of instances that are placed at a constant distance in the same class. A selected prototype covers as many instances of the same class as possible and becomes a new training instance for classification models. Previous research has employed hyperspheres [15], [19], [20], and hyperrectangles [21] as prototype selection approaches to divide multidimensional space into subspaces and select a small subset of instances to replace the entire dataset.
This study proposes a low-dimensional feature representation of fixed size based on histogram entropy, as well as a prototype selection method for large-scale malware datasets based on hyperrectangles. The contributions of this study are as follows:
• A two-dimensional (2D) histogram entropy map is designed to characterize malware for statistical analysis. The feature is a low-dimensional feature extraction method based on entropy information and a fixed size.
• A prototype selection method is proposed on the basis of hyperrectangles that selects a small set of prototypes which machine learning algorithms can learn instead of the entire dataset.
• The process of extracting features can be visualized to identify key patterns for malware detection.
• Experimental results show that machine learning algorithms achieve comparable performance with only a relatively small new dataset generated from the entire dataset using our prototype selection method.
This paper is organized as follows. Section II discusses related work. Section III addresses data collection and feature extraction methods. Section III-D proposes the prototype selection algorithm. Section IV evaluates learning models for identifying malware. Finally, Section V concludes this paper with future works.

II. RELATED WORK
Feature engineering for malware static analysis makes use of opcode or byte data from executable binaries. Feature extraction using opcode must include a disassembly phase, but this phase has the limitation that packed and obfuscated parts may result in invalid and incorrectly assembled code [2], [3]. Alternatively, an entropy based feature representation has been chosen to quantitatively compare the entire structure of malware at the byte level [22], [23], [24], [25], [26], [27].
Various representations of static features have been proposed for malware detection: n-gram byte features, entropy or hashing features for binaries, and n-gram opcode features, DLL call features or API call graph features from assembly code. However, when the feature space of such representations becomes too large, the feature vector size varies, which tends to make feature engineering more difficult. Efforts have been made to convert malware into fixed size data to make malware features robust. Examples include a 2D grayscale image, window entropy map [27], histogram entropy map [23], and hashing based map [28]. The values calculated by applying a sliding window over an executable file were integrated to represent malware features. Table 1 summarizes the studied approaches including classification type (Class), feature type, detection model and the details on the datasets in use. The datasets are drawn from publicly known datasets or are self-collected. For example, there are Microsoft Malware Classification Challenge (BIG 2015, [29]), Malica-project [30], Virus Total [31], Vx Heaven [32], VIPRE [33], MalImg [34], Malwares [35], etc. The classification type is defined as a binary classification for malware detection or as a k-class classification for malware family identification. The feature type is categorized into one of data, entropy, or image driven feature engineering.
Data driven feature engineering was studied by Burnap et al. [36] and Fan et al. [37]. A dynamic analysis of the data collected through the Cuckoo sandbox [43] was performed, and the feature vectors were made up of file access log, registry key access, process execution, packet log, and usage patterns of CPU and memory [36]. There have also been studies that divide the data into a certain number of chunks and run binary codes using clustering and classification to analyze it dynamically in order to reduce overhead with a large amount of malware data [41]. The winner of the SOFM (Self-Organizing Feature Map) model for the input feature vector was selected, and its class was predicted as the class closest to the BMU (Best Matching Unit).
Researchers investigated entropy driven detection because malware identification becomes difficult due to encryption, packing, obfuscation and polymorphism [9], [44], [39]. The entropy values of malware belonging to different malware families tend to differ significantly. Lyda et al. [9] suggested an entropy analysis method that examines the statistical difference among executables. They utilized a confidence-interval based method by calculating the amount of statistical variation of bytes in a data stream and summing the frequency of each observed byte value in a fixed length data block. They found that higher entropy values tend to correlate with the presence of encryption or packing.
Sorokin et al. [44] proposed a structural entropy approach that divides files into segments: executable code, text, and packed area. Each segment was characterized in terms of size and homogeneity by entropy information. First, wavelet analysis was used to divide the file into segment sequences of varying entropy levels. The next step detected malware by calculating the Levenshtein distance between segment sequences to determine the degree of similarity. Han et al. [39] converted PE (Portable Executable) files into bitmap images and compared the tendency of entropy changes. Their analysis identified the malware family by comparing the similarity between the entropy graph of the test malware and that of a previously known malware family. The database consisted of 1,000 malware samples of 50 families from Vx Heaven, and the accuracy was approximately 98.0% when the threshold was 0.75.
Nataraj et al. [22] reshaped malware binaries into 8-bit grayscale images based on their file size range. The grayscale image was converted into GIST feature vectors by using the Gabor filter to compute local feature maps. All of the local feature maps were combined into a single GIST feature, which was then downsampled to a fixed size training instance. Using k-NN among MalImg's GIST feature vectors, they reported a detection rate of 97.2% for malware family identification.
Ni et al. [28] proposed the MCSC (Malware Classification using SimHash and CNN) approach. Because it took a long time to extract all opcodes as features, they decreased feature extraction time by selecting the main blocks only. The main code block tends to include malware behavior information as well as the CALL instruction. The opcode sequence differs depending on the size of the malware file and is hashed to generate a binary vector of fixed size. The sum of weights of all binary vectors in the sequence is then calculated, and the weight sum vector is converted into a 16 × 16 image. MCSC reported an accuracy of about 87.0% for the CNN model with BIG 2015.
Dey et al. [26] proposed a detection method that improves Nataraj's image driven algorithm with entropy filtering for 2D image transformation [22]. The variants produced by metamorphic engines can avoid detection by anti-virus programs based on signatures through primary obfuscation techniques that disguise malicious commands. This method, however, leaves suspicious patterns at the bit level, which can be identified through entropy calculations. The local entropy value of the gray image determines the structure of entropy filtering in response to an entropy image. The k-NN classifier experiment produced slightly better results than Nataraj's method [22].
Hu et al. [42] proposed MutantX-S, which uses opcode for static analysis to compensate for the limitation of dynamic analysis. Their static-feature-based approaches are far more scalable than their dynamic-feature-based approaches. They converted malware binaries into an opcode sequence, allowing n-gram features to be extracted more quickly. With a linkage clustering and a prototype-based nearest classification [41], Rieck et al. addressed the scalability issues in terms of run-time performance and memory requirement. Their incremental approach was proposed for behavior-based analysis of malware classifications, which could handle the behavior of thousands of malware samples per day.
Multiple features for malware detection were proposed by Ahmadi et al. [24], Saxe et al. [23], and Euh et al. [27]. These features were built using data, entropy, and images. Ahmadi et al. [24] proposed malware family detection with combined features from hex dump-based features, assembled code features and entropy images. They applied XGBoost [45] to BIG 2015 through 5-fold cross-validation. Each independent feature demonstrated 75.6% to 99.1% accuracy, and the entire collection of features, including the entropy feature, demonstrated approximately 99.8% accuracy. Saxe et al. [23] designed a four-layer neural network (1024 × 1024 × 1024 × 1) to distinguish malware from benign files. The final feature was composed of byte entropy histogram, PE import and metadata, and string data. The prediction results of the learned model were calibrated through the Bayesian method. The detection rate was 95.0% for the integrated feature vectors of PE import, byte entropy, metadata, and strings. Euh et al. [27] employed tree ensemble models for 2-gram, gram matrix, WEM (Window Entropy Map), API-DLL, and API features from executable and disassembled files. Their features were designed to reduce the original feature dimensionality and decreased the time complexity of ensemble models. For each proposed feature, they compared the performance of AdaBoost, XGBoost, Random Forest, Extra Trees, and Rotation Trees. WEM's XGBoost performed best with 98.0% in terms of accuracy and AUC-PRC evaluation.

A. MALWARE DATASET
Our proposed method is evaluated with BIG 2015 and the Malwares dataset, whose total size is about 115 GB. Each instance includes its own assembly code and binary file. Tables 2 and 3 show the number of instances and information on the test datasets. Each malware family contains at least 42 (0.4%) and at most 2,942 (27.1%) instances (Table 2). The dataset for malware and benign classification was collected from Malwares.com [35]. The number of malware files is 65,704 (76.7%) and the number of benign files is 20,000 (23.3%) (Table 3). The benign dataset is also used for a malware detection problem with the BIG 2015 dataset.

B. HISTOGRAM ENTROPY
For malware vectorization, a 2-gram feature has shown excellent performance but tends to need a high-dimensional vector to represent a single malware [24], [38], [27]. If the data dimension increases, the input space increases proportionally, resulting in a sparse distribution. Additionally, the number of model parameters increases, and a training dataset should consist of sufficient instances in order to construct a robust learning model. We design a low-dimensional feature using histogram entropy information of byte sequences. A fixed length, low-dimensional malware vectorization reduces training model complexity, helps prevent overfitting, and is expected to yield high generalization performance. Figure 1 illustrates the process of generating our histogram entropy feature from an executable by applying a sliding window and computing histogram frequency and entropy. Figure 1 (a) is the 2D image of an Obfuscator.ACY instance, which is shaped as $N \times L$ by applying sliding window size $L$ and stride size $s$. Figure 1 (b) is the same representation as Figure 1 (a) in hexadecimal. The actual size of the input image is 1,469,952 × 1,024. The $k$-th window is represented by the vector $w_k$.
The bin entropy $e_j^{(k)}$ of the $j$-th bin of the $k$-th window is calculated by the Shannon entropy.
For every bin entropy $e_j^{(k)}$, a mapping returns the coordinates where $e_j^{(k)}$ will be added. Therefore, the malware representation becomes a 2D array. $M$ represents the degree of uncertainty accumulated over the bins (horizontal direction) and $L$ (vertical direction) to construct a 2D map of an executable file with a fixed size. In addition, the distribution in the vertical direction expresses the change by level along the horizontal direction. A fixed size feature configuration is a precondition for applying various machine learning algorithms.
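The construction described above can be sketched in a few lines. This is a minimal illustration and not the exact feature of this work: the function name `histogram_entropy_map`, its default parameters, the per-bin entropy term $-p \log_2 p$, and the rule that quantizes each bin entropy into one of `levels` rows are all assumptions, since the mapping from a bin entropy to its coordinates is not reproduced in the text.

```python
import math


def histogram_entropy_map(data: bytes, window: int = 1024, stride: int = 256,
                          levels: int = 4):
    """Return a levels x 256 map of accumulated bin entropies (assumed layout)."""
    m = [[0.0] * 256 for _ in range(levels)]
    # Largest possible value of -p*log2(p), reached at p = 1/e.
    e_max = math.log2(math.e) / math.e
    last_start = max(len(data) - window, 0)
    for start in range(0, last_start + 1, stride):
        chunk = data[start:start + window]
        if not chunk:
            break
        # Byte histogram of the current window.
        counts = [0] * 256
        for b in chunk:
            counts[b] += 1
        for j, c in enumerate(counts):
            if c == 0:
                continue
            p = c / len(chunk)
            e = -p * math.log2(p)  # assumed bin entropy of byte value j
            # Assumed mapping: quantize the entropy into one of `levels` rows.
            level = min(int(e / e_max * levels), levels - 1)
            m[level][j] += e
    return m


# Usage: one window in which every byte value appears equally often.
feature = histogram_entropy_map(bytes(range(256)) * 4, window=1024, stride=1024)
```

Whatever the file size, the result has a fixed shape of `levels` × 256, which is the property the feature relies on.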

C. HISTOGRAM ENTROPY VISUALIZATION
Visualization provides one way to identify key patterns in analyzing malware. The analysis phase considers the number, location, and shape of peaks appearing on the histogram and places high weights on the largest peaks [46]. There is a continuous and distinct change in the entropy histogram along the x-axis. The values decrease at first, then increase in the middle, and tend to decrease gradually afterward. It is evident that the changes between neighboring bins gradually decrease or increase. The histogram changes within the same malware family show their own unique patterns, and these patterns are very similar. This is because the change is reinforced when the degree of disorder of the Shannon entropy is high, whereas the change at low entropy is poorly expressed. The entropy change of Obfuscator is distinct from that of other malware families, and the change of entropy value is expressed at a high level. That is, the change in entropy along the x-axis takes the form of a continuous increase or decrease, and high peak values do not appear. In the patterns of other malware families, high peak values appear repeatedly, and there are patterns showing low changes in other parts. Simba and Vundo show similar patterns, but Vundo displays repeated and fluctuating patterns within some range.

D. PROTOTYPE SELECTION APPROACH
To solve problems arising from learning through large-scale malware, we propose a prototype selection algorithm via building hyperrectangles. A hyperrectangle is determined as a partial area within the homogeneous class distribution and includes the same class instance. The selected set of prototypes preserves the class distributions and constructs a new training dataset.

2) Hyperrectangles embedding homogeneous data
A hyperrectangle takes an area of the input space which includes some training instances in $D$. A hyperrectangle is usually defined with $d$ coordinate points. Alternatively, a hyperrectangle $h$ is represented only with the maximum and minimum coordinates and an instance index set: $h = \langle h^{\max}, h^{\min}, I \rangle$. The distance between $x \in \mathbb{R}^d$ and $h$ is calculated by Equation (1).
Here, $mid = 0.5(h^{\max} + h^{\min})$ and $r = 0.5(h^{\max} - h^{\min})$. The index of $x$ is appended to $I$ if $dist(x, h) \le 0$. Otherwise, $x$ lies outside the hyperrectangle $h$. $dist(x, h)$ thus becomes a distance measure that determines whether $x$ is located within the input space represented by $h$.
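A short sketch shows how a distance built from $mid$ and $r$ behaves. Since the body of Equation (1) is not reproduced here, the form used below, the largest per-coordinate value of $|x_k - mid_k| - r_k$, is an assumption chosen only because it is non-positive exactly when $x$ lies inside $h$; the function name `dist` and its argument layout are likewise illustrative.

```python
def dist(x, h_max, h_min):
    """Assumed distance: max over coordinates of |x_k - mid_k| - r_k.

    The value is <= 0 when x lies inside (or on the border of) the
    hyperrectangle spanned by h_min and h_max, and > 0 otherwise.
    """
    worst = float("-inf")
    for xk, hi, lo in zip(x, h_max, h_min):
        mid = 0.5 * (hi + lo)  # center of the hyperrectangle on this axis
        r = 0.5 * (hi - lo)    # half-width of the hyperrectangle on this axis
        worst = max(worst, abs(xk - mid) - r)
    return worst


# Usage with the unit square as the hyperrectangle.
inside = dist((0.5, 0.5), (1.0, 1.0), (0.0, 0.0))   # non-positive: inside
outside = dist((1.5, 0.5), (1.0, 1.0), (0.0, 0.0))  # positive: outside
```

Under this assumed form, a threshold θ > 0 admits points that lie up to θ beyond the current border on any axis.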
Hyperrectangles separate the input space into smaller regions, where each region contains some instances within the same class. Two hyperrectangles can overlap or include the same instances. Let s(h|D) stand for a covering set of a hyperrectangle h from D.

3) Prototype selection algorithm
The goal of the prototype selection algorithm (PSA) is to find a small set $H_{opt}$ from $H$ ($|H_{opt}| \ll |H|$) satisfying $D = \bigcup_{h \in H_{opt}} s(h|D)$. After generating $H$, finding $H_{opt}$ becomes a set covering problem [47], so a greedy approach is generally chosen to find the solution $H_{opt}$.
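The generic greedy heuristic for a set covering problem can be sketched as follows. This is the textbook heuristic rather than the procedure of this work; the function name `greedy_set_cover` and the dictionary layout of the covering sets are illustrative assumptions.

```python
def greedy_set_cover(universe, covers):
    """Pick candidates until every element of `universe` is covered.

    covers: dict mapping a candidate id to the set of instance indexes it covers.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered instances.
        best = max(covers, key=lambda h: len(covers[h] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            raise ValueError("candidates do not cover the whole dataset")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Usage: four candidate covering sets over six instance indexes.
selected = greedy_set_cover(
    range(6),
    {"a": {0, 1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {0, 5}},
)
```

Because this heuristic needs every candidate covering set in advance, its cost grows with $|H|$, which is one reason to build hyperrectangles incrementally instead.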
For given $D$ and θ, instead of generating $H$, an improved greedy method of finding a solution is adopted by constructing hyperrectangles one by one. PSA gradually approaches the final solution $H_{opt}$ through random selection. A randomly selected instance expands its coverage area while finding and storing the indexes of instances of the same class that lie within the distance given by Equation (1). Algorithm 1 is the pseudocode of PSA. The input parameters in PSA($D$, θ) are the training dataset $D$ and parameter θ, and the output is the set of hyperrectangles. A random number is used to shuffle the order of the instances in $D$. $H$ denotes the set of hyperrectangles to be constructed as a solution and is initially empty. $C$ is the set of instance indexes existing in the hyperrectangle set $H$. If $i \in C$, $(x_i, y_i) \in D$ has already been covered by a certain $h \in H$. Initially, $C$ is the empty set.

Algorithm 1 Prototype selection algorithm
procedure PSA(D, θ)
    Shuffle the order of the instances in D
    H ← ∅, C ← ∅
    for (x_i, y_i) ∈ D do
        if i ∉ C then
            // Generate a new hyperrectangle
            h ← <x_i, x_i, {i}>
            for (x_j, y_j) ∈ D and i ≠ j do
                if y_j = y_i and j ∉ C and dist(x_j, h) ≤ θ then
                    h^max ← max(h^max, x_j), h^min ← min(h^min, x_j)
                    I ← I ∪ {j}
                end if
            end for
            H ← H ∪ {h}, C ← C ∪ I
        end if
    end for
    return H
end procedure
The outer loop selects a candidate hyperrectangle index, and the instance indexes that $h \in H$ covers are added one by one in the inner loop. If the index $i$ of $(x_i, y_i) \in D$ in the outer loop does not belong to $C$, the inner loop starts to compose a new hyperrectangle $h$.
The new hyperrectangle has an index set $I_i$ that includes the index $i$. The inner loop expands the coverage area of $h$ by searching for a new $(x_j, y_j)$ that has not been included yet. The selected $(x_j, y_j)$ satisfies $j \notin C$ and $dist(x_j, h) \le \theta$, and at the same time $h_i^{\max}$ and $h_i^{\min}$ are updated through elementwise operations. As a new element of $h$, the index $j$ is added to $I_i$.
When $h$ has been created from all $j \notin C$ with $i \ne j$ that have not been covered yet, the inner loop terminates, $h$ becomes a member of $H$, and every element of $I_i$ is added to $C$. The same process is repeated for the instance selected in the next outer loop. If the size of $C$ is equal to $|D|$, the algorithm returns $H$ as the final solution.
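The loop structure described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the distance uses an assumed form (the largest per-coordinate value of $|x_k - mid_k| - r_k$) because Equation (1) is not reproduced here, each new hyperrectangle is assumed to start as the single point $x_i$, and the names `psa`, `dist`, and `seed` are illustrative.

```python
import random


def dist(x, h_max, h_min):
    """Assumed distance: max over coordinates of |x_k - mid_k| - r_k."""
    return max(abs(xk - 0.5 * (hi + lo)) - 0.5 * (hi - lo)
               for xk, hi, lo in zip(x, h_max, h_min))


def psa(data, theta, seed=0):
    """data: list of (x, y) pairs, x a tuple of floats and y a class label.

    Returns a list of hyperrectangles (h_max, h_min, member_indexes).
    """
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)  # random visiting order of D
    covered = set()                     # the set C of covered indexes
    rects = []                          # the set H of hyperrectangles
    for i in order:
        if i in covered:
            continue
        x_i, y_i = data[i]
        # Assumed initialization: a degenerate hyperrectangle at x_i.
        h_max, h_min, members = list(x_i), list(x_i), [i]
        for j in order:
            x_j, y_j = data[j]
            if j == i or j in covered or y_j != y_i:
                continue
            if dist(x_j, h_max, h_min) <= theta:
                # Elementwise expansion of the coverage area.
                h_max = [max(a, b) for a, b in zip(h_max, x_j)]
                h_min = [min(a, b) for a, b in zip(h_min, x_j)]
                members.append(j)
        rects.append((h_max, h_min, members))
        covered.update(members)
        if len(covered) == len(data):
            break
    return rects


# Usage: two well-separated classes collapse into two hyperrectangles.
toy = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.0, 0.1), 0),
       ((10.0, 10.0), 1), ((10.1, 10.0), 1)]
prototypes = psa(toy, theta=0.5)
```

Only instances of the same class are ever added to a hyperrectangle, so every index set is homogeneous, although the region itself may still overlap other classes.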
A new training dataset is generated from $H$. The training instance of $h = \langle h^{\max}, h^{\min}, I \rangle \in H$ is taken as the mean or median of all instances in $I$. The mean divides the sum of all instances in $h$ by the number of elements. Meanwhile, the median of $h$ is the coordinate average of $h^{\max}$ and $h^{\min}$. A new instance has a one-to-one correspondence with its own hyperrectangle and cannot be placed outside the subregion defined by $h$. Therefore, the distribution of the new dataset created by $H$ is comparable to that of the original dataset. In addition, the class boundaries induced by a machine learning algorithm from the new dataset become similar to those learned from the original dataset.
By dividing the input data space via hyperrectangles, a small number of new training data can be generated while maintaining the distribution of class data. The total number of new training instances is equal to the number of elements in $H$. Moreover, the size of the new training dataset is affected by θ. When class instances are mixed and distributed, the number of selected hyperrectangles increases, whereas when class areas are kept isolated, the number of hyperrectangles tends to decrease. The preselected θ is also a factor in determining the number of selected hyperrectangles and their coverage areas.
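The two ways of turning a hyperrectangle into a training instance can be sketched as follows. The function names are illustrative assumptions; the second function computes what the text calls the median, that is, the coordinate average of $h^{\max}$ and $h^{\min}$.

```python
def prototype_mean(instances):
    """Coordinate-wise mean of the instances covered by one hyperrectangle."""
    n = len(instances)
    return [sum(column) / n for column in zip(*instances)]


def prototype_midpoint(h_max, h_min):
    """Coordinate average of h_max and h_min (called the median in the text)."""
    return [0.5 * (hi + lo) for hi, lo in zip(h_max, h_min)]


# Usage: three covered instances whose bounding box is [0, 4] x [0, 6].
covered = [(0.0, 0.0), (2.0, 0.0), (4.0, 6.0)]
by_mean = prototype_mean(covered)                        # [2.0, 2.0]
by_midpoint = prototype_midpoint((4.0, 6.0), (0.0, 0.0))  # [2.0, 3.0]
```

Both results stay inside the bounding box of the covered instances, which matches the statement that a new instance cannot fall outside its own hyperrectangle.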

4) Algorithm comparison
We compare the performance of PSA with that of prototype selection algorithms using hyperspheres. Interpretable prototype selection [15] (IPS) constructs hyperspheres that divide class areas using a distance measure and a fixed radius. IPS proposes an optimization technique for selecting a small number of prototypes containing all possible training data. The technique employs a stepwise algorithm that transforms the prototype selection problem into a set cover optimization problem and selects prototypes from each class independently. However, the prototypes contain instances of other classes, and the radius of the hyperspheres is preselected through prior experiments. Prototype based learning [48] (PBL) adjusts the radii of prototypes by taking into account the classes of instances which a prototype can cover. PBL does not include heterogeneous instances within potential prototypes, as it manages the radii of hyperspheres in constructing covering sets. Figure 6 shows examples of the prototype selection algorithms. The data in this experiment are randomly generated, with 900 instances in total. Figure 7(a) is an example of selected prototypes with a fixed radius (r = 0.1). A total of 56 prototypes are selected. Figure 7(b) shows hyperspheres of variable radii. The number of data within the prototype domain is more than one, with a total of 110 prototypes selected. No prior definition of radius is required because the radius is set by considering the different classes of data within the region of each prototype. Figure 7(c) is an example of our hyperrectangle based prototype selection (HRPS), where 47 prototypes are chosen with θ = 0.4. Figure 7 compares IPS, PBL and HRPS in terms of θ, data size and time complexity. The test problem consists of three classes of randomly generated 2D data, with the data size varied between 300 and 3,000. Figure 8(a) compares the number of prototypes selected as θ changes.
As θ increases from 0.1 to 1.0, the coverage area of a hyperrectangle expands, resulting in fewer prototypes selected. When a small θ is set, a large number of prototypes are selected, and the maximum number of selected prototypes is equal to the size of the training dataset. A new training dataset, consisting of a small number of prototypes, should be constructed while reflecting the distribution of the input data space. Therefore, it is necessary to find an appropriate θ value in order to improve generalization performance. Figure 8(b) compares the execution time. IPS and HRPS take less time than PBL, because PBL must search for hyperspheres with variable radii, which requires much computation time. HRPS has a prototype selection time similar to that of IPS, and its runtime remains nearly constant even as the number of instances increases. In dividing the class input space, the method of extending the coverage area of IPS is simpler than that of HRPS.

IV. EXPERIMENT
Model experiments were performed on a computer with Intel Xeon(R) Silver 4120 CPUs and an Nvidia GPU. The computer has 256 GB of main memory and 2 CPUs at 2.20 GHz. The GPU model is Tesla V100 with 32 GB of memory. The GPU was used to compare the training time of CNN models.
The proposed malware feature is evaluated with the Decision Tree (DT, [50]), Random Forest (RF, [51]), XGBoost (XGB, [45]) and CNN [52] algorithms. The final learning model was determined through 5-fold cross-validation on the BIG 2015 dataset. We adopted the DT and RF models from scikit-learn [53], XGB from XGBoost [54] and CNN from Keras [55]. Each cross-validation experiment was repeated 50 times, and the averages of the metrics below were analyzed for an objective comparison.

A. ASSESSMENT METRICS
The chosen metrics of the malware detection system are accuracy, recall, precision, balanced accuracy and F1-score under 5-fold cross-validation [49]. The predictive result of each malware family was evaluated and their average was analyzed for the overall performance. $N$ is the size of the test dataset, $l \in \{1, \dots, c\}$ is the class label, and $N_l$ is the number of instances in class $l$. Let $\hat{y}_i$ be the prediction result for the $i$-th instance of the test dataset; $c = 9$ for the BIG 2015 dataset. Each evaluation metric is defined as follows.
Accuracy measures how correctly a model predicts test instances where the basic unit is a single instance. Each unit is weighted equally to the model accuracy.
A c-class classifier tends to focus more on learning the majority class data than the minority class data. This makes objective evaluation difficult when the class data is imbalanced. Balanced accuracy can alleviate this problem and is equal to the sum of the per-class proportions of correctly predicted instances divided by the number of classes. This metric is less sensitive to the majority class and gives high weight to data from minority classes. Therefore, a difference between balanced accuracy and accuracy appears when the test dataset shows an imbalanced distribution over the classes.
When evaluating a c-class classifier, the precision and recall of class $l$ are computed from the prediction result.
Precision $pre_l$ of class $l$ is the ratio of the number of instances correctly predicted as class $l$ to the number of instances predicted as class $l$. Recall $rec_l$ of class $l$ is the proportion of correctly predicted instances to the total number of instances in class $l$. Precision indicates the correctly predicted proportion of the predicted class data, while recall indicates the correctly predicted proportion of the actual class data. Together they evaluate how well a specific class is classified, both against the instances of the same class and against those of other classes. When both precision and recall are close to 1, the generalization performance of the training model is highly regarded. The overall precision and recall of a c-class classifier are calculated as follows.
F1-score is the harmonic mean of all the precision and recall values. F1-score calculates the overall average of precision and recall, since the numerator consists of values in the range [0, 1]. This implies that the influence of the majority class has the same importance as that of the minority class. A high F1-score indicates that the predictive model has good performance, whereas a low F1-score means that it is a poor model.
The DT structure was decided through preliminary experiments on a subsample of the dataset drawn at random with a sampling ratio of 30.0%. While deciding the DT structure, the tree depth was increased from 5 to 30 in steps of 2. The node splitting criterion of DT uses Shannon's entropy, and the minimum number of instances in a node is set to 4. Internal nodes were split only if they contained more than 10 instances. When splitting nodes, the same number of features is checked.
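The assessment metrics described in this subsection can be sketched in plain Python. This is a minimal sketch under assumptions: overall precision and recall are taken as unweighted averages over the classes, and the F1-score is computed as the harmonic mean of those two overall values; the function name `evaluate` and the dictionary keys are illustrative.

```python
def evaluate(y_true, y_pred, classes):
    """Accuracy, balanced accuracy, and class-averaged precision/recall/F1."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    precisions, recalls = [], []
    for l in classes:
        tp = sum(t == l and p == l for t, p in zip(y_true, y_pred))
        n_l = sum(t == l for t in y_true)     # instances that belong to class l
        pred_l = sum(p == l for p in y_pred)  # instances predicted as class l
        recalls.append(tp / n_l if n_l else 0.0)
        precisions.append(tp / pred_l if pred_l else 0.0)
    pre = sum(precisions) / len(classes)
    rec = sum(recalls) / len(classes)
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return {
        "accuracy": correct / n,
        # Mean per-class proportion of correctly predicted instances.
        "balanced_accuracy": rec,
        "precision": pre,
        "recall": rec,
        "f1": f1,
    }


# Usage on an imbalanced two-class toy problem.
scores = evaluate([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1], classes=[0, 1])
```

In this toy case accuracy is 5/6 while balanced accuracy is 0.875, which illustrates how the two diverge once the class sizes differ.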
RF and XGB employed 100 decision trees, where each tree structure was the same as that of DT. Similarly, the number of DTs was chosen through preliminary experiments by varying it from 50 to 200 in steps of 10. When learning RF, each DT was trained with only 80% of the selected data. The learning rate of XGB was set to 0.05 and the conventional gradient decision tree was chosen. To avoid overfitting, the sample ratio in decision tree construction was 0.7, implying that XGB randomly selects 70.0% of the training data prior to growing trees. Following the related works, the CNN architecture consists of 7 layers as shown in Figure 8. There are 3 convolution layers, one max pooling layer and 3 fully connected layers. The 5th and 6th layers are composed of ReLU (Rectified Linear Unit) nodes, and the nodes of the output layer adopt a softmax activation function. The maximum number of training epochs was 100 and the mini-batch size was 256. The input layer of the fully connected neural network uses a 30.0% dropout strategy to prevent overfitting.
The shape of the proposed 2D feature representation depends on θ and L. To determine the optimal θ and L values, the grid search method was used for repeated experiments. L changes the discrete value by increasing by 1 from 1 to 20, and θ decreases from 1 to 10 −5 . For a given L, θ is set to θ = 2 − k 2 × 5 k 2 −k , k = 0, 1, . . . , 20. As L increased, the continuous improvement was analyzed, but after L ≥ 6, the performance improvement of all models was insignificant. For each L, when θ = 1, the prediction performance showed less than 50%. After θ = 10 −5 , there was no performance improvement, and all training data were determined as prototypes. Therefore, the L values of 1, 2, 4, and 6 were chosen for the visualization analysis. Table 4 and Figure 9 compare the performance of the learning models when θ = 0.01 and L changes. As L changes from 1 to 6, the performance of each model tends to increase. For XGB, RF, and CNN, all the metrics are increasing as L increases. Furthermore, at L = 1, these learning algorithms achieve above 96.0% and reach 100.0% at L = 6. However, in DT, as the level of L increases, the accuracy index shows a tendency to increase, , but the changes in recall, precision, and balanced accuracy were observed. When L = 6, all mod-els achieve their best performance, which is around 98.5% for all evaluation metrics except the DT model. Overall, RF and XGB outperform DT, however the difference in performance is just approximately 5.0% at most.
In terms of precision, recall, and F1-score, CNN scores slightly higher than RF and XGB. Comparing generalization performance, CNN differs from DT by about 2% to 5% and from RF and XGB by about $10^{-2}$. The difference between CNN, RF, and XGB is negligible at all L values.
On the other hand, CNN requires significantly more time to train than the other algorithms (Figure 9 (f)). XGB has the second-longest training time, but the training time of CNN is more than 10 times that of XGB except when L = 1. DT has the shortest training time, and RF requires twice as much time as DT.
We evaluated both CPU-based CNN (column CNN in Table 4) and GPU-based CNN (column GPU in Table 4). The two variants produced similar evaluation results, but training on the GPU was approximately 9 to 18 times faster. Furthermore, GPU-based CNN trained faster than XGB. As a result, in terms of both metrics and training time, GPU-based CNN outperformed XGB.
In the experimental evaluation, the decision-tree-based ensemble approaches show slightly lower performance than CNN but much higher performance than DT. The generalization performance of RF, CNN, and XGB is robust across the evaluated scales. We found that RF and XGB are more effective for malware classification using the 2D histogram entropy because of their low time complexity and high generalization performance.
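For reference, the metrics compared here can be computed from true and predicted family labels as in the following sketch. Macro averaging over families is an assumption, since the text does not state the averaging scheme; balanced accuracy is taken as the mean per-class recall. The function name and the toy labels are illustrative only.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Accuracy, macro precision/recall/F1, and balanced accuracy."""
    classes = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def ratio(num, den):
        return num / den if den else 0.0

    precision = [ratio(tp[c], tp[c] + fp[c]) for c in classes]
    recall = [ratio(tp[c], tp[c] + fn[c]) for c in classes]
    f1 = [ratio(2 * p * r, p + r) for p, r in zip(precision, recall)]
    n = len(classes)
    return {
        "accuracy": ratio(sum(tp.values()), len(y_true)),
        "precision": sum(precision) / n,
        "recall": sum(recall) / n,
        "f1": sum(f1) / n,
        "balanced_accuracy": sum(recall) / n,  # mean per-class recall
    }


# Toy labels, not taken from the experiments.
y_true = ["Ramnit", "Ramnit", "Simda", "Gatak", "Gatak", "Gatak"]
y_pred = ["Ramnit", "Gatak", "Simda", "Gatak", "Gatak", "Ramnit"]
print(macro_metrics(y_true, y_pred))
```

With macro averaging, a small family such as Simda weighs as much as a large one, which is why balanced accuracy can move differently from plain accuracy.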

C. MODEL COMPARISON FOR MALWARE DETECTION
The binary classification for malware detection was conducted on both the 2-class BIG and 2-class Malwares datasets. The 2-class BIG dataset consists of about 85,000 malware and benign samples, as listed in Table 3. Because the BIG 2015 Malware Challenge dataset does not include benign examples, we added the benign samples of the Malwares dataset to define the classification problem. The 2-class Malwares dataset consists of 30,868 malware and benign samples.
The prototype selection rate for 2-class BIG was about 70% (58,878 prototypes), obtained at θ = 0.01. This is the same θ value found in the malware family experiment; the addition of benign data did not affect the optimal θ value. For the 2-class Malwares dataset, about 49.11% (14,816) of the instances were selected as prototypes when θ = 0.0001. Table 5 compares the experimental results of 2-class BIG and 2-class Malwares. The 2-class BIG results use L = 6, and the 2-class Malwares results use L = 4. For 2-class BIG, the performance of all models, including DT, was higher than 99%. In particular, CNN using GPU approaches 100% on all performance indicators but requires about 10 times more training time. For 2-class Malwares, the precision, recall, and F1-score of DT do not reach 90%, but the ensemble models approach 95%. CNN's precision is about 90%, but its recall is 88%. CNN required several times the training time due to the large amount of training data. In both experiments, XGB shows higher performance than the other models and appears to be the more robust model for malware detection problems.
The precision-recall (PR) curve shows how precision and recall for the malware class vary across thresholds, whereas the ROC curve examines the trade-off between false positive and true positive rates over the malware and benign instances in the test dataset. Figure 10 shows the PR AUC and ROC graphs for 2-class BIG and 2-class Malwares. The results indicate that the proposed method is effective for malware detection because it shows no sign of overfitting or of class-imbalance effects. For both 2-class problems, DT showed the lowest AUC, and the remaining models ranked in the order XGB, RF, and CNN from highest to lowest. This trend was similar to that of malware family detection. The PR AUC and ROC AUC of 2-class BIG are close to 1.0 for XGB, RF, and CNN. For 2-class Malwares, the PR AUC was highest in the order of XGB, RF, and CNN, and the ROC graph shows the same trend.
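The two summary scores discussed here can be sketched in a few lines. The code below is a minimal illustration, assuming binary labels (1 = malware, 0 = benign) and distinct classifier scores; the function names and toy data are hypothetical. ROC AUC is computed as the probability that a malware sample is scored above a benign one, and the PR area is approximated by average precision.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (distinct scores assumed)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy scores, not taken from the experiments.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(labels, scores), average_precision(labels, scores))
```

The pairwise ROC computation is quadratic in the number of samples, so it is suitable only as an illustration; a rank-based implementation would be preferable at the scale of the datasets used here.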

D. ANALYSIS OF THE EFFECT OF PROTOTYPE SELECTION
A new dataset generated by prototype selection is analyzed for its suitability using RF as a case study. The model parameters were the same as in Subsection IV-B. The size of the new training dataset is determined by the number of prototypes, which in turn is controlled by the parameter θ. Therefore, we compare the size of the new datasets with RF performance as θ and L change. This evaluation reveals the relationship between the number of prototypes and the performance of the learning model.
Without loss of generality, new datasets were generated by scaling the training data to the range [0, 1] and decreasing θ from 0.5 to 0.00005. We compared prototype selection ratios and the accuracy of RF as L and θ change (Figure 11). Because the volume of the hyperrectangles decreases as θ decreases, the number of selected prototypes approaches the size of the original dataset. For L = 6, as θ decreases from 0.5 to 0.01, the accuracy increases along with the number of selected prototypes. As θ decreases further from 0.005 to 0.00005, the number of prototypes continues to increase, but the effect on accuracy is insignificant. Considering both the number of selected prototypes and generalization performance, the best case occurs at L = 6 and θ = 0.01. Figure 12 compares the performance of malware family detection by RF when θ = 0.01 and L = 6. The average prototype selection rate for each malware family is less than 40.0%. The detection rate of Simda is 87.5%, whereas the detection rate of every other family exceeds 95.0%.
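The link between θ and the number of prototypes can be illustrated with a simplified stand-in for the selection step. The sketch below is not the algorithm evaluated in this paper: it assumes each training point defines an axis-aligned hyperrectangle of half-width θ that covers the same-class points inside it, and it picks prototypes with a plain greedy set cover. The function name and toy data are hypothetical.

```python
def select_prototypes(X, y, theta):
    """Greedy set cover with axis-aligned hyperrectangles of half-width theta.

    X is a list of points scaled to [0, 1]; y holds the class labels.
    Returns the sorted indices of the selected prototypes.
    """
    n = len(X)
    # covers[i]: same-class points inside the hyperrectangle centred on X[i]
    covers = [
        {
            j
            for j in range(n)
            if y[j] == y[i]
            and all(abs(a - b) <= theta for a, b in zip(X[i], X[j]))
        }
        for i in range(n)
    ]
    uncovered = set(range(n))
    prototypes = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered points.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        prototypes.append(best)
        uncovered -= covers[best]
    return sorted(prototypes)


# Toy data, not taken from the experiments.
X = [(0.10, 0.10), (0.12, 0.11), (0.15, 0.09),
     (0.80, 0.82), (0.83, 0.80), (0.50, 0.50)]
y = [0, 0, 0, 1, 1, 1]
for theta in (0.5, 0.1, 0.01):
    chosen = select_prototypes(X, y, theta)
    print(theta, len(chosen) / len(X))
```

On this toy data the selection ratio grows as θ shrinks, mirroring the behaviour described above: with a small enough θ every hyperrectangle contains only its own centre, so every training point becomes a prototype.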
The prototype selection rates for Ramnit, Lollipop, Tracur, and Gatak range from 56.59% to 66.06%. Their results show similar or higher performance than before the prototype selection algorithm was applied. This suggests that the prototype instances representing each class reflect the original class data distribution and that the boundaries between malware families are distinguishable to some extent. Kelihos_ver3, Vundo, Kelihos_ver1, and Obfuscator.ACY show relatively low prototype selection rates, ranging from 13.6% to 41.9%, with similar classification performance. It is anticipated that the instances of these malware families are clustered together and that several family groups are dispersed throughout the feature space.
The performance of Simda is 2.4% lower than that of the other families, although its prototype selection rate is around 93.0%. The number of Simda instances in the original dataset is too small (0.4%) for the gathered instances alone to reflect the data distribution, so the selected prototypes do not contain sufficient information on the distribution of this malware family. The same analysis can be considered for Ramnit, Lollipop, Tracur, and Gatak.
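The per-family selection rates discussed above are simple ratios of kept to original instances. A minimal sketch, assuming the labels of the original training set and the indices of the selected prototypes are available (the function name and toy values are hypothetical):

```python
from collections import Counter


def selection_rate_by_family(labels, prototype_idx):
    """Fraction of each family's instances that were kept as prototypes."""
    total = Counter(labels)
    kept = Counter(labels[i] for i in prototype_idx)
    return {family: kept[family] / total[family] for family in total}


# Toy labels, not taken from the experiments.
labels = ["Simda", "Simda", "Vundo", "Vundo", "Vundo", "Vundo"]
prototypes = [0, 1, 2]
print(selection_rate_by_family(labels, prototypes))
```

A rate close to 1 for a very small family, as in the Simda case, indicates that almost no instance could be represented by another one of the same family.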

V. CONCLUSION
As malware variants increase, both the time complexity and the model complexity of malware classification grow. To address these challenges, this paper proposed an integrated system combining a fixed-size feature design with a prototype selection method based on hyperrectangles. Unlike the previously studied high-dimensional malware features, the histogram entropy benefits from low dimensions, reducing learning time and avoiding overfitting. The hyperrectangle-based prototype selection method generates a smaller dataset with more meaningful instances from the original dataset. As a result, the approach can save storage space and training time while

DOOSUNG HWANG is a Professor in the Department of Software Science, Dankook University, South Korea. He received his Ph.D. from Wayne State University, USA. Previously, he was a senior researcher at ETRI (Electronics and Telecommunications Research Institute), South Korea, and worked on learning algorithm design and intelligent systems such as expert systems, image recognition, time-series analysis, and parallel computing.

VOLUME 4, 2016