MSIC: Malware Spectrogram Image Classification

The heavy reliance of individuals and organizations on digital technology has reshaped the traditional economy into a digital economy. In response, cybercriminals have shifted from showing off skills and conducting individual attacks to highly sophisticated attacks with financial gain as the goal. This inevitably challenges the cybersecurity community as it strives to preserve the confidentiality, integrity and availability of the private data and services of individual users and corporations. Cybercriminals mainly deploy malware, such as ransomware and botnets, to achieve their goals. The use of encryption, packing and polymorphism makes malware files harder to detect, especially as they are created in great numbers every day. In this paper, a novel framework named Malware Spectrogram Image Classification (MSIC) is proposed. It employs spectrogram images in conjunction with a convolutional neural network to classify a malware file to its corresponding family and to differentiate it from a benign file. Further, this research shares two privately collected labeled datasets, one of malware families and one of malicious and benign files, with the research community. In the evaluation, MSIC achieved a 91.6% F-measure and 92.8% accuracy in classifying malware files to their corresponding families, compared with 90.6% and 92.3%, respectively, for the grayscale image classification approach. Likewise, in classifying files as malicious or benign, MSIC scored 96% for both F-measure and accuracy, compared with 95.5% for the grayscale solution. MSIC also required less computational time to convert and resize the files than the grayscale framework.


I. INTRODUCTION
The rapid evolution of digital technology has led to its integration into personal and corporate activities such as personal banking and e-commerce. As a result, the traditional economy has transformed into an Internet economy. A Google and Temasek research program reported that Southeast Asia's Internet economy reached US$100 billion [1]. The increasing reliance on digital technology and the recent transformation of the economy have caught the attention of cybercriminals, who have shifted from individual attacks to well-organized attacks with financial gain as the objective. The total cost of cybercrime for each company increased from US$11.7 million in 2017 to a new high of US$13.0 million, a rise of 12% [2].
Malicious software (malware) is the primary tool that attackers use to achieve their ends. The form of the malware varies according to the purpose of the attack: it could be adware, a worm, ransomware, spyware or a botnet. Botnets are the main platform used nowadays [3]. McAfee's security report found that more than 60 million new malware samples were identified in the last quarter of 2018 [4]. The creation of such a huge number of malicious executables is mainly driven by advances in malware implementation that utilize packing, polymorphic and metamorphic techniques [5]. In addition, many newly released malware families are not coded from scratch; rather, they are rewritten as variants of previously released families [6]. Symantec found that the number of new malware families has dropped as cybercriminals modify existing ones [7]. For example, Fox IT found that the Tilon malware is linked to the Spyeye and Zeus malware families [8]. These findings provide clear evidence of the importance of identifying malware files, classifying them to their families and preventing them from being executed on the victim's device.
The associate editor coordinating the review of this manuscript and approving it for publication was Mohamed Elhoseny.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Motivated by this evident threat, the information security community has proposed security measures to detect malware and prevent its attacks. Static detection reverse engineers the executable to extract a signature from the code and matches it against an updatable database. Dynamic detection executes the malware and identifies malicious behavior, without the need to reverse engineer it. Image-based detection, a recent approach in the literature, converts the file to an image and classifies the image as malicious or benign. The latter approach, unlike static and dynamic analyses, does not require domain knowledge for malware detection. This research was motivated by the need to identify the huge number of malware files that cybercriminals create and to classify them to their corresponding families. Our paper provides the following four contributions:
• We propose a novel framework named Malware Spectrogram Image Classification (MSIC) that distinguishes malicious files from benign files and classifies malware files to their corresponding families by utilizing voice spectrogram analysis in conjunction with deep learning algorithms.
• Evaluation analysis of MSIC is conducted for different deep learning parameters to classify malware files to their corresponding families and distinguish them from benign files. The results are compared to the grayscale classification framework that has been used by researchers in the literature.
• The research provides the information security community with a labeled malware dataset containing 11 malware families.
• The research provides the information security community with a labeled malicious-benign dataset that was collected from the Internet in our private lab.
The rest of the paper is organized as follows. Section II reviews the classification models that have been proposed in the literature. Section III introduces the proposed framework, MSIC, along with a discussion of audio signals and the Convolutional Neural Network (CNN). Section IV introduces the experimental environment and the evaluation metrics used. Section V explains the malware dataset collection process and presents the evaluation of the MSIC and grayscale frameworks in classifying malware files to their corresponding families. Section VI describes the collected malicious-benign dataset, along with the experimental evaluation of both frameworks in distinguishing malicious from benign files. Section VII concludes the research and points to directions for future work.

II. CLASSIFICATION MODELS
Researchers have been striving to frustrate the malicious intentions of cybercriminals. The following subsections shed light on the different techniques that have been proposed to distinguish malicious from benign files and to classify malware to their corresponding families.

A. STATIC ANALYSIS TECHNIQUE
Static analysis detection works without the need to execute the malware. The solution is to first reverse engineer the executable. Thereafter, a signature is extracted from the malware source code and compared with a database that is updated periodically. This solution usually integrates machine learning algorithms in the classifier.
The authors in [9] were among the first researchers to use machine learning to build a static analysis classifier that detects malware based on features such as program headers, strings and byte sequences. Their work was improved upon by using n-grams of byte codes as the classifier's features [10]. The researcher in [11] utilized opcode sequences with a support vector machine algorithm to identify malicious executables. The Application Programming Interface (API) sequence in the code was used to detect malware and proved to be effective and faster than assembly analysis [12]. The MalConv classifier detects malicious executables by feeding their raw bytes to a neural network [13].
Static analysis suffers from many shortcomings. It requires domain knowledge of the executable architecture to build the classifier. Further, information such as the size of data structures or variables is lost, which complicates the analysis of the malware code [14]. In addition, static analysis performs poorly in detecting unknown malware. Although some static analysis techniques, such as byte n-grams, do not require domain-level knowledge, they suffer from low performance and high computational requirements [15].

B. DYNAMIC ANALYSIS TECHNIQUE
Unlike the static analysis, dynamic analysis addresses the behavior of an executable by executing it and identifies the features accordingly. Such features include registry changes, memory writes, API and system calls. Usually, the dynamic analysis takes place inside a virtual machine.
The work in [16] differentiated malicious binaries from benign binaries by monitoring API calls and applying the n-gram technique. The authors in [17] extracted API calls using a 5-minute time window. Then, they utilized a Recurrent Neural Network (RNN) to build the classifier and a CNN to evaluate the classifier. This solution attained 96% on the area under the curve (AUC) measure. A solution that detects malware by monitoring network traffic was proposed by the authors in [18]. They were able to reduce the detection time by 67% compared to conventional methods. In [19], the authors combined RNN and CNN to perform hierarchical feature extraction and used the n-gram technique to select appropriate opcodes for malware detection. In [20], the authors conducted a comprehensive comparison between static analysis, dynamic analysis and a hybrid solution using a large number of malware families. They found that dynamic analysis achieved the most accurate results.
Dynamic analysis is free from some drawbacks of static analysis, such as poor detection of unknown malware. However, it requires more computational resources and produces false positives more frequently.

C. IMAGE PROCESSING TECHNIQUE
The most recent technique for classifying malicious files is the use of image processing algorithms. Malware raw bytes are converted into grayscale images, and neural networks are used to classify each image to its corresponding malware family or to classify it as malicious or benign. This technique is motivated by the fact that attackers generate many malware variants from existing malware families, without writing the code from scratch, with the help of packing and metamorphic techniques. The image processing technique identifies and uses the layout and texture of malware to classify it to its corresponding family and to distinguish it from benign executables.
The research conducted in [21], [22] addressed the effectiveness of classifying malware to their corresponding families by converting the malware into grayscale images. The proposed frameworks, Signal Processing Approach to Malware Analysis (SigMal) and Search and Retrieval of Malware (SARVAM), gave very accurate results. In [23], the researchers compared binary texture classification and dynamic analysis classification. The two techniques showed very similar accuracy, but the binary texture solution classified faster. The authors in [24] provided preliminary experimental results of the image processing technique in classifying malware, achieving 98% classification accuracy on a dataset of 9,458 samples from 25 malware families. The researchers in [25]-[27] used image processing techniques to classify malware files by unpacking them and then representing their assembly code and opcodes as images. The authors evaluated CNN, ResNet and GoogLeNet algorithms for classifying the images to their corresponding classes. The drawback of this solution lies in the need for the unpacking process to generate the images. To overcome this limitation, the authors in [28] proposed a hybrid method of malware visualization in a big data environment, evaluated it against public and private malware datasets and achieved high accuracy. The solution converts the raw bits of the files into grayscale images, which permits the files to be classified without unpacking them. This image conversion overcomes the packing and metamorphic evasion techniques. An ensemble CNN architecture for malware grayscale image classification was addressed by the authors in [29]. They evaluated the results of the VGG16 and ResNet-50 algorithms using the Malimg dataset. Later, the authors provided another solution called Image based Malware Classification using Fine-tuned Convolutional Neural Network Architecture (IMCFN), which converts raw malware binaries into images that are used by the fine-tuned CNN architecture to detect and identify malware families [30].
Unlike static and dynamic analyses, implementing the image processing solution does not require strong domain knowledge. Further, because it converts the raw malware binaries directly into images, it does not require reverse engineering of the file and therefore overcomes the packing and metamorphic evasion techniques. It is also fast and works on various malware irrespective of the operating system. The MSIC framework is evaluated and compared against the CNN grayscale image classification solution that has been proposed in the literature [28]. In the latter technique, the grayscale images are prepared by converting the raw bits of the files into bit strings and grouping every 8 bits into an unsigned number that represents a pixel color; the resulting images are fed to the CNN. In contrast, the proposed MSIC framework converts the raw bits of the files into spectrogram images using audio signal fundamentals and feeds them to the CNN. This avoids the need to extract static features, which enables it to overcome evasion techniques such as packing, polymorphism and metamorphism.

III. MSIC FRAMEWORK
The proposed solution in this work aims to classify malware files to their corresponding families and distinguish them from benign files by utilizing an image processing technique. Such a solution does not require domain knowledge and is fast as compared to other solutions. MSIC is shown in Figure 1. It is divided into two main stages, malware image preparation and malware image classification. These two stages are conducted with the use of audio signals fundamentals and neural network algorithms.
An audio signal can be visualized in the time domain, in the frequency domain or as a spectrogram. A time-domain plot shows the amplitude variations of the signal against time in an x-y plane. A frequency-domain plot shows the frequencies in the signal with their magnitudes. The Fourier Transform (FT) is the mathematical operation that converts a signal from the time domain to the frequency domain. A spectrogram shows how the frequency content of an audio signal varies with time: the x-axis represents time, the y-axis represents frequency, and the third dimension, the magnitude of a frequency at a specific time, is rendered as a colored heatmap. A spectrogram can be obtained by applying the short-time FT (STFT) to the time-domain signal. The STFT breaks the signal into small windows and calculates the FT of each window. The literature has shown the effectiveness of using spectrogram images for speech recognition [31]-[34]. The FT and STFT are computed using the following equations for continuous signals:

$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt$$

$$X(\tau, \omega) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j\omega t}\, dt$$

where $x(t)$ is the time-domain signal to be transformed, $\tau$ is the time, $\omega$ is the frequency, $w(t)$ is the window function and $X(\tau, \omega)$ is the complex function representing the phase and magnitude of the signal over time and frequency.
Analog audio signals are transformed into digital audio signals so that they can be processed by digital devices. To do this, two parameters must be defined: the sampling rate and the bit depth. The sampling rate is the number of samples taken from the analog signal per unit of time to represent it digitally, usually measured in samples per second or Hz. For example, 8 kHz = 8000 samples/second. A higher sampling rate provides higher quality. The bit depth is the number of bits used to represent each sample's amplitude. For example, WAV audio files usually use a 44100 Hz sampling rate and a 16-bit depth.
Since signals are stored as numbers on computers, discrete functions must be used. Discrete FT (DFT) is used to convert a discrete signal from time-domain to frequency-domain. Fast FT (FFT) algorithm is used to compute DFT. FFT reduces the DFT computation complexity from $O(N^2)$ to $O(N \log N)$. Spectrogram can be attained using STFT, where the FFT algorithm is computed over each window.
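The relationship between the DFT and the FFT can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `naive_dft` and the random test signal are illustrative and not part of the original work. It evaluates the DFT sum directly, which costs on the order of N^2 operations, and confirms that the FFT returns the same coefficients.

```python
import numpy as np

def naive_dft(x):
    """Direct evaluation of X_k = sum_n x_n * exp(-2j*pi*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    n = np.arange(len(x))
    k = n.reshape(-1, 1)
    # N x N matrix of complex exponentials: this is the quadratic-cost step.
    twiddle = np.exp(-2j * np.pi * k * n / len(x))
    return twiddle @ x

rng = np.random.default_rng(0)
signal = rng.standard_normal(256)          # illustrative random signal
# The FFT computes the same coefficients with far fewer operations.
same = np.allclose(naive_dft(signal), np.fft.fft(signal))
print(same)
```

Both routines compute the same transform; the FFT only changes how the sum is organized, which is why it can replace the direct DFT inside each STFT window.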
The DFT and STFT equations for discrete signals are given below:

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-j 2\pi k n / N}$$

$$X(m, \omega) = \sum_{n=-\infty}^{\infty} x_n\, w_{n-m}\, e^{-j\omega n}$$

where $X_k$ is the $k$-th DFT coefficient of an $N$-sample sequence, $X(m, \omega)$ is the STFT of the time-domain sequence ($\mathrm{STFT}\{x_n\}(m, \omega)$), $x_n$ is the sequence of the discretized time-domain signal to be transformed, $m$ is the time index, $\omega$ is the frequency and $w_n$ is the sequence of the discretized window function.
The first stage of the MSIC framework visualizes the file as a spectrogram image to be used as input to the neural network classifier. First, the file is read as a raw bit string of 0s and 1s. Then, the sampling rate and bit depth parameters are defined. Thereafter, the time-domain signal is generated using the chosen sampling rate and bit depth. Subsequently, the generated signal is segmented into fixed-size symmetrical windows that overlap one another. The FFT is calculated for each segment; computing the FFT over the successive segments constitutes the STFT. The FFTs of the segments are then assembled into the spectrogram image of the file, which represents how the frequency content of the signal changes with time.
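The preparation stage can be sketched in a few lines. This is a minimal illustration, assuming NumPy, not the authors' implementation: the window size of 256 samples, the 50% overlap, the little-endian byte order and the decibel scaling are all assumptions, and the synthetic `data` value stands in for the contents of a real file.

```python
import numpy as np

def bytes_to_signal(data: bytes) -> np.ndarray:
    """Interpret raw bytes as 16-bit signed little-endian samples."""
    usable = len(data) - (len(data) % 2)       # drop a trailing odd byte
    return np.frombuffer(data[:usable], dtype="<i2").astype(np.float64)

def spectrogram(signal, window_size=256, hop=128):
    """Magnitude STFT with a Hann window; rows = frequency, columns = time."""
    if len(signal) < window_size:              # pad very small files
        signal = np.pad(signal, (0, window_size - len(signal)))
    window = np.hanning(window_size)
    starts = range(0, len(signal) - window_size + 1, hop)
    frames = np.stack([signal[s:s + window_size] * window for s in starts])
    # rfft keeps the non-negative frequencies of each real-valued frame.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return 20 * np.log10(magnitude + 1e-10).T  # decibel scale (assumed)

data = bytes(range(256)) * 64   # stand-in for open(path, "rb").read()
spec = spectrogram(bytes_to_signal(data))
print(spec.shape)
```

In this sketch the sampling rate only labels the axes: with a rate of 44100 Hz, column `i` corresponds to time `i * hop / 44100` seconds and row `k` to frequency `k * 44100 / window_size` Hz. The width of the array grows with the file size, which is why a resizing step is needed before classification.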
In the second stage, a classifier is built to classify the spectrogram images. A CNN is used for this purpose because its effectiveness has been demonstrated in the literature [35], [36]. A CNN is a deep learning algorithm that takes an image, assigns importance to several aspects of the image and differentiates the image from the rest. The CNN aims to reduce the images to a form that is simpler to process without losing the features that are important for a good prediction. To achieve this, the image is represented as a matrix of pixel values and a filter matrix is defined. The filter slides from the top left of the image to the right with a certain stride value, extracting features at each step until it covers the entire width. Then, it moves down and starts again from the left of the image with the same stride value. It repeats this process to cover the entire image. The values extracted by the filter form a feature map. The optimal number of filters and their size are usually identified through parameter tuning. A Rectified Linear Unit (ReLU), a non-linear activation function, is then applied to each element of the feature map.
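The sliding-filter operation can be illustrated with a single hand-written filter. This is a minimal sketch assuming NumPy; the 5 x 5 input and the 2 x 2 filter values are hypothetical, and a real CNN learns its filter values during training rather than fixing them by hand.

```python
import numpy as np

def conv2d_relu(image, kernel, stride=1):
    """Slide the filter over the image without padding, then apply ReLU.

    As in common CNN libraries, the filter is not flipped, so this is
    strictly a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.empty((out_h, out_w))
    for i in range(out_h):              # move down one stride at a time
        for j in range(out_w):          # slide from left to right
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return np.maximum(feature_map, 0.0)  # ReLU on each element

image = np.arange(25, dtype=float).reshape(5, 5)   # hypothetical input
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])       # hypothetical filter
fmap = conv2d_relu(image, kernel, stride=1)
print(fmap.shape)
```

With a stride of 1 the 5 x 5 input yields a 4 x 4 feature map; a stride of 2 yields a 2 x 2 map, which shows how the stride controls the size of the output.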
After the convolution layer is processed, each feature map is fed into the pooling layer. This layer reduces the spatial size of the convolved feature, thus reducing the computation required to process the data. A pooling layer can use maximum, average or minimum pooling. Maximum pooling was used here since it keeps the maximum value in each pooling region and suppresses noise. Finally, the output of the pooling layer is flattened and fed to a fully connected neural network for classification. The pseudo-code for MSIC is given in Table 1, which summarizes the steps of the spectrogram preparation and CNN building processes.
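Maximum pooling and flattening can be sketched as follows, assuming NumPy. The 2 x 2 pooling size and the feature map values are hypothetical; the text does not state the pooling size that was used.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows and columns that do not fill a
    complete window are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))       # maximum of each size x size block

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 7, 2],
                 [3, 6, 2, 8]], dtype=float)   # hypothetical feature map
pooled = max_pool2d(fmap)     # [[4, 5], [6, 8]]
flat = pooled.flatten()       # vector passed on to the fully connected layers
print(pooled, flat)
```

Each 2 x 2 block is replaced by its largest value, so the 4 x 4 map shrinks to 2 x 2 before being flattened into a four-element vector.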

IV. EXPERIMENT ENVIRONMENT AND EVALUATION METRICS
The data collection and experimental evaluation procedures in this research were conducted in an isolated environment to avoid any unintentional malware infection or spread. To prevent the spread of malware, the following measures were applied. First, the workstation used for data collection and experimental evaluation was not connected to any local network and had a dedicated Internet connection. Second, a Linux virtual machine was used to control the connections. Third, all outbound connections were blocked, except those to pre-defined addresses that were necessary for our experiment. The experiment's workstation runs Ubuntu 19.10 (64-bit) with a 7th-generation Intel Core i7, 512 GB SSD, 16

V. MALWARE FAMILIES CLASSIFICATION EXPERIMENT
This experiment aims to build a multiclass classifier that classifies malicious files to their corresponding families and to evaluate it against privately collected data. To achieve this, a labeled dataset first had to be prepared, which was done manually in our labs. Then, the MSIC framework was built and evaluated. Finally, the raw byte grayscale image classification framework was built and evaluated. Grayscale image classification maps each raw byte of a file to a pixel color: black has a value of 0, white has a value of 255 and the remaining values represent intermediate shades of gray [38].

A. DATASET DESCRIPTION
Preparing a labeled malware dataset is not a trivial task, and no single site provides a malware dataset with enough labeled files. Although some public datasets exist, such as Malimg, they contain only grayscale images and cannot be used for analyses other than grayscale classification. Therefore, labeled malware files had to be acquired so that they could be converted into spectrogram images for the MSIC framework. For collecting and labeling the data, an isolated environment was created, in which the process illustrated in Figure 2 was followed. First, malware files were downloaded from three main sources: Malshare,¹ Virusign² and Dasmalwerk.³ Second, the hash values of the collected files were uploaded to the VirusTotal website. Third, the JSON files of the uploaded hashes were downloaded from VirusTotal. Fourth, the AVClass tool [39] was used to label the malware samples using the collected JSON files. Fifth, the obtained labels were mapped to the malware files. The collected dataset is summarized in Table 2 and has been shared publicly with the research community [40]. In total, 9187 malware files representing 11 malware families were collected. The data were divided into three datasets: 60% train, 20% validate and 20% test. The train dataset was used to build the classifier. The validate dataset was used to select the best parameters during the building process. The test dataset was used to evaluate the accuracy of the built classifier in classifying unseen malware files to their corresponding families.
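The 60/20/20 division can be sketched with the standard library alone. The text does not say whether the split was stratified by family; the version below assumes it was, so that each family keeps roughly the same proportions in every subset. The function name, the seed and the toy labels are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validate, test) index lists, split per family."""
    by_family = defaultdict(list)
    for index, family in enumerate(labels):
        by_family[family].append(index)
    rng = random.Random(seed)            # fixed seed for repeatability
    train, validate, test = [], [], []
    for indices in by_family.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        validate += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to test
    return train, validate, test

labels = ["ramnit"] * 50 + ["razy"] * 100   # hypothetical toy labels
train, validate, test = stratified_split(labels)
print(len(train), len(validate), len(test))
```

Splitting per family avoids the situation in which a small family ends up entirely in one subset, which would make the per-class results on the test dataset unreliable.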

B. MSIC RESULTS AND ANALYSIS
The collected files had to be represented as spectrogram images to be fed to the CNN classifier, as illustrated in Figure 1. For that purpose, the files were first converted into raw bit strings. Then the sampling rate and bit depth were set to 44100 Hz and 16-bit signed integer, respectively, the same values used by WAV audio files. Thereafter, the spectrogram of each file was obtained using the discrete STFT with a Hanning window [41]. Figure 3 shows a sample spectrogram of the emotet malware. One expected observation in the resulting spectrogram is its low periodicity, since audio files naturally have higher periodicity than non-audio files. The resulting spectrogram images had different pixel heights and widths because the sizes of the files in the collected dataset ranged from bytes to megabytes. Since the input of the CNN must consist of images of a uniform size, Python resize functionality was used to produce input images of a standard size of 200 × 200 pixels.⁴ The spectrograms of all malware files were obtained and used to build and evaluate the CNN classifiers. The train and validate datasets were used to build the CNN classifiers and to select their most suitable parameters. Data augmentation, such as zoom, crop and flip, was applied to the training dataset to reduce the impact of overfitting. Several parameter settings were tested to select the best classifier, including the learning rate (0.1, 0.01 and 0.001), the number of filters (16, 32 and 64) and the use of a dropout layer. The experiments showed that a learning rate of 0.001 was the most effective, compared to 0.1 and 0.01, regardless of the number of filters and the use of the dropout layer, throughout the 100 epochs. Therefore, 0.001 was selected for the rest of the experiment. Using 16 filters gave lower accuracy and a highly fluctuating loss pattern compared to 32 and 64 filters. With 32 and 64 filters, the results showed similar accuracy and loss behavior, so 32 filters were used in our classifier to save resources. The validation results showed that the performance with the dropout layer was better than without it, regardless of the number of filters and the learning rate. Therefore, the dropout layer was integrated into our classifier. As a result, the best classifier's parameters were a learning rate of 0.001, 32 filters and the presence of the dropout layer. This classifier was used for the rest of the experiment.
¹ https://www.malshare.com/
² https://www.virusign.com/
³ https://dasmalwerk.eu/
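The parameter search described in this subsection amounts to a small grid. The sketch below enumerates it using only the standard library. The `toy_score` function is a placeholder that exists solely to make the example runnable; it does not reproduce any result from the experiments, and in practice it would be replaced by a routine that trains a CNN for 100 epochs and returns the accuracy on the validate dataset.

```python
from itertools import product

learning_rates = [0.1, 0.01, 0.001]
filter_counts = [16, 32, 64]
dropout_options = [False, True]

def select_best(validation_accuracy):
    """Return the configuration with the highest validation score."""
    grid = product(learning_rates, filter_counts, dropout_options)
    return max(grid, key=validation_accuracy)

def toy_score(config):
    """Placeholder score (NOT a measured result): prefers small rates."""
    learning_rate, filters, dropout = config
    return -learning_rate

grid_size = len(list(product(learning_rates, filter_counts, dropout_options)))
best = select_best(toy_score)
print(grid_size, best)
```

The grid contains 3 × 3 × 2 = 18 configurations, each of which requires a full training run, which is one reason the cheaper 32-filter setting is attractive when its accuracy is similar to that of 64 filters.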
The validate dataset was used to compare three network structures, CNN_1, CNN_2 and CNN_3, using 1, 2 and 3 layers, respectively. Table 3 summarizes the number of parameters of the three structures. Figure 4 and Figure 5 show the accuracy and loss behavior over the 100 epochs. Overall, the three structures showed good accuracy, with CNN_1 scoring the lowest accuracy throughout the 100 epochs. The accuracy of CNN_2 was comparable to that of CNN_3, with the latter being slightly better. The three structures showed a high loss pattern before epoch 20 and stabilized thereafter. Two spikes were noticed afterwards, in epochs 81 and 90, the latter being the highest loss spike throughout the 100 epochs.
The test dataset was used to evaluate the three built structures in classifying unseen malware samples. Table 4 lists the accuracy, precision, recall and F-measure of the three structures. CNN_1 attained 90.5% recall and precision, a 90.4% F-measure and 91.7% accuracy. Its lowest per-class F-measure was for the ramnit malware, whereas its highest was for the icedid malware. CNN_2 scored a lower recall (90.2%) than CNN_1 but higher precision, F-measure and accuracy, at 91.1%, 90.6% and 92%, respectively. As with CNN_1, the lowest per-class F-measure in CNN_2 was for ramnit and the highest was for icedid. CNN_3 outperformed the other two classifiers, achieving 91.5%, 91.9%, 91.6% and 92.8% for recall, precision, F-measure and accuracy, respectively. The ramnit class scored the lowest evaluation metrics, whereas coinhive scored the highest.
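The four metrics can all be derived from a confusion matrix. The following is a minimal sketch assuming NumPy. It uses macro averaging, that is, an unweighted mean over classes, which is an assumption because the averaging scheme is not stated here; the 3-class matrix is hypothetical toy data, not a result from the experiments.

```python
import numpy as np

def macro_metrics(cm):
    """cm[i, j] = number of class-i files predicted as class j.

    Assumes every class is predicted at least once, so that no
    denominator is zero.
    """
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    recall = true_pos / cm.sum(axis=1)       # per actual class (rows)
    precision = true_pos / cm.sum(axis=0)    # per predicted class (columns)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall.mean(),
        "precision": precision.mean(),
        "f_measure": f_measure.mean(),
        "accuracy": true_pos.sum() / cm.sum(),
    }

toy = [[8, 2, 0],
       [1, 9, 0],
       [0, 1, 9]]            # hypothetical 3-class counts
scores = macro_metrics(toy)
print(scores)
```

Accuracy counts every file equally, whereas macro-averaged recall, precision and F-measure give each family the same weight, so a small, poorly classified family lowers them more than it lowers accuracy.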
The confusion matrix of the CNN_3 structure is listed in Table 5. For ramnit, the class with the lowest accuracy, the classifier correctly classified 51 out of 67 files and misclassified 7 files as razy, 5 as fareit, 3 as emotet and 1 as mirai. For razy, the class with the second lowest accuracy, the classifier correctly classified 106 out of 134 files and misclassified 10 files as ramnit, 7 as fareit, 5 as emotet, 2 as gafgyt, 2 as gandcrab and 2 as mirai. The results suggest a similarity between the spectrogram images of the ramnit and razy malware. For the two best performing classes, CNN_3 correctly classified all coinhive files and 294 out of 297 gafgyt files.
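The per-class recall implied by these counts can be worked out directly from the quoted confusion matrix rows. The short sketch below uses only the numbers given in the paragraph above; the helper name `class_recall` is illustrative.

```python
# Confusion matrix rows quoted above for the CNN_3 spectrogram classifier:
# each row maps a predicted family to a number of files.
ramnit_row = {"ramnit": 51, "razy": 7, "fareit": 5, "emotet": 3, "mirai": 1}
razy_row = {"razy": 106, "ramnit": 10, "fareit": 7, "emotet": 5,
            "gafgyt": 2, "gandcrab": 2, "mirai": 2}

def class_recall(row, family):
    """Correctly classified files divided by all files of that family."""
    return row[family] / sum(row.values())

ramnit_recall = class_recall(ramnit_row, "ramnit")   # 51 / 67
razy_recall = class_recall(razy_row, "razy")         # 106 / 134
print(round(ramnit_recall, 3), round(razy_recall, 3))
```

The rows sum to 67 and 134 files, matching the stated class sizes, and give a recall of about 76.1% for ramnit and 79.1% for razy. The confusion is also mutual: 7 ramnit files are labeled razy and 10 razy files are labeled ramnit.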

C. GRAYSCALE RESULTS AND ANALYSIS
This experiment evaluates the accuracy of classifying the malware files to their corresponding families using the grayscale framework that has been used in the literature, e.g., in [28]. First, the grayscale images were prepared by converting the files into bit strings and grouping every 8 bits into an unsigned number that represents a pixel color. As in the MSIC experiment, the train and validate datasets were used to build the classifiers and select the best parameters. Thereafter, the test dataset was used to evaluate the selected classifiers. The same CNN parameters mentioned earlier were used, except for the input shape, which in this experiment was 200 × 200 × 1. Python resize functionality was used to obtain this shape.
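The grayscale preparation can be sketched as follows, assuming NumPy. The row width of 256 pixels, the zero padding of the last row and the nearest-neighbour resizing are assumptions made for illustration; the text only states that every 8 bits become one pixel and that Python resize functionality produces a 200 × 200 × 1 input. The synthetic `data` value stands in for the contents of a real file.

```python
import math
import numpy as np

def bytes_to_grayscale(data: bytes, width=256):
    """One byte per pixel (0 = black, 255 = white), zero-padded to full rows."""
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = max(1, math.ceil(len(pixels) / width))
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:len(pixels)] = pixels
    return padded.reshape(height, width)

def resize_nearest(image, size=(200, 200)):
    """Nearest-neighbour resize implemented with index lookups only."""
    rows = np.arange(size[0]) * image.shape[0] // size[0]
    cols = np.arange(size[1]) * image.shape[1] // size[1]
    return image[rows][:, cols]

data = bytes(range(256)) * 300      # stand-in for open(path, "rb").read()
image = resize_nearest(bytes_to_grayscale(data))
cnn_input = image[..., np.newaxis]  # add the single grayscale channel
print(cnn_input.shape)
```

Unlike the spectrogram pipeline, this conversion involves no transform: the pixel values are the file's bytes themselves, so the image texture directly mirrors the byte layout of the file.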
The experiments led to the following findings. Learning rates of 0.1 and 0.01 provided good accuracy and loss results for different numbers of filters, but a learning rate of 0.001 gave the best results of the three. Using the dropout layer reduced the fluctuation of the loss compared to not using it, regardless of the number of filters. Using 64 filters showed better accuracy and loss results than 32 and 16 filters across the 100 epochs. As a result, a learning rate of 0.001, the dropout layer and 64 filters were the parameters used for the rest of the experiment.
To select the network structure, three models were built and evaluated using the train and validate datasets along with the identified best parameters. The models were named CNN_1, CNN_2 and CNN_3 and used 1, 2 and 3 layers, respectively. Figure 6 and Figure 7 show the accuracy and loss behavior of each structure across the 100 epochs. CNN_3 showed the best accuracy. CNN_2 demonstrated better accuracy than CNN_1, although its accuracy dropped in epoch 97. In terms of loss, CNN_3 gave the best outcome with the least fluctuation, whereas CNN_1 showed the highest fluctuation and the worst results. The loss pattern of CNN_2 was comparable to that of CNN_3, although it had a spike in epoch 90.
The three network structures were evaluated against unseen malware files using the test dataset. Table 6 presents the results. CNN_1 showed the lowest performance at 85.4%, 88.9%, 85.8% and 89.5% for recall, precision, F-measure and accuracy, respectively. The ramnit class had the lowest results of all the classes. CNN_2 showed better results with 87.4% for recall, 89.7% for precision, 87.9% for F-measure and 90.6% for accuracy. As with CNN_1, the ramnit class had the poorest performance. CNN_3 outperformed the CNN_1 and CNN_2 classifiers with 91% for recall, 90.2% for precision, 90.6% for F-measure and 92.3% for accuracy. The performance of the ramnit class was better than with CNN_1 and CNN_2 but remained the lowest among all the classes.
The confusion matrix of CNN_3 for the 11 classes is given in Table 7. CNN_3 correctly classified most of the files to their corresponding families. The lowest class result was for ramnit, with 48 out of 67 files correctly classified. Of the 19 incorrectly classified files, 8 were classified as razy, 8 as emotet, 1 as fareit, 1 as gandcrab and 1 as icedid. The next lowest was razy, with 106 out of 134 files identified correctly. Its incorrect predictions were 9 files as ramnit, 9 as emotet, 3 as coinhive, 3 as fareit, 3 as gandcrab and 1 as mirai. These results indicate a similarity between the razy, ramnit and emotet malware, all of which target the Windows OS. On the other hand, the classifier correctly predicted all coinhive files and 91 out of 92 icedid files.
The malware families experiment demonstrated the effectiveness of MSIC in classifying malware files to their corresponding families irrespective of their type or the OS they run on. Further, MSIC outperformed the grayscale framework on the evaluation metrics used.

VI. MALICIOUS-BENIGN CLASSIFICATION EXPERIMENT
This experiment aims to build a binary classifier that distinguishes benign files from malicious files. First, MSIC was built and evaluated. Afterwards, the grayscale classification framework was built and evaluated.

A. DATASET DESCRIPTION
A dataset containing both benign and malicious files is required to build and evaluate the MSIC and grayscale frameworks. The collection process used the environment and workstation specification explained earlier. The files were collected from the Malshare, Virusign and Dasmalware websites. The privately collected dataset is described in Table 8 and has been shared publicly with the research community [42]. It contains 2611 benign files and 2626 malicious files. The dataset was divided into 60% train, 20% validate and 20% test datasets. The train dataset is used to build the classifier, and the validate dataset is used to select the best parameters during the building process. The test dataset is used to evaluate the built classifier on unseen malicious and benign files.
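The 60/20/20 division described above can be sketched as follows. The text does not state how files were assigned to each subset, so the seeded random shuffle, the function name `split_dataset` and the placeholder file names are assumptions made purely for illustration.

```python
import random

def split_dataset(files, train=0.6, validate=0.2, seed=0):
    """Shuffle files and split them into train/validate/test lists."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)  # assumption: simple random split
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validate)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Placeholder names standing in for the 2611 benign and 2626 malicious files.
files = ([f"benign_{i}" for i in range(2611)]
         + [f"malicious_{i}" for i in range(2626)])
train_set, validate_set, test_set = split_dataset(files)
print(len(train_set), len(validate_set), len(test_set))
```

With 5237 files in total, this sketch yields subsets of 3142, 1047 and 1048 files. A stratified split that preserves the benign/malicious ratio in each subset would be an equally plausible reading of the text.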

B. MSIC RESULTS AND ANALYSIS
The collected files were converted into spectrogram images to be fed to the CNN classifier. The conversion followed the same process as in the previous experiment, using the WAV sampling rate and bit depth values and the Hanning window from that experiment. The resulting spectrograms of all the benign and malicious files were used to build and evaluate the CNN classifiers. The train and validate datasets were used to build the CNN classifiers and to select their most suitable parameters. Data augmentation was applied to reduce the impact of overfitting. The same parameters listed in the previous experiment were used to select the best classifier.
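To make the conversion step more concrete, the following is a minimal, standard-library sketch of a magnitude spectrogram computed from raw bytes with a Hanning window. It is not the implementation used in the experiments: the interpretation of each byte as one unsigned 8-bit sample, the frame size of 64, the hop of 32 and the function names are all assumptions, and a practical pipeline would use an FFT library rather than the naive DFT shown here.

```python
import cmath
import math

def hann_window(size):
    """Hann (Hanning) window coefficients for a frame of `size` samples."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (size - 1))
            for i in range(size)]

def spectrogram(data, frame_size=64, hop=32):
    """Return a list of frames, each holding magnitudes for bins 0..frame_size/2."""
    # Assumption: each byte is one unsigned 8-bit PCM sample, centred on 128.
    samples = [(b - 128) / 128 for b in data]
    window = hann_window(frame_size)
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        segment = [samples[start + i] * window[i] for i in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT of the windowed segment at frequency bin k.
            coeff = sum(segment[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames

# Synthetic stand-in for file content: a tone completing 8 cycles per frame.
tone = bytes(round(128 + 100 * math.sin(2 * math.pi * 8 * n / 64))
             for n in range(256))
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)
```

For the synthetic tone, the 256 bytes produce 7 overlapping frames of 33 frequency bins each, and the energy peaks at bin 8. Rendering such a matrix as an image is what gives the CNN its input.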
The following findings were arrived at after intensive experiments. For learning rates of 0.1 and 0.01, the accuracy was low and the loss was high, regardless of the number of filters or the use of the dropout layer. A learning rate of 0.001 provided the best accuracy and loss results across the conducted experiments. 16 filters with the dropout layer achieved slightly more accurate results with less fluctuation in the loss than 16 filters without the dropout layer. Similarly, 32 filters with the dropout layer showed better accuracy and lower loss than 32 filters without the dropout layer. Finally, 64 filters showed high losses with considerable fluctuation, whether or not the dropout layer was used, although the accuracy results were acceptable. Compared to 16 and 64 filters, the classifier with 32 filters and the dropout layer provided the best accuracy and loss results along with the least fluctuation throughout the 100 epochs. Therefore, these were selected as the best parameters for the rest of the experiment.
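The parameter search described above covers three learning rates, three filter counts and the presence or absence of the dropout layer. A minimal sketch of such a grid search is given below. The `evaluate` callable is hypothetical, and the scores in `dummy_evaluate` are stand-ins with no empirical meaning; they merely mimic the reported ordering so that the selection logic can be run. The sketch selects on validation accuracy alone, whereas the experiments also considered the loss and its fluctuation.

```python
from itertools import product

LEARNING_RATES = [0.1, 0.01, 0.001]
FILTERS = [16, 32, 64]
DROPOUT = [False, True]

def select_best(evaluate):
    """Return the configuration with the highest validation accuracy.

    `evaluate` is a hypothetical callable that trains a classifier with the
    given (learning_rate, filters, dropout) and returns validation accuracy.
    """
    grid = list(product(LEARNING_RATES, FILTERS, DROPOUT))
    return max(grid, key=lambda cfg: evaluate(*cfg))

def dummy_evaluate(learning_rate, filters, dropout):
    # Stand-in scores only; these are NOT results from the experiments.
    score = {0.1: 0.5, 0.01: 0.6, 0.001: 0.9}[learning_rate]
    score += {16: 0.01, 32: 0.03, 64: 0.0}[filters]
    score += 0.02 if dropout else 0.0
    return score

print(select_best(dummy_evaluate))
```

The grid contains 18 configurations, and with the stand-in scores the function returns the combination of learning rate 0.001, 32 filters and dropout enabled.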
To determine the network structure, three CNN models with 1, 2 and 3 layers were evaluated through 100 epochs against the validate dataset. The summaries of the parameters for each model are shown in Table 9. The accuracy and loss behavior of CNN_1, CNN_2 and CNN_3 are shown in Figure 8 and Figure 9. The CNN_3 model showed the highest and most stable accuracy throughout the 100 epochs. CNN_1 and CNN_2 were less accurate and less stable than CNN_3. On reaching the 100th epoch, CNN_1 and CNN_2 gave similar accuracy results. The loss figure shows that CNN_1 had the most fluctuating behavior, with high spikes during the 100 epochs. CNN_2 was the most stable, with the best loss pattern from epoch 20 onwards. Although CNN_3 fluctuated more than CNN_2, it became stable and reached the same result as CNN_2 by epoch 100.
The three classifiers were evaluated against unseen samples using the test dataset. Table 10 shows the accuracy, recall, precision and F-measure results for the benign and malicious classes and their average. While all the classifiers achieved good accuracy results, CNN_3 provided the best performance. CNN_1 correctly identified 90.4% and 96.6% of the benign and malicious samples, respectively. Its precision for benign and malicious samples was 96.5% and 90.6%, respectively. CNN_2 identified 93.5% of the benign samples with a precision of 95.3% and detected 95.2% of the malicious files with a precision of 93.3%. CNN_3 detected 95.2% of the benign samples with a precision of 96.9% and identified 96.8% of the malicious samples with a precision of 95.1%. CNN_3 outperformed the other two classifiers, with an average recall, precision, F-measure and accuracy of 96%.
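As a check on how the averaged figure relates to the per-class values, the sketch below recomputes the F-measure of CNN_3 from the recall and precision stated above. It assumes that the F-measure is the harmonic mean of recall and precision and that the reported average is the unweighted mean of the two classes; the variable names are illustrative.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

# CNN_3 test results for MSIC as stated in the text (percentages).
cnn3 = {
    "benign": {"recall": 95.2, "precision": 96.9},
    "malicious": {"recall": 96.8, "precision": 95.1},
}

per_class = {name: f_measure(m["recall"], m["precision"])
             for name, m in cnn3.items()}
average = sum(per_class.values()) / len(per_class)
print({name: round(value, 1) for name, value in per_class.items()},
      round(average, 1))
```

Under these assumptions the per-class F-measures come to about 96.0% (benign) and 95.9% (malicious), and their mean rounds to 96.0%, in line with the reported average.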

C. GRAYSCALE FRAMEWORK RESULTS AND ANALYSIS
The grayscale images of all the benign and malicious files were used to build and evaluate the CNN classifiers. Train and validate datasets were used in building and selecting the most suitable parameters for the built CNN classifiers. Data augmentation was used on the training dataset. The same parameters as used in the previous experiment were used, except that the input shape selected was 200 × 200 × 1.
The experiments showed the following findings. Learning rates of 0.1 and 0.01 achieved low accuracy and high loss results, whatever the number of filters used. On the other hand, the learning rate of 0.001 attained high accuracy with low loss. The experiments showed the effectiveness of adding the dropout layer for 16, 32 and 64 filters: adding the layer raised the accuracy and lowered the loss fluctuation. The deployment of 32 filters provided the best accuracy and loss results compared to 16 and 64 filters. Amongst all the filter counts, the classifier with 64 filters showed the most fluctuation in the loss. As a result, the best parameters for the classifier were a learning rate of 0.001 and 32 filters with the dropout layer. These were used for the remainder of the experiment.
Three network structures were built and evaluated using 1, 2 and 3 layers. The accuracy and loss results achieved are shown in Figure 10 and Figure 11. The accuracy of CNN_1 was the lowest but improved by the end of the 100 epochs. On the other hand, CNN_1 achieved good loss results, despite a spike at epoch 20. CNN_2 showed better accuracy than CNN_1 throughout the 100 epochs, but its accuracy dropped just before the 100th epoch. CNN_2 showed the worst loss results, with a highly fluctuating pattern during the 100 epochs, especially right before the 100th epoch. Among the three structures, CNN_3 achieved the best accuracy and loss results throughout the 100 epochs. The number of parameters for the three classifiers was the same as in the MSIC experiment.
The three structures were tested with the test dataset to evaluate their effectiveness in classifying unseen files. The test results are presented in Table 11. CNN_1 successfully identified 88.1% of the benign samples and 94.8% of the malicious samples. The precision values of the benign and malicious classes were 94.7% and 88.4%, respectively. CNN_2 detected 95.6% of the benign samples and 86.8% of the malicious samples, with precision values of 88.3% and 95%, respectively. CNN_3 outperformed the other two classifiers with average recall, precision, F-measure and accuracy results of 95.5%. CNN_3 detected 94.1% of the benign samples and 97% of the malicious samples. The precision was 97% for the benign samples and 94% for the malicious samples.
Table 12 presents the best results achieved with the MSIC and grayscale frameworks in classifying malware families and in distinguishing malicious from benign files. MSIC performed better on the recall, precision, F-measure and accuracy metrics in both tests.
The computational times of the different phases of the MSIC and grayscale frameworks are given in Table 13 for both the malware families and malicious-benign experiments. For the malware families experiment, MSIC required 723.96 seconds to convert the 9187 malware files into spectrogram images, an average of 0.079 seconds per file. The grayscale framework, on the other hand, spent 3837.58 seconds converting the malware files, an average of 0.42 seconds per file. The image resizing process of the MSIC framework required less time than that of the grayscale framework, at 148.74 seconds compared to 250.08 seconds for all the images. The classification times of the test dataset were almost identical for both frameworks. On average, a single image required 0.027 seconds to be classified.
For the malicious-benign experiment, MSIC required 720.74 seconds to convert the 5237 files into spectrogram images, an average of 0.14 seconds per file. In contrast, the grayscale framework converted the files in 2424.28 seconds, an average of 0.46 seconds per file. MSIC's resizing process took 109.92 seconds less than that of the grayscale framework. The classification times of the test dataset were nearly equal. On average, a single image needed 0.012 seconds to be classified. It can be concluded that image preparation in the MSIC framework, in terms of conversion and resizing, needed less time than in the grayscale framework. The classification times were almost equal since both frameworks have the same CNN parameters and input shape.
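The per-file averages above can be reproduced from the stated totals. The short sketch below does so and also derives the ratio between the grayscale and MSIC conversion times; that ratio is simple arithmetic on the reported figures and is not itself reported in Table 13. The dictionary keys and the function name are illustrative.

```python
# Conversion times (seconds) and file counts as stated in the text.
conversion = {
    "families": {"files": 9187, "msic": 723.96, "grayscale": 3837.58},
    "malicious-benign": {"files": 5237, "msic": 720.74, "grayscale": 2424.28},
}

def per_file(seconds, files):
    """Average conversion time per file, in seconds."""
    return seconds / files

for name, row in conversion.items():
    msic = per_file(row["msic"], row["files"])
    gray = per_file(row["grayscale"], row["files"])
    ratio = row["grayscale"] / row["msic"]
    print(f"{name}: MSIC {msic:.3f} s/file, "
          f"grayscale {gray:.3f} s/file, ratio {ratio:.1f}x")
```

On these figures, the grayscale conversion took roughly 5.3 times as long as the MSIC conversion in the malware families experiment and roughly 3.4 times as long in the malicious-benign experiment.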

VII. CONCLUSION
The arms race between cybersecurity analysts and cybercriminals is intensifying. Cybercriminals' attacks are increasingly well organized because the financial rewards are great, and they cause severe direct and indirect losses for individuals and corporations. The prevention of these attacks is hampered by polymorphism and packing, the techniques criminals use to evade detection. Static analysis detects known malware with great accuracy but fails to overcome the evasive techniques of polymorphism and packing. Dynamic analysis is effective against the evasion techniques but shows a high frequency of false positive results. Image processing detection overcomes the limitations imposed by the evasion techniques, with the additional advantage that it does not require domain knowledge for implementation. Grayscale malware image classification has been extensively studied, and its effectiveness in classifying malware files to their corresponding families and differentiating them from benign files has been demonstrated. This paper has introduced a novel framework, MSIC, that classifies malicious files to their corresponding families and distinguishes them from benign files. The evaluation experiments have shown the effectiveness of the proposed solution and showed its performance to be better than that of the grayscale framework used extensively in past research. MSIC's highest F-measure and accuracy results in classifying malware files to their corresponding families were 91.6% and 92.8%, respectively, compared to 90.6% and 92.3% for grayscale image classification. Further, MSIC scored 96% on F-measure and accuracy in distinguishing malicious from benign files, compared to 95.5% achieved with the grayscale solution. MSIC also proved to be faster than the grayscale framework in file conversion and resizing. This paper has also offered the cybersecurity community a publicly shared, labeled malware dataset that was privately compiled in our labs.
There are four possible directions in which this research may proceed. The first is the evaluation of new parameters during the building process of the classifier. The second is the deployment of different sampling rates and bit depths to compare their accuracy and loss results. The third is the evaluation of techniques other than resizing, such as zero-padding for small files and converting only a specific number of bytes for large files. The latter would help to reduce the resource requirements, since not all the bytes of the file are converted. The fourth is the study of the effectiveness of the proposed solution in other malware areas, especially malware authorship analysis [43].
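The third direction can be illustrated with a minimal sketch that brings every file to a fixed number of bytes before conversion, zero-padding short files and truncating long ones. This is only one possible reading of the proposal: the function name `fit_to_length`, the target length and the choice to keep the leading bytes of a large file are all assumptions.

```python
def fit_to_length(data, target_length):
    """Zero-pad short inputs and truncate long ones to target_length bytes."""
    if len(data) >= target_length:
        # Assumption: keep the leading bytes of a large file.
        return data[:target_length]
    # bytes(n) creates n zero bytes, used here as padding.
    return data + bytes(target_length - len(data))

short_file = b"\x4d\x5a"
long_file = b"\x4d\x5a\x90\x00\x03\x00"
print(fit_to_length(short_file, 4))
print(fit_to_length(long_file, 4))
```

Because every input then has the same length, the resulting images would share one size without a separate resizing step, at the cost of discarding part of each large file.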