1 Introduction

Malicious software (malware) is a serious threat to the security of computing systems [1, 2]. Kaspersky Lab alone detected more than 121,262,075 unique malware samples in 2015 [3], Panda Labs predicted that half of security issues are directly related to malware infections [4], and McAfee reported a 744% rise in OS X malware in 2016 compared with 2015 [5]. The increasing Mac OS X market size (second after Microsoft Windows [6]) and its fast adoption rate motivate cyber threat actors to shift their focus to developing OS X malware. The “myth” that OS X is a more secure system only further increases the malware success rate. For example, the OS X Flashback Trojan successfully infected over 700,000 machines in 2012 [7].

Fig. 1 Research methodology

Security researchers have developed a wide range of anti-malware tools and malware detection techniques in their battle against the ever-increasing volume of malware and potentially malicious programs, including approaches based on supervised and unsupervised machine learning techniques for malware detection [7]. In approaches using supervised techniques, tagged datasets of malicious and benign programs are required for training. Approaches using unsupervised techniques generally do not require the separation of malware and goodware, and programs are generally classified based on observable similarities or differences [8].

While there have been promising results on the use of machine learning in Windows and Android malware detection [9, 10], there has been very little prior work on using machine learning for OS X malware detection. This could be due to the lack of a suitable research dataset and the difficulties in collecting OS X malware.

In this paper, we propose a machine learning model to detect OS X malware based on the Radial Basis Function (RBF) kernel in the Support Vector Machine (SVM) technique. This provides a novel measure, based on an application’s library calls, to distinguish malware from benign samples. We then propose a new weighting measure for classifying OS X goodware and malware based on the frequency of library calls. This measure weights each library according to its frequency of occurrence in malware and benign applications.

The resulting datasets are then evaluated using five main classification techniques, namely Naive Bayes, Bayesian Net, Multi Layer Perceptron (MLP), Decision Tree-J48, and Weighted Radial Basis Function Kernels-Based Support Vector Machine (Weighted-RBFSVM). The following performance indicators are used for evaluating the performance of our machine learning classifiers:

True Positive (TP): the ratio of goodware correctly classified as benign;

True Negative (TN): the ratio of malware correctly detected as malware;

False Positive (FP): the ratio of malware files identified as benign; and

False Negative (FN): the ratio of goodware classified as malware.

Accuracy (ACC): the ratio of samples, both malware and benign (goodware), that a classifier labels correctly, computed using the following formula:

$$\begin{aligned} { ACC} = \frac{{ TP}+{ TN}}{{ FN}+{ TP}+{ FP}+{ TN}} \end{aligned}$$
(1)

The False Alarm Rate (FAR) is the rate at which a classifier wrongly detects goodware as malware, and is computed as:

$$\begin{aligned} { FAR} = \frac{{ FP}}{{ FP}+{ TN}} \end{aligned}$$
(2)
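As a quick illustration of Eqs. (1) and (2), the sketch below computes both indicators from raw confusion-matrix counts. The function names and the counts used in the usage note are hypothetical and are not taken from our experiments.

```python
def accuracy(tp, tn, fp, fn):
    """ACC as in Eq. (1): correctly labelled samples over all samples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


def false_alarm_rate(fp, tn):
    """FAR as in Eq. (2): FP / (FP + TN)."""
    if fp + tn == 0:
        raise ValueError("FP + TN must be non-zero")
    return fp / (fp + tn)
```

For example, with the illustrative counts TP = 90, TN = 80, FP = 10 and FN = 20, `accuracy(90, 80, 10, 20)` evaluates (90 + 80) / 200.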

Our research methodology is presented in Fig. 1.

The organization of this paper is as follows. Section 2 discusses related research, and Sect. 3 describes our dataset development. Sections 4 and 5 present our malware classification and a discussion of this work, respectively. Finally, we conclude in the last section.

2 Literature review

Machine learning techniques have been widely used for malware detection. Nauman et al. [11] used game-theoretic rough sets (GTRS) and information-theoretic rough sets (ITRS) to show that a three-way decision-making approach (acceptance, rejection and deferment) outperforms two-way (accept, reject) decision-making techniques in network flow analysis for Windows malware detection. Fattori et al. [12] developed an unsupervised system-centric behavioral Windows malware detection model with a reported accuracy of 90%. Their approach monitors interactions between applications and the underlying Windows operating system to classify malicious applications. Mohaisen et al. [13] proposed an unsupervised behavior-based (dynamic) Windows malware classification technique that monitors file system and memory interactions, and achieved more than 98% precision. Huda et al. [14] proposed a hybrid framework for malware detection based on programs' interactions with the Windows Application Program Interface (API), using Support Vector Machine (SVM) wrappers and statistical measures, and obtained over 96% detection accuracy.

Nissim et al. [15] proposed an SVM-based Active Learning framework to detect novel Windows malware using supervised learning with an average accuracy of 97%. Damodaran et al. [16] utilized Hidden Markov Models (HMMs) to trace APIs and Opcodes of Windows malware sequences and developed a fully dynamic approach for malware detection based on API calls with over 90% accuracy. Mangialardo and Duarte [17] proposed a hybrid supervised machine learning model using C5.0 and Random Forests (RF) algorithms with an accuracy of 93.00% for detecting Linux malware.

Due to the increasing use of smart devices such as Android and iOS devices, there has been a corresponding increase in the number of Android and iOS malware [18,19,20]. Suarez-Tangil et al. [21], for example, proposed an Android malware detection model. Yerima et al. [22] utilized ensemble learning techniques for Android malware detection and reported an accuracy rate between 97.33 and 99%, with a relatively low false alarm rate (less than 3%). Saracino et al. [23] designed MADAM, a host-based Android malware detection system, which was evaluated using real-world apps.

OS X malware has also been on the increase [24], but there is limited published research in OS X malware analysis and detection. For example, a small number of researchers have developed OS X malware and rootkit detection techniques, and malware detectors that trace suspicious activities in memory (such as unwanted access, read, write and execute) [25,26,27]. However, the application of machine learning to OS X malware detection is limited to the Walkup approach [28], which utilized Information Gain (IG) to select effective features for supervised classification of OS X malware. Hence, the development of machine learning techniques for OS X malware detection is the gap that this paper seeks to address.

3 Dataset development

Fig. 2 MacOS application bundle structure

As part of this research, we collected 152 malware samples from [29,30,31]. These samples were collected between January 2012 and June 2016; thus, the OS versions that can run them are, in order, OS X 10.8 (Mountain Lion), 10.9 (Mavericks), 10.10 (Yosemite) and 10.11 (El Capitan). Duplicate samples were detected by SHA-256 hash comparison and removed from the datasets. Known OS X malware such as WireLurker, MacVX, LaoShu, and Kitmos are among the samples in our dataset. Similar to previous datasets such as those of Masud et al. [32], in order to build a non-biased dataset for detecting malware as anomalous samples, we need at least 456 goodware samples (three times the number of malware samples) in our datasets.

To explain how the dataset was collected, we first present the overall structure of a Mac OS X application bundle in Fig. 2. Extracting an OS X application bundle usually reveals a directory named Contents, which holds the following files and components [33]:

Contents: This directory is the main part of each application bundle and contains the directories and files introduced below.

Info.plist: This file contains the configuration information for the application. The Mac operating system relies on the presence of Info.plist to obtain information about the application and other relevant files.

MacOS: Contains the application's executable code file (Mach-O). Usually, this directory holds only one binary file with the application's main entry point and statically linked code.

Resources: Contains all resource files of the application, e.g. pictures, audio and video.

Frameworks: Contains all private shared libraries of the application and the frameworks used by the executable code.

PlugIns: Contains all loadable files and libraries that extend the application's features and capabilities.

SharedSupport: Contains all non-critical resources that do not extend the application's capabilities.
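The bundle layout above can be explored programmatically. The following is a minimal sketch, assuming the bundle follows the standard layout and that its Info.plist carries the `CFBundleExecutable` key naming the main binary; the helper names `bundle_components` and `main_executable` are hypothetical and are not part of our tooling.

```python
import plistlib
from pathlib import Path

# Entries of the Contents directory described above.
STANDARD_ENTRIES = ("Info.plist", "MacOS", "Resources",
                    "Frameworks", "PlugIns", "SharedSupport")


def bundle_components(bundle_path):
    """Return the standard entries that are present in <bundle>/Contents."""
    contents = Path(bundle_path) / "Contents"
    return [name for name in STANDARD_ENTRIES if (contents / name).exists()]


def main_executable(bundle_path):
    """Locate the Mach-O binary named by CFBundleExecutable in Info.plist."""
    contents = Path(bundle_path) / "Contents"
    with open(contents / "Info.plist", "rb") as handle:
        info = plistlib.load(handle)
    return contents / "MacOS" / info["CFBundleExecutable"]
```

Calling `main_executable("Some.app")` on an extracted bundle would return the path of the binary under Contents/MacOS, which is the file examined in the rest of this section.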

Fig. 3 The process of dataset development

Therefore, we randomly downloaded a total of 460 apps from the top 100 apps listed in the Utilities, Social Network, Weather, Video and Audio, Productivity, Health and Fitness, and Network categories of the Apple App Store [34] as of June 2016. Benign samples dominate the collected dataset so that the classifier, trained on more goodware, can detect malware as anomalies with a desirable false alarm rate, as in the real-world anomaly detection benchmark datasets provided in [35,36,37]. We then manually extracted the Mach-O binaries of all malware and benign samples in the respective datasets. Mach-O binaries are the executable portion of an OS X application [38] and consist of three sections as follows (see also Fig. 3):

  1. Header contains common information about the binary such as byte order (magic number), CPU type, and number of load commands.

  2. Load Commands contains information about the logical structure of an executable file and data stored in the virtual memory such as symbol table, dynamic symbol table, etc.

  3. Segments is the biggest part of each Mach-O file and contains the application code and data.
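To make the Header item concrete, the sketch below decodes the fixed-size header fields of a thin (single-architecture) Mach-O file with Python's `struct` module. It is an illustration only, not the script referred to in this paper; the function name and the returned dictionary keys are assumptions, and fat (multi-architecture) binaries are not handled.

```python
import struct

# Magic numbers for thin Mach-O files; the byte order in which the magic
# matches tells us the byte order of the remaining header fields.
MH_MAGIC = 0xFEEDFACE      # 32-bit
MH_MAGIC_64 = 0xFEEDFACF   # 64-bit


def parse_macho_header(data):
    """Decode magic, CPU type, file type and load-command counts."""
    if len(data) < 28:
        raise ValueError("too short for a Mach-O header")
    for endian in ("<", ">"):
        (magic,) = struct.unpack_from(endian + "I", data, 0)
        if magic in (MH_MAGIC, MH_MAGIC_64):
            break
    else:
        raise ValueError("not a thin Mach-O file")
    cputype, cpusubtype, filetype, ncmds, sizeofcmds, flags = \
        struct.unpack_from(endian + "iiIIII", data, 4)
    return {
        "is_64_bit": magic == MH_MAGIC_64,
        "little_endian": endian == "<",
        "cputype": cputype,
        "cpusubtype": cpusubtype,
        "filetype": filetype,
        "ncmds": ncmds,            # number of load commands
        "sizeofcmds": sizeofcmds,  # total size of load commands in bytes
        "flags": flags,
    }
```

In use, one would read the first bytes of the binary located under Contents/MacOS and pass them to `parse_macho_header`.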

We wrote a Python script [39] to extract features from Mach-O files (Table 1). Our script parsed each Mach-O binary and created three separate output files as follows:

Mach-O HD: This file contains all Mach-O Header information such as CPU type, number of commands, and size of commands.

Mach-O LC: This file includes all information about library import/export, symbol table and string functions.

Mach-O SG: This file provides the raw data of three Mach-O file sections (i.e. Data, Text and Segment) (Table 1).
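Since the library names used later come from the load commands, the following sketch shows one way they could be read. It assumes a thin, little-endian Mach-O file and only considers the `LC_LOAD_DYLIB` command (value `0xC`); weak or re-exported library commands are ignored. The function name `linked_libraries` is hypothetical and the code is not the extraction script cited above.

```python
import struct

LC_LOAD_DYLIB = 0xC  # load command that names a linked dynamic library


def linked_libraries(data):
    """Return the library paths named by LC_LOAD_DYLIB load commands."""
    (magic,) = struct.unpack_from("<I", data, 0)
    if magic == 0xFEEDFACF:      # 64-bit header is 32 bytes long
        offset = 32
    elif magic == 0xFEEDFACE:    # 32-bit header is 28 bytes long
        offset = 28
    else:
        raise ValueError("expected a thin little-endian Mach-O file")
    (ncmds,) = struct.unpack_from("<I", data, 16)
    libraries = []
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from("<II", data, offset)
        if cmdsize < 8:
            raise ValueError("corrupt load command")
        if cmd == LC_LOAD_DYLIB:
            # The name offset is relative to the start of this command.
            (name_offset,) = struct.unpack_from("<I", data, offset + 8)
            raw = data[offset + name_offset:offset + cmdsize]
            libraries.append(raw.split(b"\x00", 1)[0].decode("utf-8", "replace"))
        offset += cmdsize
    return libraries
```

The returned list (for instance, paths ending in libSystem or libsqlite3) is the kind of input that a library-frequency feature needs.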

3.1 Data preprocessing

Similar to many other malware machine learning datasets, our datasets include several features with missing values; thus, we utilized the K-Nearest Neighbor (KNN) imputation technique [40] to estimate missing values. The imputation is performed in two steps, as follows:

  • Compute the Euclidean distance between each sample with a missing value (i.e. \(x_i\)) and all other samples without a missing value to find the K nearest samples.

  • Impute the missing value of \(x_i\) by computing the average value of the K nearest samples.
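The two steps above can be sketched in plain Python as follows. This is an illustrative version only: missing values are represented by `None`, distances are computed over the features that are observed in the incomplete sample, and the function name and default `k` are assumptions rather than the settings used in our experiments.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest complete rows."""
    complete = [row for row in rows if None not in row]
    if not complete:
        raise ValueError("at least one complete row is required")
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [d for d, value in enumerate(row) if value is not None]

        def distance(other, row=row, observed=observed):
            # Step 1: Euclidean distance over the observed dimensions.
            return math.sqrt(sum((row[d] - other[d]) ** 2 for d in observed))

        nearest = sorted(complete, key=distance)[:k]
        # Step 2: average the neighbours' values in each missing dimension.
        imputed.append([
            value if value is not None
            else sum(n[d] for n in nearest) / len(nearest)
            for d, value in enumerate(row)
        ])
    return imputed
```

With `k=2`, a row `[1.5, None]` placed among `[1.0, 10.0]`, `[2.0, 20.0]` and `[10.0, 100.0]` receives the average of the second feature of its two closest neighbours.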

Since the extracted feature values lie in different ranges, a normalization technique is used to improve SVM performance. As all extracted features are integer values (except Library Name), Eq. 3 can be used to convert them to the \([0, 1]\) interval.

$$\begin{aligned} X_n= & {} \frac{x_i - { min}\{{ feature}_d\}}{{ range}_d}, \nonumber \\ { range}_d= & {} { max}\{{ feature}_d\} - { min}\{{ feature}_d\} \end{aligned}$$
(3)
Table 1 OS X dataset features

In Eq. 3, \(X_n\) and \(x_i\) denote the normalized value and the raw extracted value, respectively, of the feature in the dth dimension. Figure 4 shows the overlap between two feature vectors of the malicious and benign classes in the collected datasets before and after preprocessing. After preprocessing, the overlaps are minimal and the class borders are more distinct.
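The min-max normalization of Eq. 3 can be illustrated for a single feature column as follows. This is a sketch; the function name is hypothetical, and mapping a constant column to zeros is an assumption made here to avoid division by a zero range.

```python
def min_max_normalize(column):
    """Scale one feature column to the [0, 1] interval as in Eq. 3."""
    low, high = min(column), max(column)
    feature_range = high - low
    if feature_range == 0:
        # Constant feature: assumption, map every value to 0.
        return [0.0 for _ in column]
    return [(value - low) / feature_range for value in column]
```

For instance, the raw values 10, 20 and 30 are mapped to the lower bound, the midpoint and the upper bound of the interval.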

Fig. 4 (a) Probability density function (PDF) of the sizeOfcmds and bindSize features before pre-processing; (b) PDF of the sizeOfcmds and bindSize features after pre-processing

3.1.1 Feature selection

Feature selection techniques are used to find the most relevant attributes for classification. At this stage, three feature selection techniques commonly used for malware detection based on code inspection (Information Gain, Chi-Square and Principal Component Analysis), as in Shabtai et al. [41, 42], were applied. Information Gain (IG) [43] is a technique used to evaluate attributes to find an optimum separation in classification, based on mutual dependencies of labels and attributes. Chi-square measures the lack of independence between attributes [44]. Principal Component Analysis (PCA) can be used to perform feature selection and extraction; we used it as a feature selection mechanism to select the most informative features for classification. After the feature selection methods were used to calculate the relevance scores, the features with the highest scores were retained.

Suppose we have m class labels (\(m=2\) for binary classification), \(c_i\) denotes the ith class, and t is the attribute dimension to be evaluated. The IG scores can then be obtained using Eq. (4) as follows:

$$\begin{aligned} \begin{aligned} G(t)&= - \sum _{i=1}^{m} { Pr}(c_i) \log { Pr}(c_i) + { Pr}(t) \sum _{i=1}^{m} { Pr}(c_i | t) \log { Pr}(c_i | t) \\&\quad + { Pr}(\bar{t}) \sum _{i=1}^{m} { Pr}(c_i | \bar{t})\log { Pr}(c_i | \bar{t}) \\ { IG}&= G(t) - G(t_i) \end{aligned} \end{aligned}$$
(4)
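As an illustration of the G(t) score, the sketch below evaluates it for a binary attribute (present or absent in each sample), with the class entropy term followed by the two conditional terms weighted by Pr(t) and Pr(t̄). The use of base-2 logarithms and the function name are assumptions made for this example.

```python
import math
from collections import Counter


def information_gain(present, labels):
    """G(t) for a binary attribute t given per-sample class labels."""
    def plogp_sum(subset):
        # sum_i Pr(c_i | subset) * log2 Pr(c_i | subset); 0 for an empty subset
        if not subset:
            return 0.0
        size = len(subset)
        return sum((count / size) * math.log2(count / size)
                   for count in Counter(subset).values())

    with_t = [label for flag, label in zip(present, labels) if flag]
    without_t = [label for flag, label in zip(present, labels) if not flag]
    pr_t = len(with_t) / len(labels)
    return (-plogp_sum(list(labels))
            + pr_t * plogp_sum(with_t)
            + (1.0 - pr_t) * plogp_sum(without_t))
```

An attribute that perfectly separates two equally sized classes obtains the full class entropy as its score, whereas an attribute that is independent of the class obtains a score of zero.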

The Chi-Square method calculates the \({\chi ^2_{{ avg}}}(t)\) score function for attributes as per Eqs. 5 and 6, where N is the sample size, A is the frequency of co-occurrence of t and c, B is the frequency of occurrence of t without c, C is the number of times c occurs without t, and D is the frequency with which neither t nor c occurs.

$$\begin{aligned} \chi ^2 (t,c)= & {} \frac{N \times ({ AD}-{ CB})^2}{((A+C)\times (B+D)\times (A+B)\times (C+D) )}\nonumber \\ \end{aligned}$$
(5)
$$\begin{aligned} \chi _{{ avg}}^{2} (t)= & {} \sum _{i=1}^{m} { Pr}(c_i )\chi ^2 (t,c_i) \end{aligned}$$
(6)
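Equation 5 can be evaluated directly from the four contingency counts. The sketch below is illustrative; the function name is hypothetical, and returning zero when a marginal total is empty is an assumption made to avoid division by zero.

```python
def chi_square(a, b, c, d):
    """Chi-square score of attribute t and class c as in Eq. 5.

    a: t and c co-occur, b: t without c, c: c without t, d: neither.
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # assumption: an empty marginal carries no evidence
    return n * (a * d - c * b) ** 2 / denominator
```

When the attribute and the class are independent (for example A = B = C = D), the numerator vanishes and the score is zero; the score grows as the attribute becomes more strongly associated with the class.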

These feature selection methods provided us a sequence of effective features after applying them on the collected datasets based on their parameters (see Tables 2 and 3).

Table 2 Selected features from the three different techniques
Table 3 Features obtained values from ranker search method to select appropriate feature
Fig. 5 SMOTE technique uses KNN to generate synthetic samples

3.2 Library weighting

One of the extracted features is the set of system libraries called by an application. In this phase, the probability of each system library being called is calculated. For each system library, two indicators are calculated: first, the overall occurrence probability of the library in the dataset; second, the occurrence probability of the library in each of the malware and goodware classes. Then, the sample weight (SW) of each library is calculated for both the malign and benign classes as per Eqs. (7) and (8).

$$\begin{aligned} { SW}_{i|m}= & {} \frac{\sum _{[j=1]}^{n} { freq}({ lib}_j|m)_i}{\sum _{[v=1]}^{n} { lib}_v|m} \end{aligned}$$
(7)
$$\begin{aligned} { SW}_{i|b}= & {} \frac{\sum _{[j=1]}^{n} { freq}({ lib}_j|b)_i}{\sum _{[v=1]}^{n} { lib}_v|b} \end{aligned}$$
(8)

In the above equations, \({ SW}_{i|m,b}\) represents the ith sample weight for each class (malign or benign) and \({ freq}({ lib}_j|m)_i \) is the number of occurrences of the jth library (lib) called by the ith application in the malign (m) or benign (b) class (i.e. \({ lib}_v|m\) denotes the vth library in the malign class). After these two measures are calculated, we use them as new features for classification.
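One possible reading of Eqs. (7) and (8) is sketched below: for each class, the class-wide call counts of the libraries used by an application are summed and divided by the total number of library calls observed in that class. This interpretation, the function names and the toy data in the usage note are assumptions made for illustration; they are not a definitive restatement of our implementation.

```python
from collections import Counter


def class_library_counts(apps):
    """Count library occurrences per class.

    apps: iterable of (label, libraries), where label is 'm' or 'b'.
    """
    counts = {"m": Counter(), "b": Counter()}
    for label, libraries in apps:
        counts[label].update(libraries)
    return counts


def sample_weights(libraries, counts):
    """Return the malign ('m') and benign ('b') weights of one application."""
    weights = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        hits = sum(counter[lib] for lib in set(libraries))
        weights[label] = hits / total if total else 0.0
    return weights
```

For a toy dataset in which libc and libsqlite3 appear mainly in malware, an application calling exactly those two libraries would receive a high malign weight and a low benign weight.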

3.3 SMOTE dataset development

Synthetic Minority Over-sampling Technique (SMOTE) [45] is a supervised re-sampling technique for balancing minority classes. SMOTE uses the K-Nearest Neighbors (KNN) algorithm to find the best location in each dimension to generate synthetic samples (see Fig. 5). We used SMOTE to create three datasets of double, triple and quintuple the size of the original dataset, all with the same class proportions as the original dataset (see Table 4). We believe our collected datasets pave the way for future research on the application of machine learning to OS X malware detection.
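The core SMOTE step, interpolating between a sample and one of its K nearest neighbours of the same class, can be sketched as follows. This is a simplified illustration of the general technique rather than the exact procedure or tool used to build our datasets; the function name, the default `k` and the fixed random seed are assumptions.

```python
import math
import random


def smote(samples, n_new, k=5, seed=0):
    """Generate n_new synthetic points from same-class numeric samples."""
    if len(samples) < 2:
        raise ValueError("at least two samples are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # K nearest neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Applying the function separately to the malware and goodware feature vectors, with `n_new` chosen per class, keeps the class proportions of the enlarged dataset equal to those of the original.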

Table 4 Applied collected and synthetic datasets distribution
Fig. 6 Support vectors and maximizing margin

Fig. 7 Added library-weighting features and corresponding support vectors

Table 5 Supervised classification results by cross-validation

4 OS X malware classification

Five main supervised classification techniques, Naive Bayes, Bayesian Net, Multi Layer Perceptron (MLP), Decision Tree-J48, and Weighted Radial Basis Function Kernels-Based Support Vector Machine (Weighted-RBFSVM), are then evaluated using our datasets. The main classification task of the proposed methodology is developed using SVM.

The SVM algorithm in [46] separates data in an N-dimensional space using hyperplanes, with a different category on each side of a hyperplane. The hyperplane with the largest margin is then used for classification. The training data samples are given as labeled pairs (X, Y), where X is the dataset feature vector (containing features \(\{x_1, x_2, x_3, \ldots , x_n\}\)) and Y represents the labels (malicious or benign) for the X features.

Both X and Y are fed as inputs to the SVM classifier. SVM is then used to maximize the margin between the given classes and obtain the best classification result. The boundary of the margin function is defined by the support vector data samples. This margin is calculated from candidate support vectors, which are those nearest to the optimized margin (the largest margin that separates the two types of data); see Fig. 6.

The problem of maximizing margin in SVM can be solved using Quadratic Programming (QP) as shown in Eq. (9).

$$\begin{aligned} { Minimize}:W(\alpha )\!\!= & {} \!-\!\! \sum _{k=1}^{l} \alpha _k \!+\! \frac{1}{2} \sum _{k=1}^{l} \sum _{p=1}^{l} \gamma _k\gamma _p\alpha _k\alpha _p k(\chi _k,\chi _p)\nonumber \\&{ subject}~{ to}{:}~\forall k:0\le \alpha _k \le C\quad { and}\quad \nonumber \\&\sum _{k=1}^{l}\alpha _k \gamma _k = 0 \end{aligned}$$
(9)
Fig. 8 Accuracy and false alarm rates among original dataset and synthetic dataset

Table 6 Supervised classification results by cross-validation

Fig. 9 Percentage of library intersection in the collated dataset

In the above equation, l denotes the number of training objects, \(\alpha \) is the vector of l variables in which component \(\alpha _k\) corresponds to training sample \(x_k\) (written \(\chi _k\) in Eq. 9), \(\gamma _k\) is the class label of \(x_k\), and C is the margin parameter, which controls the effects of noise and outliers within the training set. Training samples with \(\alpha _k\) greater than zero are the support vector objects. Those with an \(\alpha _k\) value of zero are non-support vector objects; thus, they are not considered in the calculation of the margin function.

For better separation, the data points enter the QP problem through the SVM kernel function \(k(x_k,x_p)\) (see Eq. 9). Kernel functions map training data into higher dimensions to find a separating hyperplane with a maximum margin [47].

Common kernel functions for the SVM classifier include the linear, polynomial, RBF and sigmoid kernels. In this research, due to the proximity of the data (see Fig. 4), the RBF kernel function [48] is utilized (see Eq. 10).

$$\begin{aligned} k(\chi _k,\chi _p) = \exp (-\gamma || \chi _k - \chi _p ||^2) \end{aligned}$$
(10)
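To connect Eqs. (9) and (10), the sketch below evaluates the RBF kernel and the dual objective W(α) for a given set of multipliers. It only evaluates the objective and does not solve the QP problem. The function names are hypothetical, and the class labels are assumed to be encoded as +1 and -1.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel of Eq. (10): exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def dual_objective(alpha, labels, points, gamma):
    """W(alpha) of Eq. (9) with the RBF kernel; labels are +1 or -1."""
    size = len(points)
    quadratic = 0.0
    for k in range(size):
        for p in range(size):
            quadratic += (labels[k] * labels[p] * alpha[k] * alpha[p]
                          * rbf_kernel(points[k], points[p], gamma))
    return -sum(alpha) + 0.5 * quadratic
```

A QP solver would search for the multipliers that minimize this value subject to the box and equality constraints of Eq. (9); samples that end up with non-zero multipliers are the support vectors.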

Although SVM is a promising supervised classifier, it has its own drawbacks. The performance and accuracy of the SVM technique rely heavily on the complexity, structure and size of the training data [49]. In our research, the size of the training dataset is suitable for SVM classification and there are not too many features. Moreover, our dataset is normalized, which reduces the complexity of the training set.

5 Findings and discussion

Using the library-weighting measure, we created two new features, namely lib-w-b (library-weight-benign) and lib-w-m (library-weight-malware), to increase the accuracy of classification (see Fig. 7). Table 5 presents the evaluation results of Naive Bayes, Bayesian Net, MLP, Decision Tree-J48, and Weighted-RBFSVM on the original dataset with the tenfold Cross Validation (CV) technique. Due to data normalization and well-separated features (shown in Fig. 7), the weighted-RBFSVM offers the highest accuracy (91%) and the lowest false alarm rate (3.9%) (Table 6).

Table 6 shows the results of evaluating Naive Bayes, Bayesian Net, MLP, Decision Tree-J48, and Weighted-RBFSVM against our three SMOTE datasets using the tenfold Cross Validation (CV) technique. While accuracy increased in all cases, and much higher accuracy was obtained (e.g. a 96.62% detection rate for Decision Tree-J48 on SMOTE_5X), the false alarm rate was not reduced and more training time was required due to the larger size of the datasets [50]. In addition, the complexity of the classification techniques was reduced by the two newly added features (lib-w-b, lib-w-m). For instance, the J48 tree had 65 nodes and 35 leaves before the two new features were added, and 55 nodes and 33 leaves afterwards.

Figure 9 depicts the frequency of occurrence of each library call in the original dataset.

Figure 8 depicts the accuracy and false alarm rate for the original and SMOTE datasets. While the SMOTE datasets are significantly bigger than the original dataset, the proposed model obtained a lower false alarm rate on the original dataset with almost the same accuracy as on the SMOTE datasets.

Fig. 10 KS density function for segments

Fig. 11 KS density function for SectionsData

A comparison of low-ranked features (i.e. Segments, SectionsTEXT, SectionsData) using Kernel Smooth (KS) density estimation shows a significant overlap between the low-ranked features of malware and benign applications (see Fig. 10); hence, these features are not suitable for classification. The experiments on KS density estimation also suggested that the data and text sections had the most overlap in comparison to other extracted features (see Figs. 11 and 12). According to Fig. 13, the KS density estimate of the library-weighting feature provides a distinction between malware and benign samples, since the two curves (malware and benign) are almost orthonormal: the peak of one curve coincides with the opposite trend in the other. Therefore, this feature is highly appropriate for classification.
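The kind of comparison described above can be sketched with a Gaussian kernel density estimate per class and a numerical estimate of the area shared by the two curves. The Gaussian kernel, the fixed bandwidth, the overlap measure and the function names are assumptions chosen for illustration; they are not necessarily the settings behind Figs. 10 to 13.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of one feature."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def density_overlap(samples_a, samples_b, bandwidth, steps=2000):
    """Approximate the area under min(f_a, f_b) with the midpoint rule."""
    f_a = gaussian_kde(samples_a, bandwidth)
    f_b = gaussian_kde(samples_b, bandwidth)
    low = min(min(samples_a), min(samples_b)) - 4 * bandwidth
    high = max(max(samples_a), max(samples_b)) + 4 * bandwidth
    dx = (high - low) / steps
    area = 0.0
    for i in range(steps):
        x = low + (i + 0.5) * dx
        area += min(f_a(x), f_b(x))
    return area * dx
```

An overlap close to one indicates that a feature barely distinguishes the two classes, while a value close to zero indicates well-separated class densities.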

Fig. 12 KS density function for SectionsTEXT

Fig. 13 KS density function for lib-weighting

Fig. 14 Application call library statistics for malign and benign applications

As shown in Fig. 14, the CoreGraphics, CoreLocation, CoreServices and WebKit libraries were called far more often by benign applications, while libc and libsqlite3 were called significantly more often by malware. Statistical analysis of the library calls revealed that applications that call audio- and video-related libraries (AudioToolbox and CoreGraphics) are mostly benign, while malicious apps more frequently call system libraries (i.e. libSystem) and SQLite libraries.

6 Conclusion and future work

In this paper, we developed four OS X malware datasets and a novel measure based on library calls for the classification of OS X malware and benign applications. We obtained an accuracy of 91% and a false alarm rate of 3.9% using the weighted RBF-SVM algorithm on our original dataset. Moreover, using Decision Tree-J48, we obtained 96.62% accuracy on the SMOTE_5X dataset with a slightly higher false alarm rate (4%). The synthetic datasets were generated using the SMOTE technique and assessed with the same supervised algorithms. This experiment was conducted to show the effect of sample size on detection accuracy. Our results indicate that increasing the sample size may increase detection accuracy but adversely affect the false alarm rate. OS X malware detection and analysis utilising dynamic analysis techniques is a potential direction for future work. Extending classification using other techniques such as fuzzy classification, applying deep learning to OS X malware detection, and using a combination of our suggested features for OS X malware detection are also interesting directions for future work.