An Android Malware Detection Model Based on DT-SVM

In order to improve the accuracy and efficiency of Android malware detection, an Android malware detection model based on a decision tree (DT) combined with a support vector machine (SVM), called DT-SVM, is proposed. First, the raw Dalvik opcodes are extracted by reverse engineering Android applications, and the feature vector of each sample is generated using the n-gram model. Then, a decision tree is trained on the samples, and its decision nodes are updated to SVM nodes from the bottom up according to the evaluation results of the test set on each decision path. The model effectively combines DT with SVM: while high-accuracy decision paths are preserved, SVM is used to reduce the overfitting of DT and thus improve its generalization ability, and the advantage of SVM on small training sets is retained. Finally, several simulation experiments are carried out, and the results demonstrate that the improved algorithm achieves better accuracy and higher speed than other malware detection approaches.


Introduction
In recent years, the mobile Internet has played a leading role in the evolution of the Internet, and smartphones have become an almost indispensable tool in people's daily lives. Smartphone penetration among adults in developed countries will reach 90 percent by the end of 2023, compared with 85 percent in 2018, and global smartphone sales will reach 1.85 billion units, a 19% increase over 2018 [1]. According to [2], worldwide sales of smartphones to end users are on track to reach 1.57 billion units in 2020, an increase of 3% year over year. Although smartphone sales declined slightly in 2019, Gartner forecasts that sales of 5G mobile phones will total 221 million units in 2020 and more than double in 2021, to 489 million units; there is no doubt that the gradual maturing of 5G technology will also push the demand for smartphones considerably higher.
Currently, the common operating systems of smartphone terminals include iOS, Android, and Windows Phone. Among them, Android has become the dominant operating system with the highest global market share because of its open-source nature, which gives users and developers the flexibility to customize basic functionality [3]. According to survey data released by Gartner, the share of the Android system in 2017 was as high as 85.9% [4]. However, the increasing popularity of Android is accompanied by the proliferation of malware. In 2018, the 360 Internet Security Center intercepted about 4.342 million new malicious samples on mobile terminals, an average of about 12,000 new samples per day. The new malware types are mainly tariff consuming, accounting for about 63.2%, followed by privacy theft (33.7%), malicious deduction (1.6%), rogue behavior (1.2%), and remote control (0.3%) [5]. Malicious terminal applications endanger users' interests by allowing unauthorized access to privacy-sensitive information, rooting devices, monitoring users' daily behaviors, etc. [6]. The amount of malware continues to grow at a faster rate each year and poses a serious security threat; antivirus vendors detect thousands of new malware samples daily, and there is still no end in sight [7]. In particular, with the gradual maturing of 5G technology, which marks the arrival of the era of intelligent networking and the industrial Internet, the Internet of everything will make the harm caused by malware more severe and more widespread; hence, malware detection has been and will remain a critical topic in computer security.
In this study, we develop a DT with SVM algorithm (DT-SVM) to improve the efficiency and accuracy of malware detection on the Android platform. The major contributions of this work can be summarized as follows: (i) We develop an advanced machine learning algorithm, which first extracts the opcodes of the samples; then, n-gram is utilized to vectorize the samples, which are used to train the decision tree; and, finally, the nodes with high error are updated from the bottom up to SVM nodes. The algorithm combines the advantages of DT and SVM; while high accuracy is maintained, the SVM nodes are employed to reduce the overfitting caused by DT. Therefore, the algorithm takes full advantage of SVM on small sample sets and achieves a better classification effect than DT or SVM alone. (ii) We design an Android malware detection framework based on the DT-SVM algorithm. The framework is trained with the improved learning algorithm on malicious and benign applications; the feature vectors of these applications are generated by Android reverse engineering, feature engineering, and n-gram, and they are used as the input of the proposed algorithm for malware detection. In this way, users can employ our proposed framework to determine whether an application is malicious or benign before installation; thus, the security of the Android platform can be greatly improved. (iii) We verify the effectiveness of our algorithm on real-world benign applications and malware, compare the proposed algorithm on the same dataset with the shallow learning algorithms DT and SVM and the deep learning algorithms CNN and LSTM, and use four evaluation metrics (Precision, ACC, Recall, and F1) as well as time consumption to measure the performance of the algorithm. The results demonstrate that our proposed algorithm performs better than SVM, DT, and LSTM in almost all metrics and performs better than CNN in some metrics.
All four metrics (Precision, Recall, ACC, and F1) increase by nearly 0.01% compared with SVM, while the time consumption is reduced to one-tenth; compared with DT, each metric increases by nearly 0.03%, with little change in time consumption. Compared with CNN, although ACC and F1 are lower, Precision and Recall are higher; furthermore, our algorithm takes less time, and the implementation process is much simpler. Compared with LSTM, our method performs better in all metrics.
The remainder of this paper is organized as follows. Section 2 reviews current work on Android malware detection. Section 3 describes the related methodology. Section 4 describes the proposed classification algorithm. Section 5 illustrates the Android malware detection framework and explains the specific process of applying the proposed algorithm to the detection of malicious applications. Section 6 verifies the effectiveness of the proposed algorithm on Android applications. Section 7 concludes the paper and points out the main limitations and future directions.

Related Work
There have been many achievements in detecting malware on the Android platform, which can be divided into two analysis approaches, that is, static analysis and dynamic analysis [8]. Static analysis is the process of analyzing the code or binary without executing it. Dynamic analysis is the process of studying traces of the malware (API, system calls, permission, etc.) by running the sample in a controlled and isolated environment [9]. Traditionally, malware detectors have been built on handmade detection patterns that are usually not applicable to new instances of malware; the increasing number and diversity of these applications make traditional defenses largely ineffective, and Android smartphones often fail to protect themselves from new malware [10]. Machine learning technology can potentially detect never-before-seen attacks or variants of known malware thanks to its strong generalization and prediction ability; consequently, machine learning-based methods are increasingly applied to Android malware detection, and researchers have continuously worked on improving the classic algorithms. The shallow learning model and the deep learning model are the two main types of machine learning techniques [11]. The shallow learning model usually includes SVM, DT, k-means, and k-nearest neighbor (KNN) algorithms, etc. [12]. Reference [13] improved the accuracy of the classifier by using machine learning to extract features from the system calls of Android malware. Because of the high feature dimension in Dalvik opcode-based detection, [14,15] utilized two strategies, probability statistics and feature extraction, to effectively reduce the dimensionality of the extracted features, and a linear SVM was employed for classification; therefore, the detection efficiency was improved.
Based on the characteristics of the permission information and Intent information in the AndroidManifest.xml file, an improved random forest algorithm based on weighted voting was proposed in [16], which solved the inability to distinguish strong and weak classifiers. Nancy and Sharma [17] compared the network traffic of malware with that of benign applications to find the characteristics that distinguish the two types of traffic and built a DT classifier to detect normal and malicious applications in the test dataset. The results showed that the network traffic analysis method was efficient in detecting Android malware, with an accuracy rate of more than 90%. Nevertheless, most of the work mentioned above has not achieved decent performance. Recently, Android malware researchers have also been exploring deep learning classifiers for malware analysis to increase detection accuracy [18]. Cui et al. [19] took advantage of the performance of deep learning in image recognition; the malicious code was converted into a grayscale image as the input of a CNN under the condition of a fixed image size, which is not realistic in a real scenario. Therefore, the performance of this method fluctuated when processing images of different sizes. To improve the accuracy of malware detection and reduce the training time, Wang et al. [20] proposed a hybrid model based on a deep autoencoder (DAE) and a convolutional neural network (CNN); the experiments demonstrated a significant improvement over traditional machine learning methods in Android malware detection. Wang et al. [21] ranked the permissions with respect to their risk to the Android system and evaluated the feasibility of using permission requests for malapp detection with different subsets of risky permissions and classification algorithms; the detection rate can reach 94.62%.
Furthermore, the authors of [22] considered the issue of user privacy leakage and implemented a framework called 'Alde' to detect the users' in-app actions collected by analytics libraries; the experimental results show that some apps indeed leak users' personal information through analytics libraries. Lei et al. [23] adopted more advanced features than the API event behavior model as a data source, using different behavior patterns of events and the semantic relationships between events to detect malicious software. This method can effectively solve the problem of obfuscation and deformation. However, the experiment performed well only on the malware dataset provided in 2013. As the complexity of the malware increased, the detection ability declined.
In summary, there are two ways to improve the detection accuracy and efficiency of Android malware: the first is to optimize feature selection and the detection model, and the second is to optimize the classification algorithm. We mainly focus on the latter and improve the classic classification algorithms in this study. SVM is simple and can achieve high classification accuracy. However, it is only suitable for small samples; if the sample set is large, it consumes a lot of time and has a high false positive rate. DT is prone to overfitting, which leads to weak generalization ability. To overcome these limitations, our work proposes an advanced learning algorithm based on static features that combines the advantages of the SVM and DT algorithms, and the experimental results are quite good. In the next section, we explain the methodology.

N-Gram
The n-gram model is derived from Natural Language Processing (NLP) and is commonly used in large-scale continuous speech recognition. It assumes that the occurrence of the Nth word depends only on the preceding N − 1 words and not on any other words. Hence, the probability of an entire sentence equals the product of the conditional probabilities of its words. N-gram can also be used in malware detection. As early as 2008, Moskovitch et al. [24] proposed the opcode n-gram scheme and achieved good detection results.

Support Vector Machine (SVM)
SVM [25] is a two-class model whose basic form is a linear classifier with the maximum margin in the feature space. It can also solve nonlinear problems by employing the kernel trick [26]. The learning strategy of SVM is to maximize the margin, which can be formalized as a convex quadratic programming problem; SVM is therefore also called the maximum-margin algorithm. Its advantage lies in its strong generalization ability, which makes it suitable for nonlinear, small-sample, and high-dimensional problems. Taking the linearly separable SVM as an example, the principle of SVM is to search for a separating hyperplane in the given feature space that divides the sample space into two parts, one for the positive class and the other for the negative class.
The hyperplane H in the support vector machine can be represented by the equation $w \cdot x + b = 0$, where w is the normal vector and b is the intercept, as shown in Figure 1.
When the training samples are linearly separable, there are many straight lines that can correctly classify the two types of samples, and SVM finds the line that separates them with the largest margin. SVM also supports nonlinear classification, whose main feature is the use of the kernel trick. The basic idea is to map the input space to a feature space through a nonlinear transformation, so that a hypersurface model in the input space corresponds to a hyperplane model in the feature space. The radial basis function (RBF) is one of the commonly used kernel functions.

Definition 1. Gaussian kernel function
$$K(x, z) = \exp\left(-\frac{\|x - z\|_2^2}{2\sigma^2}\right) \qquad (1)$$

Here, $\|x - z\|_2^2$ is the squared Euclidean distance between two feature vectors, and $\sigma$ is a free parameter.

Decision Tree (DT)
DT [27] is a basic classification and regression method. It organizes the classification process as a tree structure in which samples are classified according to their features, and it can also be regarded as a collection of if-then rules. DT is widely used because of its intuitive feature description, high classification accuracy, and simple implementation [28]. The learning process of DT is to find a mapping between object attributes and object values, which enables it to generalize a set of classification rules, represented by a tree structure, from random samples. The decision paths of DT have two important properties, mutual exclusion and completeness; that is, each instance is covered by one and only one path. The learning algorithm of DT includes feature selection, decision tree generation, and pruning. The widely used generation algorithms of DT are ID3, C4.5, and CART. The Gini index is used for optimal feature selection in the CART algorithm.

Security and Communication Networks

Definition 2. Gini index
In the classification problem, suppose that there are K classes and the probability that a sample belongs to the kth class is $p_k$; then, the Gini index of the probability distribution is defined as

$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2 \qquad (2)$$

In the dichotomy problem, the Gini index of the sample set D is expressed as

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{2} \left(\frac{|C_k|}{|D|}\right)^2 \qquad (3)$$

Here, $|C_k|$ represents the number of samples in category k, and $|D|$ represents the total number of samples. The Gini index indicates the uncertainty of the sample set. The larger the Gini index, the greater the uncertainty of the sample set.
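As a minimal illustration of Definition 2, the following Python sketch computes the Gini index of a labeled sample set and the weighted Gini index of a binary split, which is the standard CART selection criterion. The function names `gini_index` and `gini_split` and the toy labels are hypothetical and are not taken from the original work.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a sample set: 1 - sum_k (|C_k| / |D|)^2."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_split(left_labels, right_labels):
    """Weighted Gini index of a binary split (standard CART criterion)."""
    total = len(left_labels) + len(right_labels)
    if total == 0:
        return 0.0
    left_weight = len(left_labels) / total
    right_weight = len(right_labels) / total
    return left_weight * gini_index(left_labels) + right_weight * gini_index(right_labels)


# 1 = malicious, 0 = benign (toy labels, not data from the paper)
mixed = [1, 1, 0, 0]
pure = [1, 1, 1, 1]
print(gini_index(mixed))           # highest uncertainty for two classes
print(gini_index(pure))            # no uncertainty
print(gini_split([1, 1, 1], [0]))  # a split that separates the classes perfectly
```

A feature whose split yields the smallest weighted Gini index is the preferred splitting feature under this criterion.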

Decision Tree with SVM Algorithm (DT-SVM)
To overcome the overfitting and weak generalization ability of the DT algorithm, DT-SVM is proposed. SVM is embedded into DT for node optimization, which not only preserves the high accuracy of the decision paths and improves the generalization ability of DT, but also takes advantage of SVM in small-sample training. DT-SVM aims to create a decision model as shown in Figure 2. The algorithm first generates a decision tree from the sample set and then updates the decision nodes from the bottom up.
DT is a supervised learning algorithm. The sample set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ is divided into the training set and the test set, denoted by TrainSet and TestSet.
Definition 3. Assume that the decision tree is as shown in Figure 3, where the leaf nodes are instance sets, represented by $S = \{d_1, d_2, \ldots, d_n\}$, where n is the number of leaf nodes. The nonleaf nodes form a feature set, denoted by $C = \{c_1, c_2, \ldots, c_n\}$. Each leaf node corresponds to a decision path; the decision path corresponding to leaf node j is defined as $dp_j = \{c_1, c_k, \ldots, d_j\}$, and $h = \mathrm{len}(dp_j)$ indicates the depth of the path. The details of our suggested DT-SVM algorithm for Android malware detection are presented in Steps 1 to 8 of Algorithm 1.
According to the algorithm process, assume that the initial decision tree is shown in Figure 4 and the DT-SVM tree generated by the algorithm is shown in Figure 5. The algorithm performs well in the example illustrated by Figure 6, in which the sample set cannot be effectively separated by DT or SVM alone, whereas the DT-SVM algorithm preserves the high-precision decision paths and replaces the low-precision nodes with SVM nodes.
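The bottom-up replacement idea can be sketched with scikit-learn as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the class name `DTSVMSketch`, the synthetic data, and the SVM parameters are hypothetical; the "Precision" of a path is interpreted here as the fraction of evaluation samples reaching a leaf whose label matches the leaf's prediction; and two sibling leaves are merged into one SVM node only when both fall below the threshold, otherwise the low-precision leaf alone is replaced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


class DTSVMSketch:
    """Decision tree whose low-precision leaves are handed over to SVM nodes."""

    def __init__(self, threshold=0.9, max_depth=5, min_samples_leaf=40):
        self.threshold = threshold
        self.tree = DecisionTreeClassifier(
            criterion="gini", max_depth=max_depth,
            min_samples_leaf=min_samples_leaf, random_state=0)
        self.svm_nodes = {}  # leaf id -> SVC (shared when two sibling leaves merge)

    def fit(self, X_train, y_train, X_eval, y_eval):
        self.tree.fit(X_train, y_train)  # build the prepruned tree
        t = self.tree.tree_
        parent, depth = {}, {0: 0}
        for node in range(t.node_count):  # parents always precede their children
            for child in (t.children_left[node], t.children_right[node]):
                if child != -1:
                    parent[int(child)], depth[int(child)] = node, depth[node] + 1
        # Precision of each decision path, measured on the evaluation set.
        eval_leaf, eval_pred = self.tree.apply(X_eval), self.tree.predict(X_eval)
        low = set()
        for leaf in np.unique(eval_leaf):
            hit = eval_leaf == leaf
            if np.mean(eval_pred[hit] == y_eval[hit]) < self.threshold:
                low.add(int(leaf))
        train_leaf = self.tree.apply(X_train)
        for leaf in sorted(low, key=lambda n: -depth[n]):  # deepest paths first
            if leaf in self.svm_nodes:
                continue
            group = [leaf]
            if leaf in parent:
                p = parent[leaf]
                left, right = int(t.children_left[p]), int(t.children_right[p])
                sibling = right if left == leaf else left
                if sibling in low:  # sibling is a low-precision leaf: merge both paths
                    group.append(sibling)
            rows = np.isin(train_leaf, group)
            if len(np.unique(y_train[rows])) < 2:
                continue  # an SVM needs both classes; keep the tree decision
            svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train[rows], y_train[rows])
            for member in group:
                self.svm_nodes[member] = svm
        return self

    def predict(self, X):
        leaf, pred = self.tree.apply(X), self.tree.predict(X)
        for node, svm in self.svm_nodes.items():
            rows = leaf == node
            if rows.any():
                pred[rows] = svm.predict(X[rows])  # SVM overrides the leaf decision
        return pred


# Synthetic stand-in for the n-gram feature vectors (not the paper's data).
X, y = make_classification(n_samples=2000, n_features=40, n_informative=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_eval, X_test, y_eval, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = DTSVMSketch().fit(X_train, y_train, X_eval, y_eval)
pred = model.predict(X_test)
print("SVM nodes:", sorted(model.svm_nodes), "accuracy:", np.mean(pred == y_test))
```

In this sketch, each SVM is trained only on the training samples that reach the affected leaves, which is what keeps the SVM training sets small compared with training one SVM on the whole sample set.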

Model
Overview. The DT-SVM-based malware detection framework is shown in Figure 7. The framework consists of four modules, that is, instruction extraction, feature engineering, classifier training, and result evaluation.

Sample Instruction Extraction.
First, the samples are labeled as two categories, positive and negative. Then, opcode extraction is performed for each apk. After the apk is decompressed, the core classes.dex file of the app is obtained. The classes.dex file is the executable file of the Android system, which contains all the operation instructions and data required at runtime. The dex file can be parsed by 010 Editor, and its structure is shown in Figure 8. The Methods structure contains all the methods of the app, represented by the DexMethod structure.

Algorithm 1: The detailed procedure of DT-SVM.
Step 1. According to the training set TrainSet, the Gini index is used for feature selection and prepruning, and the decision tree T is constructed.
Step 2. Use the test set TestSet to evaluate the decision tree and calculate the Precision $p_i$ of each decision path $dp_i$; then construct the decision object $do = (dp_i, p_i, h_i)$ and set the decision path accuracy threshold Th.
Step 3. Initialize the queue $Q = \{\}$, sort the decision objects generated in Step 2 in descending order of the path depth h of the decision path dp, and add them to the queue Q in that order.
Step 4. Determine whether the queue is empty. If it is, the algorithm ends. Otherwise, go to Step 5.
Step 5. Fetch the element $q = (dp, p, h)$ from the queue, and compare the Precision p of the decision path with the preset threshold Th. If it is less than the threshold, go to Step 6; otherwise, retain the decision path and go to Step 4.
Step 6. Determine whether the sibling node of q is a leaf node. If it is, go to Step 7; otherwise, go to Step 8.
Step 7. Determine whether the Precision of the path of q's sibling node is lower than the threshold Th. If it is, all the samples passing through the two decision paths (the path of q and the path of q's sibling) are taken as a training set, which is used to train an SVM model, and the two nodes are merged and updated as an SVM node; thereafter, go to Step 4.
Step 8. Take all the training samples on the path of q, train an SVM model on them, and update the node to an SVM node. Then, go to Step 4 and continue the traversal to update the remaining nodes.

... classified; then, irrelevant instructions are removed; and, finally, only eight types are left. The opcode and its corresponding identifier are shown in Table 2.
After the Dalvik instruction set is simplified, the instructions can be input to the n-gram model to generate the sample feature space. The opcodes extracted for each sample in Section 5.2 are mapped to their identifiers, and the n-gram vector is constructed. Assuming that the Dalvik instruction is ... the value of the feature is set to 1; otherwise, it is set to 0; the feature vector of the sample is finally obtained.
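The binary n-gram encoding described above can be sketched as follows. This is a minimal illustration, assuming that each opcode has already been mapped to a single-letter identifier; the eight letters in `ALPHABET` are hypothetical placeholders (the actual identifiers are those of Table 2), and the function names are introduced only for this example.

```python
from itertools import product

# Hypothetical single-letter identifiers for the eight instruction types;
# the real mapping is given in Table 2 of the paper.
ALPHABET = "CGIJMPRV"


def ngram_vocabulary(alphabet, n):
    """All possible n-grams over the identifier alphabet, in a fixed order."""
    return ["".join(letters) for letters in product(alphabet, repeat=n)]


def ngram_features(sequence, n, vocabulary):
    """Binary vector: 1 if the n-gram occurs in the sequence, otherwise 0."""
    present = {sequence[i:i + n] for i in range(len(sequence) - n + 1)}
    return [1 if gram in present else 0 for gram in vocabulary]


vocabulary = ngram_vocabulary(ALPHABET, 3)
vector = ngram_features("JRGPPCG", 3, vocabulary)
print(len(vocabulary), sum(vector))
```

With eight identifiers, a 3-gram model yields 8^3 = 512 possible features, whereas a 4-gram model yields 4096, which is consistent with the higher training cost reported for larger n.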

Evaluation Metrics.
Four metrics that are broadly used in machine learning are employed to verify the performance of our proposed algorithm, namely, Precision, Recall, classification accuracy (ACC), and F1 value. The Precision can be denoted as

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

where TP (true positive) indicates the number of Android malware samples that are correctly detected and FP (false positive) indicates the number of benign applications that are wrongly detected as Android malware [29]. In this study, Precision refers to the ratio of the correctly identified malicious samples to all the samples identified as malicious. The Recall can be formulated as

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where FN (false negative) indicates the number of Android malware samples that are not detected (predicted as benign applications) [29]. In this study, Recall reflects the proportion of the real malicious samples that are identified. The ACC can be formulated as

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TN (true negative) represents the number of benign applications that are correctly classified. ACC is an overall evaluation of the classifier, representing the proportion of all applications that are correctly classified as either benign or malicious. The higher the ACC is, the better the performance will be. F1 is the harmonic mean of Precision and Recall; it can be denoted as

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Experimental Simulation
Datasets. In the experimental simulation environment, the malicious sample set was obtained from the malicious sample database of the Drebin project of the University of Göttingen, Germany [30], which contains 5560 malware samples collected from August 2010 to October 2012. An overview of the top 20 malware families in the dataset is provided in Table 3, including several families that are currently actively distributed in application markets. There are 4414 benign samples, which were randomly selected from applications downloaded from the Google Play app store in order of ranking through the crawler module.
The tools used in the experiment include unzip, dexParser, scikit-learn, etc. Scikit-learn is a Python machine learning library that provides a variety of classification, regression, and clustering algorithms, including support vector machines, random forests, and gradient boosting.
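As an illustration of the four evaluation metrics, the following Python sketch derives them from the confusion counts TP, FP, FN, and TN. The function names and the toy labels are hypothetical (1 denotes malware, 0 denotes a benign application); returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, and TN with malware as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def detection_metrics(y_true, y_pred, positive=1):
    """Precision, Recall, ACC, and F1 computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    acc = (tp + tn) / total if total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"Precision": precision, "Recall": recall, "ACC": acc, "F1": f1}


# Toy labels, not results from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = detection_metrics(y_true, y_pred)
print(metrics)
```

In this toy case, one malware sample is missed and one benign application is flagged, so TP = 3, FP = 1, FN = 1, and TN = 3.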

Experimental Procedure.
The sample set was divided into a training set, a pseudo test set, and a test set in the ratio of 6:2:2. The training set feature vectors were input into the DT-SVM model for training. The pseudo test set was used to update the decision nodes and obtain the DT-
In order to ensure that each decision leaf node has sufficient samples for SVM training, the decision tree needs to be prepruned. In the experiment, the minimum number of samples of a leaf node, min_samples_leaf, is 40; the maximum depth of the decision tree, max_depth, is 5; and the Precision threshold is set to 0.9. The decision tree paths whose Precision is below the threshold are shown in Table 4, where the field 'Path matrix' is the binary representation of the decision path. The encoding process sorts all the nodes of the decision tree from left to right and from top to bottom and then maps them to a multidimensional vector. Each position of this vector corresponds to a decision tree node in that order, and the value indicates whether the decision path contains the node: if the value is 1, the node is included; otherwise, it is not included.
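The prepruning settings and the binary path encoding can be reproduced in spirit with scikit-learn, as sketched below. The data are synthetic stand-ins, and two points are assumptions of this sketch: scikit-learn numbers tree nodes in depth-first order, so the positions of the indicator vector differ from the left-to-right, top-to-bottom ordering described above, and the Precision of a path is interpreted as the fraction of pseudo test samples in a leaf whose label matches the leaf's prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the n-gram feature vectors (not the paper's data).
X, y = make_classification(n_samples=1500, n_features=30, n_informative=8, random_state=1)
X_train, y_train = X[:900], y[:900]            # 60% training set
X_pseudo, y_pseudo = X[900:1200], y[900:1200]  # 20% pseudo test set (last 20% held out)

# Prepruned CART tree with the settings reported in the experiment.
tree = DecisionTreeClassifier(criterion="gini", max_depth=5,
                              min_samples_leaf=40, random_state=0)
tree.fit(X_train, y_train)

leaf_of = tree.apply(X_pseudo)                           # leaf reached by each sample
pred = tree.predict(X_pseudo)
node_indicator = tree.decision_path(X_pseudo).toarray()  # shape (n_samples, n_nodes)

THRESHOLD = 0.9
low_precision_paths = []
for leaf in np.unique(leaf_of):
    rows = leaf_of == leaf
    precision = float(np.mean(pred[rows] == y_pseudo[rows]))
    if precision < THRESHOLD:
        path_vector = node_indicator[rows][0]  # identical for all samples in this leaf
        low_precision_paths.append((int(leaf), path_vector.tolist(), round(precision, 3)))

for leaf, path_vector, precision in low_precision_paths:
    print(leaf, path_vector, precision)
```

Each printed row plays the role of one row of a path table: the leaf identifier, the 0/1 path vector, and the measured path Precision.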
For the decision paths with higher error, the samples under each path are taken out for SVM training to generate SVM nodes. The Gaussian kernel function is used to process the feature space during training. Two essential parameters need to be adjusted, namely, C (the penalty factor) and gamma (the RBF kernel width parameter). In general, a larger C penalizes training errors more heavily, which leads to fewer training errors but a higher risk of overfitting; a smaller C tolerates more errors and can easily result in underfitting. Gamma is the parameter of the Gaussian kernel function. The larger the gamma is, the narrower the kernel is, so each support vector influences a smaller region and the decision boundary becomes more complex; the smaller the gamma is, the smoother and simpler the model is.
After training, the parameters of each SVM node are shown in Table 5.
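How C and gamma might be selected for a single SVM node can be sketched with a small grid search. The parameter grids, the accuracy scoring, the 5-fold cross-validation, and the synthetic data are all assumptions made for illustration; the values actually used in the experiment are those of Table 5. The sketch also checks the relation gamma = 1/(2σ²) between the Gaussian kernel of Definition 1 and the RBF kernel as parameterized in scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


# scikit-learn writes the same kernel as exp(-gamma * ||x - z||^2),
# so gamma corresponds to 1 / (2 * sigma^2).
sigma = 2.0
gamma = 1.0 / (2.0 * sigma ** 2)
x, z = [1.0, 0.0, 1.0], [0.0, 0.0, 1.0]
same = bool(np.isclose(gaussian_kernel(x, z, sigma),
                       rbf_kernel([x], [z], gamma=gamma)[0, 0]))

# Synthetic stand-in for the samples that reach one low-precision path.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)
print(same, search.best_params_)
```

Because each SVM node sees only the samples of its own path, such a search is repeated per node and can yield different C and gamma values for different nodes.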

Scenario I: The Impact of Different N-Gram Types on the Classifier. DT and SVM classifiers were trained separately using different n-gram models, and the resulting Precision values are shown in Table 6.
The results show that DT and SVM achieve good evaluation results with 3-gram and 4-gram, demonstrating the feasibility of the modeling method. When n > 3, the Precision of DT increases by only 0.7%; that of SVM increases by 2%, but training consumes much more time. SVM takes 1002.23 seconds under 4-gram and 113.65 seconds under 3-gram, so n = 3 gives the best trade-off for sample vectorization.

Scenario II: Results Comparison with Shallow Learning Algorithm. The samples were vectorized based on 3-gram, and Table 7 presents a comparison of the proposed algorithm with SVM and DT for Android malware detection. The results show that the Precision, ACC, Recall, and F1 of the DT-SVM algorithm are clearly higher than those of traditional DT and slightly higher than those of SVM. In terms of efficiency, SVM takes the longest time, whereas DT-SVM is trained by DT first, and only the small sample sets are then trained by the SVM nodes. Hence, the time is dramatically reduced compared with SVM, although it is a little longer than that of DT.

Scenario III: Results Comparison with Deep Learning Algorithm. We also compared our model with CNN [31] and LSTM [32] using the same sample set for training. The results show that the ACC and F1 of CNN are relatively high, but its other metrics are lower than those of our proposed model, which means that CNN would produce many false positives. In addition, CNN is time consuming and requires a high machine configuration. The performance of the LSTM model for Android malware detection is not as good as that of the DT-SVM algorithm, and its time consumption is 117 s higher than that of our algorithm. The results are detailed in Table 8.

Scenario IV: Comparison of DT-SVM Results with Different Sample Sizes. We randomly selected 507 samples from the 2962-sample set for the experiment. The effects of different sample sizes on the DT-SVM classifier are shown in Table 9.
The experimental results show that the sample size has a certain influence on the detection effect. As the number of samples increases, Precision, ACC, Recall, and F1 increase by 0.03. Hence, we can conclude that the larger the sample size is, the better the overall performance will be.

Analysis.
Decision tree is a prediction model that represents a mapping between object attributes and object values. Its branches classify objects according to their attributes. It is a decision tool that can help determine the strategy most likely to achieve the goal. DT is easy to understand and implement, and its advantage lies in its ability to make accurate and feasible predictions for large data sources in a short time. The basic principle of DT-SVM is to first extract high-accuracy decisions through the DT model, quickly finding the strong correlations between the results and the attributes; the kernel technique of SVM is then used to perform nonlinear prediction for the weakly correlated samples, which also exploits the advantages of SVM in small-sample prediction. Hence, the prediction accuracy is largely improved through the combination of DT and SVM. The time complexity of DT is $O(n \log n)$, and that of SVM is $O(n^3)$. DT-SVM first generates a decision tree, selects the optimal paths, and then trains SVM on the remaining samples, so its time complexity is $O(n \log n) + O(m^3)$, where n is the total number of samples and m is the number of samples that cannot be distinguished with high accuracy by the decision tree. Since $m \ll n$, the cost falls between $O(n \log n)$ and $O(n^3)$. In this experiment, a decision tree was first used to train the samples, and it can be found from Table 4 that the Precision of paths 1, 2, 3, 4, and 5 is low, indicating that DT cannot accurately separate positive and negative samples. Taking path 2 $(C_{296}, C_9, C_{120}, d_1)$ as an example, by mapping and restoring, the opcode sequence corresponding to path 2 is JRG, GPP, PCG, where JRG is a sequence that jumps and returns to obtain data, GPP is a data acquisition and storage sequence, and PCG is a data dump sequence. These sequences are often used in both positive and negative samples; therefore, DT alone cannot distinguish them effectively (the accuracy is only 57.1%). Based on this, this paper trains SVM on these undifferentiated samples, and the Precision reaches as high as 96%. In summary, the proposed algorithm improves detection accuracy, while the time consumption is relatively low.

Conclusion and Future Work
Taking the Dalvik opcodes of the samples as the research object, the n-gram model is utilized to generate the sample feature vectors, and DT-SVM is proposed. Based on the original DT, the proposed algorithm uses SVM to update the decision nodes from the bottom up. DT-SVM combines the advantages of DT and SVM and overcomes the overfitting of DT and the low accuracy of SVM on large samples. Finally, the superiority of the algorithm is demonstrated by simulation experiments, and good results are obtained in the detection of malicious Android apps.
However, in addition to the above advantages, there are some limitations to our study. This paper only performs static analysis on the samples; if a sample is hardened or obfuscated, the unzipped file will no longer be the sample's classes.dex, but the hardened executable file. The Dalvik opcodes will be virtualized, and all instructions will be executed by a hardened virtual machine. In this case, the opcodes will no longer correspond to the Dalvik instruction list, and only dynamic behavior analysis can be used for malicious code detection. In addition, the proposed DT-SVM algorithm can still be improved by, for example, using the random forest to further improve the classification ability of DT-SVM and extending the DT-SVM algorithm to a multiclass decision model.

Data Availability
The data in this paper are divided into benign samples and malicious samples. The malicious sample data that support the findings of this study are available, but restrictions apply to the availability of these data, which were used under license for the current study, and so they are not publicly available. These data are, however, available from the corresponding author upon reasonable request and with permission of the Drebin project of the University of Göttingen, Germany. The benign sample data generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.