DL-Droid: Deep learning based android malware detection using real devices

The Android operating system has been the most popular for smartphones and tablets since 2012. This popularity has led to a rapid rise of Android malware in recent years. The sophistication of Android malware obfuscation and detection avoidance methods has significantly improved, making many traditional malware detection methods obsolete. In this paper, we propose DL-Droid, a deep learning system to detect malicious Android applications through dynamic analysis using stateful input generation. Experiments performed with over 30,000 applications (benign and malware) on real devices are presented. Furthermore, experiments were also conducted to compare the detection performance and code coverage of the stateful input generation method with the commonly used stateless approach using the deep learning system. Our study reveals that DL-Droid can achieve up to a 97.8% detection rate (with dynamic features only) and a 99.6% detection rate (with dynamic + static features), which outperforms traditional machine learning techniques. Furthermore, the results highlight the significance of enhanced input generation for dynamic analysis, as DL-Droid with the state-based input generation is shown to outperform the existing state-of-the-art approaches.


Introduction
The Android operating system, which is provided by Google, is predicted to continue its dramatic growth in the market, with around 1.5 billion Android-based devices to be shipped by 2021 sta. It is currently leading the mobile OS market with over 80% market share compared to iOS, Windows, Blackberry, and Symbian OS. The availability of diverse Android markets such as Google Play, the official store, and third-party markets makes Android devices a popular target not only for legitimate developers but also for malware developers. Over one billion devices have been sold and more than 65 billion downloads have been made from Google Play (Smartphone, 0000). Android apps can be found in different categories, such as educational apps, gaming apps, social media apps, entertainment apps, banking apps, etc.
As a technology that is open source and widely adopted, Android is facing many challenges, especially with malicious applications. The malware infected apps have the ability to send text mes-from. However, according to McAfee, Google Play Protect also failed when tested against malware discovered in the previous 90 days in 2017 (McA, 0000). Furthermore, most third-party stores do not have the capability to scan and detect submitted harmful applications. Clearly, there is still a need for additional research into efficient methods to detect zero-day Android malware in the wild in order to overcome the aforementioned challenges.
Various approaches have been proposed in previous works with the intention of detecting Android malware. These approaches are categorized into static analysis, dynamic analysis, or hybrid analysis (where static and dynamic are used together). Methods based on static analysis reverse engineer the application for malicious code analysis. Arp et al. (2014), Aafer et al. (2013), Yerima et al. (2015a), Fan et al. (2017), Yerima et al. (2015b), Kang et al. (2016b), Cen et al. (2015), Westyarian et al. (2015) and Kang et al. (2016a) are a few examples of detection methods using static analysis. By contrast, dynamic analysis executes the application in a controlled environment, such as an emulator or a real device, with the purpose of tracing its behaviour. Several dynamic approaches, such as Enck et al. (2010), Alzaylaee, M.K., Yerima, S.Y., and Sezer, S. (2016), DroidBox, Rastogi et al. (2013), Tam et al. (2015), tra, and NVISO, have been proposed. However, the efficiency of these approaches relies on the ability to detect the malicious behaviour during runtime while providing the perfect environment to kick-start the malicious code.
Deep learning (DL) has gained increasing attention in the machine learning community and is re-emerging as a popular method of AI being applied in many fields (Hou et al., 2016; 2017; Karbab et al., 2017; LeCun et al., 2015; McLaughlin et al., 2017; Yuan et al., 2014; 2016). DL classifiers have inspired a great number of effective approaches in image classification, natural language processing, and speech recognition. Recently, Android malware researchers have also been exploring DL classifiers for malware analysis in order to increase detection accuracy.
Contrary to previous deep learning based dynamic detection works, this paper proposes and investigates a new system that exploits the advantages of deep learning coupled with dynamic stateful input generation, with the objective of achieving higher accuracy detection of zero-day Android malware. Furthermore, several experiments are conducted using real devices to compare the performance of the proposed DL based approach with those of popular machine learning classifiers. In summary, the main contributions of this paper are as follows:
• We present DL-Droid, a deep learning-based dynamic analysis system for Android malware detection. Unlike existing dynamic analysis systems, DL-Droid utilizes a state-based input generation approach for enhanced code coverage, thus enabling improved performance.
• Using DL-Droid, we investigate the performance of the stateful input generation approach by utilizing the state-of-the-practice stateless (random-based) input generation as a comparative baseline. Higher accuracies were obtained with the stateful approach, thus highlighting the significance of enhanced input generation for Android malware detection systems that utilize dynamic analysis.
• We present an extensive comparative study of DL-Droid with seven popular machine learning classifiers. Unlike most existing studies that are based on emulators, our experiments are conducted in a more realistic environment using real devices. Experimental results show that DL-Droid outperforms the accuracy of traditional classifiers.
The rest of the paper is structured as follows. Section 2 discusses the related work. Section 3 describes the methodology and experiments undertaken to evaluate DL-Droid. Section 4 presents detailed experimental results and discussions of these results, followed by the conclusion in Section 5.

Related work
This section discusses the related work on Android malware detection, automated test input generation for Android, and recent works on deep learning approaches. As mentioned earlier, detecting Android malware with static analysis, where the application is disassembled and examined for the presence of any malicious code, is a popular approach. Several solutions have been developed using the static approach, utilizing features such as permissions, API calls, commands, and Intents. The works in (Aafer et al., 2013; Arp et al., 2014; Cen et al., 2015; Fan et al., 2017; Yerima and Sezer, 2019; Yerima et al., 2016a; 2015a; 2015b) are examples of detection solutions based on static analysis. Although static analysis approaches enable more extensive code coverage, malware developers can use techniques such as data encryption, obfuscation, update attacks, or polymorphism to hide the malicious code in order to evade static analysis. Therefore, in this work we only extract Android permissions statically prior to each run, and then extract the API calls and Intents dynamically at run time.
The dynamic analysis approach, on the other hand, consists of running Android applications in a controlled environment such as an Android Virtual Device (AVD) emulator emu, Genymotion Gen, or a real device in order to monitor the apps' behaviour. Alzaylaee et al. (2017) showed that analysing Android applications on real phones is more effective in terms of stability and detecting more features compared to the emulator environment. Therefore, we chose to run our analysis on features extracted from real devices instead of emulators.
Automated dynamic analysis of Android apps requires streams of emulated user input events such as touches, gestures, or clicks to enable greater code coverage when run on either an emulator or a real phone. Choudhary et al. (2015) demonstrated that among the input generation tools analysed comparatively in their study (i.e. Monkey Developers (2012), Dynodroid Machiry et al. (2013), ACTEve Anand et al. (2012), A3E Azim and Neamtiu (2013), GUIRipper Amalfitano et al. (2012), SwiftHand Choi et al. (2013), and PUMA Hao et al. (2014)), Monkey performed the best in terms of code coverage. Nevertheless, the investigations in Alzaylaee et al. (2017) and Yerima et al. (2019) showed that Monkey's code coverage capability could be surpassed by a stateful approach enabled by tools such as DroidBot Li et al. (2017). The same studies have also shown that stateful input generation is more stable and robust compared to the stateless approach enabled by Monkey. Hence, the deep learning-based system DL-Droid proposed in this paper is based on the dynamic stateful input generation approach.
The difficulty of detecting Android malware manually has led researchers to explore the use of machine learning to automate and speed up the detection process. Arp et al. (2014); Dini et al. (2012); Peiravian and Zhu (2013); Rasthofer et al. (2014); Shabtai et al. (2012); Yerima et al. (2015b, 2016b) are examples of published research that apply machine learning techniques to detect zero-day Android malware. Deep learning is re-emerging as a machine learning approach that is growing in popularity in many fields, including Android malware detection. Droid-Sec Yuan et al. (2014) is one of the first frameworks that applied deep learning to classify Android malware, achieving 96.5% accuracy using 200 features extracted by means of a hybrid (static + dynamic) approach evaluated on 250 clean and 250 malware Android apps. Droid-Sec was a preliminary work for DroidDetector Yuan et al. (2016), where the authors increased the number of analysed apps to 20,000 clean and 1760 malware and achieved 96.76% accuracy. Hou et al. (2016) proposed Deep4MalDroid, an automatic Android malware detection system which dynamically extracts Linux kernel system calls using the Genymotion emulator. The best detection accuracy they reached was 93.68% on features extracted from 1500 benign and 1500 malware Android apps. Similarly, Hou et al. (2017) proposed AutoDroid (automatic Android malware detection) based on API calls extracted using static analysis. Their system was developed using different types of deep neural networks (i.e., DBN and SAEs). The best accuracy of the DBN was 95.98% based on experiments with 2500 benign and 2500 malware Android apps.
In contrast to existing deep learning based Android malware detection frameworks, the key differentiator of our proposed DL-Droid framework is its dynamic stateful input generation approach. Furthermore, our work is based on real devices rather than emulators. Moreover, we employed 420 static and dynamic features and achieved better performance than existing frameworks. To the best of our knowledge, this is the first work that extensively investigates Android malware on real devices using over 30,000 applications, and presents evaluations with different input generation methods in order to measure the impact of their code coverage capabilities on the proposed DL-based malware detection approach.
Since our approach is based on feature extraction from real devices instead of emulators, the system is inherently robust against detection avoidance techniques aimed at emulators. Dynamic extraction from real devices also enables the system to overcome the limitations of static analysis, e.g. dynamic code loading, obfuscation, data encryption, etc. It is also worth noting that some of the 420 features extracted are indicative of the malware incorporating these evasive behaviours; thus the deep learning system will be automatically equipped with the ability to learn how to detect malware with these behaviours during the training phase.

Methodology and experiments
In this section, we describe the methodology and the experiments which were conducted in order to evaluate the performance of the DL-Droid approach using real phones and two different test input generation methods: Stateless and Stateful.

Experimental setup
An automated platform is needed to run Android apps and extract their features. These features will be used as inputs for DL-Droid's deep learning based classification in order to detect Android malware. Since our aim is to investigate the performance of DL-Droid through several experiments, we utilized the DynaLog dynamic analysis framework described in Alzaylaee, M.K., Yerima, S.Y. and Sezer, S. (2016).
DynaLog is designed to accept and run a large number of Android apps automatically, launch them in sequence using either an emulator (an Android Virtual Device, "AVD") or a real phone, and log and extract several dynamic features (i.e. API calls, Actions/Events). With the dynamic analysis of Android apps, test input generation is needed in order to ensure sufficient code coverage to trigger the malicious behaviours. DynaLog is capable of utilizing different test input generation methods, including stateless (random-based, using the Monkey tool Developers (2012)), stateful (using DroidBot Li et al. (2017)), and hybrid-based (which combines stateless and stateful input generation tools, Alzaylaee et al. (2017)). The stateless approach is the most popular input generation approach and has been used extensively by researchers in this field. In fact, most existing dynamic analysis platforms for Android malware detection utilize a stateless approach (based on the Monkey tool). A previous study, Yerima et al. (2019), compared the performance of stateless, stateful and hybrid-based input generation on various machine learning classifiers. In that study, the stateful approach proved to be more robust and enabled greater code coverage than the hybrid-based input generation. Therefore, in this paper, we used only stateless (Monkey) and stateful (DroidBot) input generation for our experiments with DL-Droid.
The Monkey tool generates pseudo-random streams of events using a pseudo-random number generator that is controlled by a seed. A pseudo-random stream of events is still a random approach, since the event selection is not based on a pre-determined pattern, even though it is configurable. It is important to note that random here refers to the selection of the next event to be executed, i.e. no specific pattern is followed, unlike in the stateful approach, where the event to be executed is chosen based on an evaluation of the current state (i.e. the state of the user interface at a particular time). Monkey is based on a stateless approach, and this is the most important difference that distinguishes it from the stateful DroidBot. We propose the stateful approach as the default component of the DL-Droid framework, while we utilize the stateless method as a baseline for comparative analysis in this paper.
Alzaylaee et al. (2017) have shown that dynamic analysis done on real devices is more efficient than using emulators. Thus, our experiments are completely based on real phones. Eight different phone brands were used with the following configurations: Android 6.0 "Marshmallow", 4 GB RAM, 2.6 GHz CPU, 32 GB ROM and 32 GB of external SD card storage. Each smartphone processed an average of 100 apps daily. The SD cards contained a folder full of different resources such as pictures, text files, videos, sound files, etc. to simulate a typical phone. Moreover, each phone was equipped with a SIM card containing call credits to enable 3G data usage, send text messages, and even make phone calls when requested. The phones were also connected to an internal WiFi service in order to enable tested applications to connect with their external servers when necessary.
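The difference between stateless and stateful event selection discussed in this subsection can be illustrated with a toy sketch. It is only an illustration under stated assumptions: the `APP` transition table, the event names, and the breadth-first exploration strategy are hypothetical and do not reproduce the actual algorithms of Monkey or DroidBot. The stateless runner draws each event from a seeded pseudo-random generator without looking at the current screen, whereas the stateful runner inspects the current state and heads for the nearest event it has not yet exercised.

```python
import random
from collections import deque

# Hypothetical toy app: UI state -> {event: next state}. Not taken from the paper.
APP = {
    "main": {"open_settings": "settings", "open_login": "login"},
    "settings": {"back": "main", "toggle_sync": "settings"},
    "login": {"back": "main", "submit": "home"},
    "home": {"send_sms": "home", "back": "main"},
}
ALL_EVENTS = sorted({e for events in APP.values() for e in events})
ALL_TRANSITIONS = {(s, e) for s, events in APP.items() for e in events}


def run_stateless(n_events, seed):
    """Pick every event at random (seeded), ignoring the current UI state."""
    rng = random.Random(seed)
    state, covered = "main", set()
    for _ in range(n_events):
        event = rng.choice(ALL_EVENTS)
        if event in APP[state]:  # events that do not exist on this screen are wasted
            covered.add((state, event))
            state = APP[state][event]
    return covered


def path_to_unexplored(state, covered):
    """Breadth-first search for the shortest event sequence ending in an untried event."""
    queue, seen = deque([(state, [])]), {state}
    while queue:
        current, path = queue.popleft()
        for event in sorted(APP[current]):
            if (current, event) not in covered:
                return path + [event]
        for event in sorted(APP[current]):
            nxt = APP[current][event]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [event]))
    return []


def run_stateful(n_events):
    """Choose each event based on the current state and what is still unexplored."""
    state, covered, plan = "main", set(), []
    for _ in range(n_events):
        if not plan:
            plan = path_to_unexplored(state, covered)
        if not plan:
            break  # every reachable transition has been exercised
        event = plan.pop(0)
        covered.add((state, event))
        state = APP[state][event]
    return covered


if __name__ == "__main__":
    print("stateless coverage:", len(run_stateless(13, seed=1)), "of", len(ALL_TRANSITIONS))
    print("stateful coverage: ", len(run_stateful(13)), "of", len(ALL_TRANSITIONS))
```

On this toy graph the stateful runner is guaranteed to exercise every transition within a small event budget, while the stateless runner's coverage depends on the seed.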
The executed runtime was different on each run and determined by the chosen test input generation. The required timing was confirmed after evaluation with several apps to determine how much time was needed to trigger every possible event using either the stateful tool (DroidBot) or the stateless tool (Monkey). For the stateful method, 180 s was found to be sufficient. For the stateless generation using Monkey, 300 s was enough to generate 4000 events for the apps. Beyond 4000 events, most apps did not generate any further dynamic output from DynaLog. The overview of the DL-Droid process using DynaLog as well as the DL classifier engine is shown in Fig. 1.
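As a rough sketch of how a stateless run with these settings might be launched, the helper below assembles a Monkey command line for `adb`. The package name, seed value, and the wrapper functions themselves are hypothetical; only the 4000-event budget and the 300 s limit come from the text, and this is not how DynaLog necessarily invokes Monkey. The flags used (`-p`, `-s`, `--throttle`, `-v`) are standard Monkey options.

```python
import subprocess


def monkey_command(package, events=4000, seed=None, throttle_ms=None, serial=None):
    """Build an `adb shell monkey` command (illustrative wrapper, not part of DynaLog)."""
    cmd = ["adb"]
    if serial:
        cmd += ["-s", serial]  # select a specific connected device
    cmd += ["shell", "monkey", "-p", package]
    if seed is not None:
        cmd += ["-s", str(seed)]  # same seed -> same pseudo-random event stream
    if throttle_ms is not None:
        cmd += ["--throttle", str(throttle_ms)]  # delay between events, in ms
    cmd += ["-v", str(events)]  # the event count is the final positional argument
    return cmd


def run_monkey(package, time_limit_s=300, **kwargs):
    """Run Monkey and stop waiting after the time limit (requires adb and a device)."""
    try:
        return subprocess.run(monkey_command(package, **kwargs),
                              timeout=time_limit_s, check=False)
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # "com.example.app" is a placeholder package name.
    print(" ".join(monkey_command("com.example.app", seed=7)))
```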

Dataset
For the purpose of evaluating the accuracy of DL-Droid and comparing it with other popular machine learning classifiers, we used a dataset consisting of 31,125 Android applications. Out of these, 11,505 were malware samples, while the remaining 19,620 were internally vetted benign samples obtained from Intel Security (McAfee Labs). These samples consist of a variety of app formats, including paid apps, powerful utility apps, banking apps, media player apps, and popular game apps. The samples are available to other researchers on request.

Features extraction and preprocessing
In the feature extraction phase, each application is installed and run on one of the eight phones using DynaLog (Alzaylaee, M.K., Yerima, S.Y. and Sezer, S., 2016). Once completed for each of the two scenarios, Stateless and Stateful, the logged features are preprocessed into text files of feature vectors representing the features extracted from each application. These text files were further processed into a single .csv file for each scenario with the purpose of evaluating the detection performance using deep learning. The .csv format is accepted by both H2O Flow and WEKA, which were used later in the experiments. Note that each feature in the .csv file is binary, containing either '0' or '1' to represent the absence or presence of the extracted feature.
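The preprocessing step described above can be sketched as follows. This is a minimal illustration, not DynaLog's actual code: the function names, the `class` label column, and the three example features are assumptions made for the sketch. Each app's logged features are mapped to a 0/1 vector over a fixed feature list and written as one row of a CSV file.

```python
import csv
import io


def to_binary_row(observed, feature_names):
    """Return 1 where the feature was logged for the app and 0 where it was not."""
    return [1 if name in observed else 0 for name in feature_names]


def write_feature_csv(apps, feature_names, out):
    """Write one header row, then one binary feature row plus label per app."""
    writer = csv.writer(out)
    writer.writerow(list(feature_names) + ["class"])  # label column name is an assumption
    for observed, label in apps:
        writer.writerow(to_binary_row(observed, feature_names) + [label])


if __name__ == "__main__":
    features = [
        "TelephonyManager;->getDeviceId",
        "action.SMS_RECEIVED",
        "android.permission.SEND_SMS",
    ]
    apps = [
        ({"TelephonyManager;->getDeviceId", "android.permission.SEND_SMS"}, "malware"),
        (set(), "benign"),
    ]
    buffer = io.StringIO()
    write_feature_csv(apps, features, buffer)
    print(buffer.getvalue())
```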
Originally, DynaLog was equipped to extract 178 features dynamically (i.e. API calls and intents: Actions/Events). These features were ranked using InfoGain (information gain, as provided by WEKA), and then the top 120 features were selected for the experiments. DynaLog was extended to enable extracting Android permissions statically prior to each dynamic run. This step allows us to test the detection performance of DL-Droid with additional features.
Over 300 Android permissions, which govern access to different device hardware and system resources, were used by the investigated Android apps. These permissions are considered as either normal, signature, or dangerous permissions. This step allowed us to collect the most relevant permissions, some of which were relatively new and had not been used in previous works. Hence, as shown in Table 1, we obtained a total of 420 features from our feature extraction phase. The 420 extracted features were ranked using the information gain (InfoGain) feature ranking algorithm in WEKA. The top 20 ranked dynamic features (including and excluding the extracted permissions) based on InfoGain in both test input scenarios (stateless and stateful) are shown in Tables 2-5 respectively.
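To make the ranking criterion concrete, the sketch below computes information gain for binary features as the class entropy minus the class entropy conditioned on the feature, and sorts features by that score. It is a plain-Python illustration of the standard definition rather than WEKA's implementation, and the feature names and toy data in the usage example are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(feature_values, labels):
    """Information gain: H(class) - H(class | feature)."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for val, lab in zip(feature_values, labels) if val == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_features(columns, labels):
    """Return (feature name, score) pairs sorted by decreasing information gain."""
    scores = {name: info_gain(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    labels = ["malware", "malware", "benign", "benign"]
    columns = {
        "feature_a": [1, 1, 0, 0],  # separates the classes perfectly
        "feature_b": [1, 0, 1, 0],  # carries no class information
    }
    for name, score in rank_features(columns, labels):
        print(f"{name}: {score:.3f}")
```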

Features ranking comparisons
From Tables 2-5, it is interesting to note that the API call methods getDeviceId, getSubscriberId, getLine1Number, and getSimSerialNumber from the TelephonyManager class, which provides access to information about the telephony services on the device, were among the top 20. However, the InfoGain score is higher for these features when extracted using DroidBot-based stateful input generation. For example, the InfoGain score of the feature TelephonyManager;->getDeviceId is 0.099 in Tables 2 and 3 using DroidBot based stateful input generation, whereas the score for the same feature using stateless Monkey based random input generation is 0.075 in Tables 4 and 5.
Similar findings can be seen with the feature TelephonyManager;->getSubscriberId, which scored 0.057 using DroidBot based input generation, while the score is 0.042 for Monkey based input generation. The feature action.SMS_RECEIVED scores 0.096 for the DroidBot based generation in Table 2, which is higher than the score for the same feature extracted using Monkey based generation in Table 4. Hence, this indicates that the stateful DroidBot based input generation method has triggered more behaviours than the stateless Monkey based random input generation. Note that most existing dynamic analysis systems on Android utilize the Monkey tool for input event generation.

Investigating Deep Learning Classifier vs. other popular machine learning algorithms
Our main goal is to build a model for DL-Droid to enable accurate classification and detection of Android malware from benign apps. In our experiments, we train our deep learning classifiers on a classification problem with two labels, benign or malicious. We utilize H2O, which currently supports only the Multilayer Perceptron (MLP) classifier Candel et al. (2016). A confusion matrix is computed in our system to evaluate the effectiveness of the different classifiers. The second phase of the experiments compared the performance between the proposed DL and seven popular machine learning approaches proposed in the literature. The classifiers include: Support Vector Machine (SVM Linear), Support Vector Machine with radial basis function kernel (SVM RBF), Naive Bayes (NB), Simple Logistic (SL), Partial Decision Trees (PART), Random Forest (RF), and J48 Decision Tree. We also investigated the performance of each classifier for two different test input generation methods. The results of our experiments are presented in Section 4 using the following performance metrics. The true positive ratio (TPR), also known as recall, the true negative ratio (TNR), the false positive ratio (FPR), the false negative ratio (FNR), and precision are defined as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

$$\mathrm{TNR} = \frac{TN}{TN + FP}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$

$$\mathrm{FNR} = \frac{FN}{FN + TP}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

where TP denotes the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. FM is the F-measure, calculated for both the malware and benign classes. The combined measure known as weighted FM (w-FM) is defined as follows:

$$\text{w-FM} = \frac{F_b \times N_b + F_m \times N_m}{N_b + N_m}$$

where $F_b$ and $F_m$ are the FM of the benign and malware datasets respectively, whereas $N_b$ and $N_m$ are the number of samples in the benign and malware datasets respectively. The 10-fold cross validation approach was used in all of the presented experiments.
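As a worked illustration of these metrics, the sketch below derives them from the four confusion-matrix counts, treating malware as the positive class. The function names and the example counts are hypothetical, the F-measure uses the standard harmonic mean of precision and recall, and non-zero denominators are assumed throughout.

```python
def rates(tp, tn, fp, fn):
    """Return TPR, TNR, FPR, FNR and precision from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "precision": tp / (tp + fp),
    }


def f_measure(precision, recall):
    """Standard F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weighted_fm(f_b, n_b, f_m, n_m):
    """Weighted F-measure over the benign (b) and malware (m) classes."""
    return (f_b * n_b + f_m * n_m) / (n_b + n_m)


def w_fm_from_counts(tp, tn, fp, fn):
    """Compute w-FM with malware as the positive class in the counts."""
    f_m = f_measure(tp / (tp + fp), tp / (tp + fn))
    # For the benign class, the roles of the counts are swapped.
    f_b = f_measure(tn / (tn + fn), tn / (tn + fp))
    return weighted_fm(f_b, tn + fp, f_m, tp + fn)


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    print(rates(tp=90, tn=80, fp=20, fn=10))
    print(round(w_fm_from_counts(tp=90, tn=80, fp=20, fn=10), 4))
```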

DL comparisons with dynamic features: Stateful vs. Stateless input generation
Table 6 depicts the results of experiments undertaken to evaluate the performance of the DL approach with different combinations of hidden layers. The results shown here are for the dynamic features only, using the stateful DroidBot input generation tool. Twenty-two different combinations of hidden neurons, containing two, three, and four layers, were applied in order to determine the best possible performance based on the w-FM. In Table 6, the results show that the 200, 200, 200 combination performs the best when compared to other combinations, with a running time of nine minutes. We can see that DL can achieve a w-FM of 0.963 when setting the number of layers to 3 and selecting 200 neurons in each layer with dynamic features only.
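For readers who want to try a comparable configuration, the sketch below shows one way to train a three-layer, 200-neuron network with 10-fold cross validation through H2O's Python API. This is an assumption-laden illustration: the paper's experiments used H2O Flow, the CSV path and the `class` label column are hypothetical, and every hyperparameter other than the hidden-layer sizes and the fold count is left at H2O's defaults because the text does not state them.

```python
def dl_params(hidden=(200, 200, 200), nfolds=10, seed=42):
    """Collect estimator parameters; the seed value is an arbitrary choice."""
    return {"hidden": list(hidden), "nfolds": nfolds, "seed": seed}


def train_dl(csv_path, label_col="class", **overrides):
    """Train an H2O deep learning model on a binary feature CSV.

    Requires the `h2o` package and a Java runtime; imported lazily so the
    parameter helper above can be used without them.
    """
    import h2o
    from h2o.estimators import H2ODeepLearningEstimator

    h2o.init()
    frame = h2o.import_file(csv_path)
    frame[label_col] = frame[label_col].asfactor()  # treat the label as categorical
    features = [col for col in frame.columns if col != label_col]

    model = H2ODeepLearningEstimator(**dl_params(**overrides))
    model.train(x=features, y=label_col, training_frame=frame)
    return model


if __name__ == "__main__":
    print(dl_params())
    # model = train_dl("stateful_dynamic_features.csv")  # hypothetical file name
```

Other layer combinations can be tried by passing, for example, `hidden=(300, 100, 300)` to `train_dl`.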
We repeated the same experiments on the dynamic features extracted using the stateless Monkey based random input generation tool in order to compare the results with the previous scenario. Table 8 shows the results obtained. The best w-FM is also recorded with three layers, similar to the previous scenario. However, this is obtained with a different combination of neurons. The number of neurons in each layer is 300, 100, 300 respectively for the best w-FM of 0.958. Even though the running time is 8 minutes, which is less by almost one minute, our focus has been the detection accuracy. Therefore, from Tables 6 and 8, we can confirm that DL-Droid achieves its best performance with the features obtained from the use of the stateful input generation approach.

DL comparisons with dynamic features and static features: Stateless vs. Stateful input generation
The same experiments outlined in the previous section were repeated with the addition of static features, i.e. permissions, and the results are shown in Table 7. We can see that the same combination of 200 neurons in each hidden layer with three hidden layers is superior to the other deep networks for Android malware detection using the stateful input generation approach. The w-FM reached approximately 0.99.

Comparison of the performance of the Deep Learning Classifier with other popular machine learning classifiers
In this section, we compare the detection accuracies of the proposed DL approach with the most popular machine learning algorithms, as shown in Tables 10 and 11. Overall, seven machine learning algorithms were selected based on the results of several preliminary experiments that had been conducted. From the tables, we can clearly see that the proposed DL approach outperforms the other machine learning algorithms. In Table 10, where results from only dynamic features are presented, the second highest w-FM of 0.94 is achieved by the Random Forest algorithm, while that of the deep learning approach is 0.963. When we perform further comparison by adding permissions to the analysis (Table 11), the DL approach still topped the rest with a w-FM of nearly 0.99, while the next highest, which is again Random Forest, achieved a w-FM of 0.97. We can clearly observe that the addition of static features, i.e. permissions, improved the detection accuracy of DL-Droid.
Fig. 2 presents the results of the comparison between the two input generation methods, i.e. stateful (using DroidBot) and stateless (using Monkey). Fig. 2 shows the w-FM results for DL-Droid as well as the selected seven popular machine learning algorithms.
In the experiments with dynamic features only, all classifiers except NB and J48 performed better when stateful input generation with DroidBot was utilized, compared to the stateless approach using Monkey. However, in the experiment with combined static and dynamic features, the stateful input generation approach was superior for all the classifiers. With these results depicted in Fig. 2, we can conclude that DL-Droid with stateful input generation (our initially proposed approach) achieves the best detection accuracy.

Results comparison with existing work
Table 12 presents a comparison of DL-Droid performance with other existing deep learning based methods for Android malware detection. DroidDetector's static and dynamic based deep learning method achieved 96.76% accuracy compared to DL-Droid, which has 98.5% accuracy. DL-Droid outperformed DroidDetector (Yuan et al., 2016) in all other metrics, while utilizing more samples for the experiments. DL-Droid also outperforms Maldozer (Karbab et al., 2017), Deep4MalDroid (Hou et al., 2016), AutoDroid (Hou et al., 2017) and the CNN approach presented in (McLaughlin et al., 2017). It is interesting to note that, just like in Deep4MalDroid and AutoDroid, the optimum number of hidden layers for DL-Droid is three.

Conclusion
In this paper, we presented DL-Droid, an automated dynamic analysis framework for Android malware detection. DL-Droid employs deep learning with a state-based input generation approach as the default method, although it has the capability to employ the state-of-the-practice popular Monkey tool (stateless method). We evaluated DL-Droid using 31,125 Android applications and 420 static and dynamic features, comparing its performance to traditional machine learning classifiers as well as existing DL-based frameworks. The presented results clearly demonstrate that DL-Droid achieved high accuracy performance, reaching better figures than those presented in existing deep learning-based Android malware detection frameworks. To the best of our knowledge, this is the first work to investigate deep learning using dynamic features extracted from apps using real phones. Our results also highlight the significance of enhancing input generation for dynamic analysis systems that are designed to detect Android malware using machine learning. As future work, self-adaptation, such as that introduced and investigated recently for intrusion detection systems (Papamartzivanos et al., 2019), could be explored as a means of improving the performance of the deep learning based system for Android malware detection.

Table 1
Total number of the extracted features used in the experiments.

Table 2
Top 20 Ranked Features based on InfoGain using Stateful input generation DroidBot (Permissions excluded).

Table 5
Top 20 Ranked Features based on InfoGain using stateless Monkey based input generation (Permissions included).

Table 6
Deep learning results with different combinations of hidden layers (with the use of stateful input generation and dynamic features only).

Table 7
Deep learning results with different combinations of hidden layers (with the use of stateful input generation and static + dynamic features).

Table 8
Deep learning results with different combinations of hidden layers (with the use of stateless input generation and dynamic features only).

Table 9
Deep learning results with different combinations of hidden layers (with the use of stateless input generation and static + dynamic features).

Table 10
Results for DL and seven machine learning algorithms (with stateful input generation and dynamic features only).

Table 11
Results for DL and seven machine learning algorithms (with stateful input generation and static + dynamic features).

Table 12
Comparisons of DL-Droid with other existing deep learning approaches.