Detecting Android Malware by Analyzing Manifest Files

Abstract: The threat of Android malware has grown with the increasing popularity of smartphones. Once an Android smartphone is infected with malware, the user can suffer various kinds of damage, such as the theft of personal information stored on the smartphone, the sending of short messages to premium-rate numbers without the user's knowledge, and remote operation of the infected smartphone for use in other malicious attacks. However, defense mechanisms against Android malware are currently insufficient. This study proposes a new method to detect Android malware. The new method analyzes only the manifest file, which is required in every Android application. It realizes a lightweight approach to detection, and its effectiveness is experimentally confirmed with real samples of Android malware. The results show that the new method can effectively detect Android malware, even when the sample is unknown.


Introduction
With the rapid entry of smartphones into daily life, Android malware has been spreading rapidly. The Android operating system (OS) is an easy target for attackers, because the market share of Android has increased and many Android applications are written in the Java programming language.
According to a global survey of the smartphone OS market, Android held 68.8% of the market share in 2012 [1], implying that the popularity of Android has grown significantly. The large number of Android phones makes it easy for malware to spread. Moreover, Android applications are easy targets for reverse engineering, which is a characteristic of Java applications in general. Malicious attackers often abuse this characteristic to embed malicious programs into benign applications, thereby creating subspecies of existing malware. Yajin et al. [2] illustrated that 86.0% of Android malware is created by conversion from benign applications. Hence, Android is considered to be an easy target for malicious attackers, and the privacy and integrity of the user's data are seriously threatened.
There have been numerous studies focusing on the detection of Android malware. One popular approach is the signature-based method, which extracts signatures from malware samples. While it is effective for detecting known malware, it is inadequate for detecting unknown malware. Iland et al. [3] suggested a detection method at the network level. They observed network traffic originating from a sample application and tried to detect malware by comparing the traffic with DNS-based and IP-address-based blacklists. This method cannot detect unknown malware, because the blacklists are generated from known malicious activities. Isohara et al. [4] presented a method for detecting malware by analyzing attributes of files within sample applications. Although this approach can detect some unknown malware that is missed by blacklist or signature-based methods, the analysis cost depends on the number of files within the sample application. Enck et al. [5] proposed a lightweight method to block the installation of applications that have dangerous combinations of permissions or intent filters (a mechanism for realizing cooperation between Android applications). However, the method may lead to incorrect detections, because the information it uses is not sufficient to differentiate malware from benign applications. Wu et al. [6] developed DroidMat, a system that provides a static analysis paradigm for detecting malware. They obtained distinguishing characteristics such as permissions, components (essential functions such as Activity, Service, and Receiver), and API calls by analyzing manifest files and smali files (disassembled code). This system can discriminate between malware and benign applications. However, the cost of the analysis depends on the size and number of smali files.
Our preliminary study measured the average file size and the number of files for the main resources in Android applications. We investigated 30 benign samples and 30 malware samples. Tables 1 and 2 show the results. From Tables 1 and 2, we can observe that the cost of analyzing smali files is higher than that of analyzing the manifest file.
This study proposes a new method for detecting Android malware by analyzing only manifest files. Each Android application must have a manifest file, which presents essential information about the application. Our preliminary investigations revealed that there are certain differences between the manifest files of benign applications and those of malware. The proposed method is based on an analysis of the characteristics of Android manifest files and is effective for detecting both well-known existing malware and unknown malware. Moreover, the cost is low, because the method analyzes only a manifest file, which, as Tables 1 and 2 show, is usually small.
The remainder of this paper is organized as follows: Section 2 proposes our new method. Section 3 describes the experiment used to demonstrate the effectiveness of the new detection method. Section 4 concludes this paper and discusses possible future extensions of the proposed method.

New Method for Detecting Android Malware
This study proposes a new method for detecting Android malware by analyzing only manifest files. An Android application consists of the following resources: a manifest file, application programs for the Dalvik virtual machine (VM), and application resources. Figure 1 shows an Android application package (.apk), which includes a manifest file.

Figure 1. Android application package (.apk).
The manifest file is named "AndroidManifest.xml" and must be present in every Android application. Application programs are collected in "classes.dex." Application resources consist of pictures, music, and XML files that describe layout information.
Android malware is detected by the following steps:
[Step 1] Extract specific information described in the manifest file of a sample application.
[Step 2] Compare the information extracted in Step 1 with the keyword lists provided by the new method, and calculate the malignancy score of the sample from the result of the comparison.
[Step 3] Compare the malignancy score from Step 2 with the threshold values set by the new method. If the malignancy score exceeds the threshold value, the sample is judged to be malware.
Figure 2 shows the flow of the new detection method.

Extract information items
Manifest files contain essential information about Android applications, such as the version number of the application, the name of the package, the required permissions, and the API level. The format of the manifest file is identical for benign applications and malware. However, there are certain differences in the characteristics of several information items. We investigated 30 benign samples and 30 malware samples, giving a total of 60 samples, and selected the information items whose contents differ markedly between malware and benign applications. Table 3 shows the six information items that are extracted from manifest files and used by the proposed method to detect Android malware. The items are represented as text strings or numbers.
Table 3. List of information items.
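As an illustration of Step 1, the following is a minimal sketch of how the six information items might be read from a manifest file. It assumes that the manifest has already been decoded to plain-text XML (inside an .apk it is stored in a binary XML format), and it assumes that "redefined permission" corresponds to `<permission>` declarations; that interpretation, the function names, and the sample manifest are hypothetical and are not taken from the original method.

```python
import xml.etree.ElementTree as ET

# Namespace used for android:* attributes in manifest files.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def _attr(element, name):
    """Return the android:<name> attribute of an element, or None."""
    return element.get(f"{{{ANDROID_NS}}}{name}")


def extract_items(manifest_xml):
    """Extract the six information items from decoded manifest text."""
    root = ET.fromstring(manifest_xml)
    priorities = []
    for intent_filter in root.iter("intent-filter"):
        value = _attr(intent_filter, "priority")
        if value is not None:
            # Assumes a plain decimal priority value.
            priorities.append(int(value))
    return {
        "permission": [_attr(e, "name") for e in root.iter("uses-permission")],
        "action": [_attr(e, "name") for e in root.iter("action")],
        "category": [_attr(e, "name") for e in root.iter("category")],
        "process": [_attr(e, "process") for e in root.iter()
                    if _attr(e, "process")],
        "priority": priorities,
        # Assumption: <permission> declarations are the redefined permissions.
        "redefined_permissions": len(list(root.iter("permission"))),
    }


# Hypothetical manifest used only to exercise the sketch.
SAMPLE = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
  <permission android:name="android.permission.INTERNET"/>
  <application android:process=":remote">
    <receiver android:name=".SmsReceiver">
      <intent-filter android:priority="1000">
        <action android:name="android.provider.Telephony.SMS_RECEIVED"/>
        <category android:name="android.intent.category.DEFAULT"/>
      </intent-filter>
    </receiver>
  </application>
</manifest>"""

items = extract_items(SAMPLE)
```

Only the standard library is used, which fits the lightweight aim of the method; a real deployment would also need a decoder for the binary manifest format.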

Keyword lists and malignancy score
With this new method, several keyword lists are compiled. Strings in manifest files that are characteristic of benign or malicious applications are recorded in the keyword lists. We make four types of keyword lists: (1) Permission, (2) Intent filter (action), (3) Intent filter (category), and (4) Process name. Because (5) Intent filter (priority) and (6) Number of redefined permission are represented by integers rather than text strings, we have no keyword lists for them.
Figure 3 shows the number of occurrences of popular keywords classified as "Permission" items in 30 benign samples and 30 malware samples. The figure shows that permissions related to the short message service (SMS), such as SEND_SMS, RECEIVE_SMS, and READ_SMS, are frequently used by malware samples. These permissions are registered in the keyword list as malicious strings. A similar process is performed for (2) Intent filter (action), (3) Intent filter (category), and (4) Process name. The resulting four keyword lists are shown in Table 4. Most keywords are considered to be malicious, while some are classified as benign.
After obtaining the keyword lists, we calculate the malignancy score for each of the above four information items. This is done by classifying each keyword found in the sample as benign or malicious. The malignancy score is calculated by formula (1).
where P: malignancy score, M: number of malicious strings, B: number of benign strings, E: number of total information items. Table 5 shows an example. This sample uses five permission items.
Table 5. Permissions keywords in a sample.
Similar calculations are also performed for (2) Intent filter (action), (3) Intent filter (category), and (4) Process name. With regard to (5) Intent filter (priority), the set-up value is counted and used for the judgment in Step 3. (6) Number of redefined permission is also counted and considered.
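The score calculation can be sketched as follows. Formula (1) itself is not reproduced in the text, so the sketch assumes the form P = (M - B) / E, which is consistent with the definitions of P, M, B, and E and with the fact that the score can become negative. The benign keyword and the five permissions of the sample are hypothetical placeholders rather than entries from Tables 4 and 5; only the three SMS permissions are named as malicious in the text.

```python
def malignancy_score(items, malicious, benign):
    """Return P = (M - B) / E for one category of information items.

    Assumed form of formula (1): M and B count the strings found in the
    malicious and benign keyword lists, and E is the number of strings
    of this category in the sample.
    """
    if not items:
        return 0.0  # Assumption: a category with no entries scores neutral.
    m = sum(1 for s in items if s in malicious)
    b = sum(1 for s in items if s in benign)
    return (m - b) / len(items)


# The SMS permissions are named as malicious strings in the text.
MALICIOUS = {"SEND_SMS", "RECEIVE_SMS", "READ_SMS"}
# Hypothetical benign keyword, for illustration only.
BENIGN = {"EXAMPLE_BENIGN_PERMISSION"}

# Hypothetical sample with five permission items: M = 2, B = 1, E = 5.
sample = ["SEND_SMS", "READ_SMS", "INTERNET",
          "EXAMPLE_BENIGN_PERMISSION", "VIBRATE"]
score = malignancy_score(sample, MALICIOUS, BENIGN)
```

Under this assumed form the score lies between -1 and 1, and strings that appear in neither list only enlarge the denominator.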

Thresholds and judgment
The proposed method provides threshold values for the malignancy score. We use a data mining tool, Weka [7], to determine the threshold values. For the four categories of information items (1), (2), (3), and (4), the threshold values are set using the Weka J48 algorithm, which is based on a decision tree. We use both benign samples and malicious samples for machine learning. The specific samples are explained in Section 3. For items (5) and (6), we set the threshold value at 1000 for (5) and 3 for (6), based on the result of our preliminary analysis described in Section 2.1.
Judgment of an application sample is performed on the basis of conditions 1 and 2 and formula (3), which are given below. Condition 1 describes the characteristics of malware. Condition 2 is introduced to avoid incorrect judgments. In formula (3), SCORE refers to the final malignancy score of the sample. C1 and C2 count the number of items satisfied by a sample in condition 1 and condition 2, respectively.
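To give an idea of how a threshold can be learned from labeled scores, the following sketch searches for the single cut-off that classifies the most learning samples correctly. This is a one-level decision stump and is only a simplified stand-in: J48 is Weka's implementation of the C4.5 decision tree, which chooses splits by information gain ratio and may build deeper trees. The scores and labels are hypothetical.

```python
def best_threshold(scores, labels):
    """Return the cut-off that best separates malware from benign samples.

    labels[i] is True for malware. A sample is predicted to be malware
    when its score is greater than the cut-off. Returns None when all
    scores are identical.
    """
    values = sorted(set(scores))
    # Candidate cut-offs are midpoints between neighbouring distinct scores.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best, best_correct = None, -1
    for t in candidates:
        correct = sum((s > t) == y for s, y in zip(scores, labels))
        if correct > best_correct:
            best, best_correct = t, correct
    return best


# Hypothetical malignancy scores of learning samples for one category.
scores = [-0.2, 0.0, 0.1, 0.4, 0.6, 0.8]
labels = [False, False, False, True, True, True]
threshold = best_threshold(scores, labels)
```

A stump of this kind yields one threshold per information item, which matches how the thresholds are used in the judgment step.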
If the final SCORE is greater than or equal to 1, the sample application is considered to be malware.
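The judgment can be sketched as follows. Formula (3) is not reproduced in the text, so the sketch assumes SCORE = C1 - C2, which matches the description of C1 and C2 and the role of condition 2 in reducing incorrect judgments. It also assumes that condition 1 is checked once for each of the four scored items and that the largest priority value is compared with the threshold. The thresholds 1000 and 3 come from the text; the per-item thresholds and the sample values are hypothetical.

```python
PRIORITY_THRESHOLD = 1000  # threshold for item (5), from the text
REDEFINED_THRESHOLD = 3    # threshold for item (6), from the text


def judge(scores, thresholds, priority, redefined):
    """Return (SCORE, is_malware) for one sample.

    scores and thresholds are dicts keyed by 'permission', 'action',
    'category', and 'process'. Assumed formula (3): SCORE = C1 - C2.
    """
    # Condition 1: items whose value exceeds the corresponding threshold.
    c1 = sum(scores[k] > thresholds[k] for k in thresholds)
    c1 += priority > PRIORITY_THRESHOLD
    c1 += redefined > REDEFINED_THRESHOLD
    # Condition 2: negative scores for intent filter action and category.
    c2 = (scores["action"] < 0) + (scores["category"] < 0)
    score = c1 - c2
    return score, score >= 1


# Hypothetical per-item thresholds and sample values.
thresholds = {"permission": 0.3, "action": 0.3, "category": 0.3, "process": 0.3}
sample_scores = {"permission": 0.4, "action": 0.5, "category": -0.5, "process": 0.0}
result = judge(sample_scores, thresholds, priority=1000, redefined=0)
```

In this example two items satisfy condition 1 and one satisfies condition 2, so the assumed SCORE is 1 and the sample is judged to be malware.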

Experiment
To evaluate the performance of the proposed method, we conducted the following experiment with Android application samples.

Overview of the experiment
We collected 235 benign application samples and 130 malware samples. Benign samples were collected from Google Play [8] and some related markets. Malware samples were obtained from a web site that provides samples for research purposes [9]. All samples have a unique MD5 hash value and are classified into two groups: "Learning data" and "Test data." Learning data is used by Weka to determine suitable threshold values, and the keyword lists are the same as in Table 4. Test data is used to evaluate the proposed method. In this experiment, the samples are first analyzed by VirusTotal [10], which is an online scanning tool for malware. We classified a malware sample as Learning data if its first registered date is before September 2011. The remaining malware samples are used as Test data. This date was selected so that a sufficient number of malware samples is available for both learning and testing. We can therefore treat malicious Learning data as known samples and malicious Test data as unknown samples. Note that the malicious Test data includes samples that are not detected by signature-based methods. Benign Learning data and Test data were randomly selected from the collected benign application samples. Table 6 shows the number of samples used in this experiment.
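The date-based split of malware samples can be sketched as follows. The record layout (an MD5 string paired with a first registered date) and the sample records are hypothetical; only the cut-off of September 2011 comes from the text.

```python
from datetime import date

# Samples first registered before September 2011 become Learning data.
CUTOFF = date(2011, 9, 1)


def split_malware(samples):
    """Split (md5, first_registered) records into (learning, test)."""
    learning = [s for s in samples if s[1] < CUTOFF]
    test = [s for s in samples if s[1] >= CUTOFF]
    return learning, test


# Hypothetical records, for illustration only.
records = [
    ("md5-a", date(2011, 3, 14)),
    ("md5-b", date(2011, 8, 31)),
    ("md5-c", date(2011, 9, 1)),
    ("md5-d", date(2012, 1, 20)),
]
learning, test = split_malware(records)
```

Splitting by date rather than at random keeps every Test sample newer than every Learning sample, which is what allows the Test malware to be treated as unknown.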

Result of the evaluation
Table 7 shows the result of the experiment. The ratio of correct detection is 91.4% for benign samples, 87.5% for malware samples, and 90.0% in total. This result indicates that the proposed method can accurately classify Android applications. The malware samples used as Learning data consist only of those whose first detected time is earlier than that of any Test data sample. The proposed method therefore extracts essential information from manifest files successfully, even though it learns only from old samples first detected before September 2011, and it can detect unknown malware samples.

Discussion
Some malware samples were not detected by the proposed method. We found that the proposed method was inadequate for detecting adware samples. Apart from actions that display advertisements excessively, there is often only a marginal difference between a benign application and adware. As a result, their manifest files appear similar, and it is difficult for the proposed method to detect adware effectively on the basis of manifest analysis.

Conclusion and Future Work
This paper proposed a new detection method for Android malware. The advantage of the new method is that it uses only manifest files to detect malware. Manifest files are required in all Android applications, and thus the proposed method is applicable to all Android applications. Our results show that the proposed method can detect unknown malware samples that are undetectable by a simple signature-based approach. Moreover, the cost of analyzing only the manifest file is extremely low. The new method can also be combined with other methods to realize even more precise detection.
Our evaluation uses only a small number of samples, 365 in total. In the future, we plan to collect additional samples to obtain more precise results in the evaluation experiments.
The proposed method extracts six types of information from manifest files and uses them to detect Android malware. These information items can easily be changed, and we should closely observe trends in Android malware to determine whether to keep or revise the information items extracted from the manifest file.

Figure 3. Permission keywords in each set of 30 samples.

Condition 1:
- Malignancy score is greater than the threshold value determined by Weka.
- Count of Intent filter (priority) is greater than the threshold value.
- Count of redefined permissions is greater than the threshold value.
Condition 2:
- Malignancy score of (2) Intent filter (action) is negative (< 0).
- Malignancy score of (3) Intent filter (category) is negative (< 0).
Criteria formula:

Table 2. Average number of files.

Table 6. Number of samples used in the experiment.

Table 7. Result of the experiment.