Android malware detection with MH-100K: An innovative dataset for advanced research

High-quality datasets are crucial for building realistic, high-performance supervised malware detection models. Currently, one of the major challenges for machine learning-based solutions is the scarcity of datasets that are both representative and of high quality. To foster future research and provide up-to-date public data for comprehensive evaluation and comparison of existing classifiers, we introduce the MH-100K dataset [1], an extensive collection of Android malware information comprising 101,975 samples. It encompasses a main CSV file with valuable metadata, including the SHA256 hash (APK's signature), file name, package name, Android's official compilation API, 166 permissions, 24,417 API calls, and 250 intents. Moreover, the MH-100K dataset features an extensive collection of files containing useful metadata from the VirusTotal analysis. This repository of information can serve future research by enabling the analysis of antivirus scan result patterns to discern the prevalence and behaviour of various malware families. Such analysis can help to extend existing malware taxonomies, identify novel variants, and explore malware evolution over time.


Specifications Table
• Furthermore, the provided data can serve as a performance benchmark for advancing state-of-the-art machine learning methods in the field of malware classification.

Data Description
The MH-100K dataset is an extensive repository containing 101,975 Android samples, with 9,800 categorized as malicious applications using a threshold of at least 4 positive scanners from VirusTotal analysis. The primary component of this dataset is a central CSV file containing essential metadata, including the SHA256 hash (representing the APK's digital signature), file name, package name, Android's official compilation API, 166 permissions, 24,417 API calls, and 250 intents. Additionally, the MH-100K dataset encompasses a substantial collection of files that collectively store the results of the VirusTotal analysis [2].
This dataset offers valuable potential for future research, enabling the analysis of trends in antivirus scan results to gain deeper insights into the presence and behaviour of diverse malware families. Such analysis can facilitate the development of malware taxonomies, the identification of novel variants, and the study of malware evolution over time.
A description of the MH-100K metadata is provided in Table 1. The "SHA256" attribute holds the hexadecimal value of the cryptographic hash as a string. The "NAME" refers to the name of the application. The "PACKAGE" is a unique identifier for each application. The "TARGET_API" denotes the API level against which the application has been tested, whereas the "MIN_API" identifies the lowest Android API level required for the program to function.
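As an illustration of how these attributes might be consumed, the following minimal sketch parses one metadata row into a typed record. It assumes that the CSV header uses the attribute names quoted above (SHA256, NAME, PACKAGE, TARGET_API, MIN_API) and that both API levels are always present as integers; the sample row, the `AppMetadata` class, and the `parse_metadata` helper are hypothetical and are not part of the dataset tooling.

```python
import csv
import io
import re
from dataclasses import dataclass

# A SHA256 digest rendered as a string is exactly 64 hexadecimal characters.
SHA256_RE = re.compile(r"^[0-9a-fA-F]{64}$")


@dataclass
class AppMetadata:
    sha256: str
    name: str
    package: str
    target_api: int
    min_api: int


def parse_metadata(row: dict) -> AppMetadata:
    """Convert one CSV row (column names assumed) into a typed record."""
    sha = row["SHA256"].strip()
    if not SHA256_RE.match(sha):
        raise ValueError(f"not a SHA256 hex digest: {sha!r}")
    return AppMetadata(
        sha256=sha.lower(),
        name=row["NAME"],
        package=row["PACKAGE"],
        target_api=int(row["TARGET_API"]),
        min_api=int(row["MIN_API"]),
    )


# Made-up row for illustration only; it is not taken from MH-100K.
SAMPLE_CSV = (
    "SHA256,NAME,PACKAGE,TARGET_API,MIN_API\n"
    + "ab" * 32
    + ",Example App,com.example.app,30,21\n"
)

records = [parse_metadata(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(records[0])
```

If the real file contains rows with a missing API level, the integer conversion above would need a fallback.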
Table 2 includes examples of app permissions, a set of rules that grant or deny activities such as access to specific features or data. Android uses this system to ensure that apps may only access data and conduct actions the user has expressly authorized. These permissions therefore govern which device capabilities or data an app can reach. The complete list of permissions is available in the Malware Hunter repository.
A second kind of feature is the intent action, which can indicate application or system states such as completed boot or low battery. In Table 3 we provide a few examples of intent actions. Developers can register their apps to listen to these intents to ensure responsive and dynamic behaviour in line with the device's status or user actions. The complete list can be found in the repository.
Table 4 shows a selection of API method calls from the Android platform, which provide tools and functionality for developers to interface with various Android systems and device features. Android developers can use the methods and functions the Android system offers to create, enhance, or enable specific application features through API calls. Through these calls, applications can engage with system components, access system resources, connect with other apps, or use specific functionality. In our dataset, we have 24,417 API calls.

Experimental Design, Materials and Methods
The entire dataset generation pipeline, including feature extraction, data labelling, and dataset creation, is implemented by two Python-based tools, AMGenerator and AMExplorer, which we developed explicitly for this purpose [3]. We used these tools to generate the dataset.
Fig. 1 provides an overview of AMGenerator (1) and AMExplorer (2). AMGenerator consists of three primary modules: Acquisition, Extraction, and Labelling. The AMExplorer tool, in turn, uses the data and metadata generated by each AMGenerator module to construct the dataset. The Acquisition module accepts a list of cryptographic hashes of applications as input and proceeds to download each one. These applications were sourced from the AndroZoo repository [2]. Applications that were successfully downloaded are added to the feature extraction queue. If the download cannot be completed, the application is discarded and will not be used as a sample in the dataset.
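The acquisition step described above can be sketched as follows. This is a minimal illustration and not the AMGenerator implementation: the `acquire` function and the injectable `fetch` callable are hypothetical names, and the actual download from AndroZoo (which requires an API key) is left to whatever `fetch` the caller supplies.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional, Tuple


def acquire(
    hashes: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
) -> Tuple[Deque[Tuple[str, bytes]], List[str]]:
    """Try to download each APK; queue successes and record failures.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that returns the APK bytes, or returns None / raises OSError on failure.
    """
    extraction_queue: Deque[Tuple[str, bytes]] = deque()
    discarded: List[str] = []
    for sha256 in hashes:
        try:
            payload = fetch(sha256)
        except OSError:
            payload = None
        if payload:
            # Successful downloads move on to feature extraction.
            extraction_queue.append((sha256, payload))
        else:
            # Failed downloads are excluded from the dataset.
            discarded.append(sha256)
    return extraction_queue, discarded


# Stand-in fetcher backed by an in-memory dict, for illustration only.
fake_store = {"aa" * 32: b"apk-bytes"}
queue, dropped = acquire(["aa" * 32, "bb" * 32], fake_store.get)
print(len(queue), dropped)
```

Passing the downloader in as a parameter keeps the queueing logic testable without network access.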
During the second stage of the pipeline, features are extracted using AndroGuard [4], a specialized tool and Python library designed to extract information from APK files. Each Android app is unpacked during extraction, yielding intermediate code for analysis and extraction (e.g., building a graph of API calls). This module produces a data file containing information on the application's attributes, which is then forwarded to the dataset-building data queue.
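For orientation, a minimal sketch of manifest-level extraction with AndroGuard is given below. It is not the AMGenerator code: the helper names `load_apk` and `summarize` are hypothetical, the output keys merely mirror the metadata attribute names, and only manifest fields are read (API-call and intent extraction are omitted). The import is attempted under both module paths because the location of the `APK` class differs between AndroGuard 3.x and 4.x.

```python
from typing import Any, Dict


def load_apk(path: str) -> Any:
    """Open an APK with AndroGuard, imported lazily so this sketch loads without it."""
    try:
        from androguard.core.apk import APK  # module path in AndroGuard 4.x
    except ImportError:
        from androguard.core.bytecodes.apk import APK  # module path in AndroGuard 3.x
    return APK(path)


def summarize(apk: Any) -> Dict[str, Any]:
    """Collect manifest attributes from an AndroGuard APK object."""
    return {
        "PACKAGE": apk.get_package(),
        "NAME": apk.get_app_name(),
        "MIN_API": apk.get_min_sdk_version(),
        "TARGET_API": apk.get_target_sdk_version(),
        # Deduplicate and sort so repeated runs give a stable record.
        "permissions": sorted(set(apk.get_permissions())),
    }


# Usage, assuming AndroGuard is installed and the (hypothetical) file exists:
# record = summarize(load_apk("sample.apk"))
```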
The Labelling module acquires information about an APK from VirusTotal [5], enabling users to determine later whether an APK is benign or malicious. VirusTotal provides dozens of scanners; for instance, as of July 2023, it provided access to over 60 scanners for APK categorization. Each scanner service yields a result for the APK, and given potential disagreements amongst different scanners, a decision-making process (e.g., majority rule) becomes necessary. Users can define their own threshold (e.g., 5 scanners) for labelling an APK as malware, or leverage solutions such as Maat [6], which provides means for automatically labelling samples based on VirusTotal's metadata.
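Threshold-based labelling of this kind can be sketched in a few lines. The input shape used here, a mapping from scanner name to a boolean verdict, is an assumption made for illustration and does not reflect the format of the VirusTotal files; the scanner names and the `label_sample` helper are hypothetical. The default of 4 positive scanners matches the threshold used for MH-100K.

```python
from typing import Mapping


def count_positives(verdicts: Mapping[str, bool]) -> int:
    """Number of scanners that flagged the sample."""
    return sum(1 for flagged in verdicts.values() if flagged)


def label_sample(verdicts: Mapping[str, bool], threshold: int = 4) -> int:
    """Return 1 (malware) if at least `threshold` scanners flagged the sample, else 0."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    return 1 if count_positives(verdicts) >= threshold else 0


# Hypothetical scanner verdicts: four positives out of five scanners.
example = {
    "scanner_a": True,
    "scanner_b": True,
    "scanner_c": False,
    "scanner_d": True,
    "scanner_e": True,
}

print(label_sample(example))               # threshold 4 -> 1
print(label_sample(example, threshold=5))  # threshold 5 -> 0
```

Because the threshold is a parameter, the same scan results can be relabelled under stricter or looser criteria without downloading them again.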
Lastly, the AMExplorer tool (Fig. 1) explores and compiles output data from the three AMGenerator modules to create the final dataset. The dataset includes binary values (0 or 1) for commonly used Android malware classification features, such as permissions, intents, and API calls.
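The binary encoding can be illustrated with a minimal sketch, which is not the AMExplorer implementation. Each application is represented by the set of feature names extracted from it, the column order is taken as the sorted union of all names seen, and each row holds 1 where the feature is present and 0 otherwise. The helper names and the two sample applications are hypothetical; the permission and intent names are standard Android identifiers.

```python
from typing import Iterable, List, Set


def build_vocabulary(samples: Iterable[Set[str]]) -> List[str]:
    """Sorted union of all feature names, used as the column order."""
    vocabulary: Set[str] = set()
    for features in samples:
        vocabulary |= features
    return sorted(vocabulary)


def encode(features: Set[str], vocabulary: List[str]) -> List[int]:
    """One row of the dataset: 1 if the feature is present, else 0."""
    return [1 if name in features else 0 for name in vocabulary]


# Two made-up applications with a few permissions and intent actions.
apps = [
    {"android.permission.INTERNET", "android.intent.action.BOOT_COMPLETED"},
    {"android.permission.INTERNET", "android.permission.READ_SMS"},
]

columns = build_vocabulary(apps)
rows = [encode(app, columns) for app in apps]
for row in rows:
    print(row)
```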
Malicious and Benign APK feature set
Data collection: We downloaded the APK files from AndroZoo and the scanners' analysis from VirusTotal. AndroZoo's data can be accessed through the following link: https://androzoo.uni.lu/access. The VirusTotal API overview is available at https://developers.virustotal.com/reference/overview.
Data source location: The data was collected from the extensive selection of Android applications accessible on AndroZoo. The sampling period spans 13 years, from 2010 to 2022, resulting in comprehensive global coverage.
Value of the Data
• The dataset includes 101,975 Android application samples, each represented by several features and a class label, suitable as machine learning input.
• This dataset is valuable for machine learning, data mining, and computer security researchers, as well as practitioners. Enthusiasts in computer security can leverage the data to improve malware detection systems.
• This dataset can help in building more reliable classification models, offering easily extractable attributes that benefit both researchers and practitioners.

Table 1
List of metadata attributes and descriptions.

Table 3
Example of Intents included in MH100K dataset.

Table 4
Example of API calls included in MH100K dataset.