Identifying Major Research Areas and Minor Research Themes of Android Malware Analysis and Detection Field Using LSA

Figure 4: Dataset preparation using 3C's Formula.


Introduction
Data are ubiquitous, whether on blogs, social media platforms, discussion forums, reviews, literature, or research studies. Extracting information from such multidimensional data is both important and challenging. There is a paradigm shift in knowledge transfer among different subareas of the research field. Systematic reviews can be conducted either manually [1] or with semiautomated methods [2][3][4]. Manual reviews are more critical and can be biased [5]. The selection of the focus area, the selection of attributes, and the interpretation depend entirely on the expertise of the reviewer. Elaborating present trends and forecasting future directions from the existing literature is both challenging and time-consuming for systematic manual reviews. In contrast, semiautomated methods are more generic in finding trends [6]. Deploying machine learning techniques within semiautomated review methods can help researchers obtain a dynamic review of any literature of choice.
This manuscript offers an empirical overview of contemporary machine learning methods that have the potential to expedite evidence synthesis within research literature, using the Simulating Expert comprehension for Analyzing Research trends (SEAR) framework. SEAR deploys humanlike intelligence to manage knowledge and information effectively. The framework leverages information modeling techniques to simulate how humans read, understand, and interpret the meaning of words, and how they map semantic relationships in text. The proposed SEAR framework has been deployed as TRENDMINER. As a use case, a corpus pertaining to Android security was used. During the last decade, malware has been propagating at a tremendously high rate using persistent and sophisticated techniques [7]. This situation has led researchers to devise various analysis, detection, and mitigation methods, resulting in a substantial body of literature. Continuous research on the Android platform and its malware has produced a very large literature. This research literature has offered numerous research prospects and has raised contemporary challenges within the domain. To the best of our knowledge, no prior work investigates those challenges and research directions using semiautomated machine learning-based methods. Unlike previous works, this study goes beyond a generic survey of mobile attack vectors or defenses [8][9][10][11]. Instead, it is oriented around emerging research trends and suggests future directions using quantitative semiautomatic approaches. With respect to the technique employed and the dataset chosen, this study intends to answer the research questions framed by the research community [12].
LSA was found to be appropriate for this work, as it has been successfully deployed by various researchers to analyze research trends in domains such as Volunteered Geographic Information [13], Building Information Modeling [6], Supply Chain Management [14], and OpenStreetMap [5]. Several studies have demonstrated the validity of LSA in constructing a framework that leverages semantic-driven analysis to recognize and infer information from content. Semantic-driven analysis understands the text structure, the words, and the topic discussed in the document [15][16][17][18][19][20][21][22][23][24][25][26][27][28][29]. LSA is dependably effective in information retrieval and query optimization. It recognizes all of the contexts in which a word could appear and establishes a common factor to represent underlying concepts. Research in psychology suggests that LSA mirrors the way the human mind extracts semantics from text. The authors in [30] proposed a method called word2vec-based LSA as a new topic modeling technique to study trends in blockchain technology. Their proposed methodology was composed of neural network-based word embedding and spherical K-means clustering. They also discussed the downsides of traditional methods such as bibliometric and frequency-based analysis, and compared their results with PLSA. In their findings, PLSA did not succeed in capturing the context of the documents, whereas their proposed methodology was able to capture the context on real data. The authors in [31] reviewed various theoretical aspects of LSA and spatial models. They discussed the characteristics and properties that make LSA a suitable topic modeling technique. They also revealed some limitations and misunderstandings related to LSA. They argue that LSA has come a long way in providing good results compared to other models. As future scope, they mentioned that the fusion of different models tends to produce a coherent ecosystem.
The authors in [32] performed text mining using LSA and nonnegative matrix factorization (NNMF). They discussed the strength of LSA in processing a highly sparse term-document matrix with low computational overhead. They discussed the stability of results and the clustering performance obtained while deploying LSA in their methodology. They also integrated K-means for cluster formation in their proposed methodology. In [33], the authors utilized LSA to study memory reconstruction. LSA was applied to test whether sleep reduces the semantic coherence of memory recall. In [34], the authors attempted to deploy kernel matrix estimation using LSA to increase the sharpness of blurred images. The authors in [35] examined the applicability of LSA in determining problems in aerospace science. The authors in [36] utilized LSA to extract features across different knowledge domains such as information systems and operations management. In [37], the authors studied the impact of technology-enhanced learning in higher education. The topics were discovered and analyzed from a corpus related to technology-enhanced learning. The authors in [38] proposed a new taxonomy and future research directions in Industry 4.0 using LSA. Various research themes related to the field were discovered and discussed.
Android security is an interesting area to explore. Malware authors tend to plant malicious code inside legitimate applications to pursue their unscrupulous motives. The continuing proliferation of malware has led the research community to perform various studies related to Android malware detection and analysis techniques. Traditional methods such as bibliometric analysis or frequency-based analysis focus on quantitative analysis but not on qualitative analysis [30]. These approaches are highly effort-demanding and time-consuming when used for trend analysis. The authors need to perform a full-text investigation to study the trends in the Android security field [39][40][41]. These approaches did not reveal the insights of the literature, as they consider limited databases with limited time frames. Topic modeling techniques such as Latent Semantic Analysis (LSA) have confirmed their usefulness in producing comprehensive and detailed trend analysis. The studies in [42][43][44][45] have used topic modeling to identify research trends to a great extent and have shown advantages over traditional methods. Table 1 shows the comparison of LSA with other topic modeling techniques. LSA focuses on revealing the diverse topics that emerged during the given timeline and provides a quantitative and qualitative evaluation. The results produced by LSA help the practitioner to pursue various potential research opportunities. LSA is applied on top of the TF-IDF-weighted term-document matrix to drastically reduce the vector size and capture latent topics in the corpus, while being able to infer relationships between relevant terms and their respective documents without loss of context. The remainder of the paper is arranged as follows: Section 2 gives a brief introduction to the SEAR framework. Materials and methods are discussed in Section 3. Section 4 discusses the research questions and examines potential future research directions.
Section 5 examines the outline of the proposed solution as an implication of future examination while Section 6 discusses the limitation of the investigation. Conclusions and findings are discussed in Section 7. Section 8 discusses the practical implications and future avenues of the research.

Proposed SEAR Framework
The proposed SEAR framework operates in the sequence given in Figure 1.
Step 1: this step involves data gathering methods, creation of repository and XML parser, and conversion of documents to text files.
Step 2: this step involves data preprocessing of the corpus. Stop words and punctuations should be removed from the dataset, and it should be normalized before performing any text mining task.
Step 3: this step implements the TF-IDF and SVD technique, discussed in the further sections.
Step 4: this step involves the identification of core research areas and research trends. It also focuses on the mapping of research trends to the research areas.
The SEAR framework utilizes a semantic analysis technique called LSA. It is a well-established algorithm for converting unstructured raw textual data into organized information objects and for further analyzing these objects to recognize patterns [2,46,47]. It employs a systematic and comprehensive approach to uncover the research trends in a vast literature dataset [3,21,24,25,48,49,50,51,52]. This study aims to map the semantic relationship between documents and terms in a large corpus to reveal the varied contextual latent classes using LSA. The steps in applying LSA to the Android security corpus are identical to those of previously reported studies [3,51,53,54,55,56,57]. The following sections discuss the detailed procedure of this study.

The Use Case of SEAR Framework: TRENDMINER
TRENDMINER is the use case of the SEAR framework, which takes text documents as input, as shown in Figures 2 and 3. The dataset of 444 abstracts is considered sufficiently large for performing text mining, as explained in [3]. The Python 3.7 programming language was used to perform all the experimentation. Table 2 shows the software versions used in our work. The machine used for the experimentation was configured with an Intel Core i5 6200U at 2.4 GHz and 8 GB of RAM. Once the literature dataset on Android security is successfully uploaded to TRENDMINER, it is fed to Latent Semantic Analysis (LSA), which is the backbone of TRENDMINER. LSA is a text data mining and natural language processing technique used to retrieve and query a massive corpus of literature [51,56,58]. As a scientific and measurable strategy, LSA is utilized to recognize the latent concepts inside the textual data at the semantic level [59][60][61][62][63].

Step 1: Data Acquisition.
This section presents the keywords, search strategy, and selection criteria used for preparing the large corpus. Reputed databases were used for the collection of research articles on Android security. Inclusion and exclusion criteria were applied to refine the search results and obtain relevant research articles. The repository was built to achieve uniformity across the research articles.

Task A: Dataset Preparation.
The first task was to prepare the literature dataset for TRENDMINER. The approach followed for collecting the literature dataset primarily focused on the structure of Android applications and the probable vulnerabilities within existing application development, along with the methods adopted for malware identification and mitigation. The strategy adopted for searching and selecting literature is defined by the 3C's Formula, depicted in Figure 4.

Table 1 (fragment): Comparison of LSA with other topic modeling techniques.
(Technique name not recovered.) Limitation: unable to perform document-level modeling. Applications: (i) automation in essay grading; (ii) automation in question recommendation.
Latent Dirichlet Allocation. Characteristics: provides a multinomial distribution across words and a Dirichlet distribution over topics; capable of handling long documents. Limitation: cannot predict relations among topics.
Correlated Topic Model. Characteristics: uses a logistic normal distribution for topic clustering; also produces topic graphs. Limitations: complex computation is involved in its processing; too many generic words may lead to inefficiency.

(1) Component 1: Keywords. The articles were selected using keywords such as "malware," "vulnerability," "security," "privacy," "monitoring," "application," "smartphone," "android," "virus," "static," "dynamic," "detection," and "data flow."
(2) Component 2: Search Strategy. TRENDMINER considered reputed research from prominent databases such as IEEE Xplore, ACM Digital Library, Science Direct, Springer, and Google Scholar, which were queried to collect high-quality papers on Android malware analysis and detection techniques. Scopus-indexed articles from these databases were duly included while searching the literature. Figure 5 illustrates the proportion of Scopus-indexed articles in our corpus.
(3) Component 3: Selection Criteria. Raw results from the databases mentioned above were refined based on the Android operating system. Papers on operating systems such as Symbian and iOS were discarded.

Task B: Creating a Repository for TRENDMINER.
Mendeley, a tool from Elsevier [64], was used to build the literature database. It provides a systematic way to retrieve the authors, years, and abstracts of all the research papers indexed in its file system and to export them as citations and XML tree structures. Parsing the resulting XML tree structure was one of the significant challenges during this study. A consistent naming convention for the whole literature dataset was necessary. Renaming the articles using attributes common to all research documents makes them significantly easier to handle later.
A module known as the XML Parser was developed in TRENDMINER. The purposefully generated XML corpus was parsed into a more structured format, i.e., comma-separated values (CSV). Figure 6 shows the generic conversion process flow. The exported files contain metadata such as authors, year of publication, and publishers. The following observations were made during the preliminary analysis of the corpus. Based on the number of occurrences in the dataset, the researchers with the most publications on Android security during the period 2010-2019 were calculated and are presented in Figure 7. Figure 8 shows the top fifteen journals publishing articles related to Android security. Figure 7 shows that the top authors were Wang, Xiaofeng and Jiang, Xuxian, with 13 publications, with Zhou, Yajin closely following with 12. The graph was obtained from the analysis performed on the chosen dataset, as described above. Figure 8 identifies Computers and Security (Elsevier) and IEEE as the top publishers of research in the Android malware and security field. NDSS, Springer, and ACM closely follow them.
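The XML-to-CSV conversion described above can be sketched as follows. This is a minimal illustration only: the element names (`record`, `title`, `year`, `publisher`, `author`) and the sample entries are hypothetical placeholders, and the actual Mendeley export schema and the TRENDMINER parser may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical export layout; the real Mendeley XML schema may differ.
SAMPLE_XML = """<records>
  <record>
    <title>Example Title 1</title>
    <year>2014</year>
    <publisher>Publisher A</publisher>
    <authors><author>Last1, First1</author><author>Last2, First2</author></authors>
  </record>
  <record>
    <title>Example Title 2</title>
    <year>2016</year>
    <publisher>Publisher B</publisher>
    <authors><author>Last3, First3</author></authors>
  </record>
</records>"""

FIELDS = ["title", "year", "publisher", "authors"]


def xml_to_rows(xml_text):
    """Turn each <record> element into a flat dictionary of metadata."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter("record"):
        authors = [a.text.strip() for a in record.iter("author") if a.text]
        rows.append({
            "title": record.findtext("title", default="").strip(),
            "year": record.findtext("year", default="").strip(),
            "publisher": record.findtext("publisher", default="").strip(),
            "authors": "; ".join(authors),  # several authors share one CSV cell
        })
    return rows


def rows_to_csv(rows):
    """Serialize the metadata rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Counting how often each name occurs in the `authors` column of such a file is one way the per-author publication counts could be derived.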

Task C: Parsing the PDF Documents to Text.
Conversion of PDF to text was subsequently performed to make the dataset input-ready and compatible with TRENDMINER. Various tools are available for the conversion process, namely, PDFMiner, Tika, and Textract. PDFMiner [65] was chosen for the experimental study because of the following significant benefits: (i) PDFMiner can obtain the exact location of text on a page along with information such as fonts or the number of lines. (ii) It facilitates the conversion of PDF files into other text formats (such as HTML). (iii) It provides accurate results even under demanding conditions such as parsing a large corpus.

3.2. Step 2: Preprocessing the Text Files.
After the successful conversion to text files, the next step was to employ preprocessing procedures. The preprocessing module in TRENDMINER helps to gain quality information from the text by applying appropriate preprocessing techniques. For any text mining algorithm, the preprocessing of the collected dataset is an essential step [66,67]. This involves the removal of names, numbers, abbreviations, slang, acronyms, punctuation, and N characters as recommended in [3].
Preprocessing of the corpus involves the execution of the following procedure, developed in Python using the NLTK (Natural Language Toolkit) package [68].

Task A (Tokenization).
In this step, large chunks of text were tokenized into sentences, and the sentences were then tokenized into words.

Task C (Normalization).
Normalization is applied to the words to introduce uniformity and maintain consistency among the text documents. The task of normalization is composed of several subtasks such as removing punctuation from the text, converting all content to a single case (either uppercase or lowercase), and converting numbers to words. Normalization puts all words on an equal footing to allow smooth processing of the textual data.

Task D (Stemming and Lemmatizing).
For further processing of documents, the dictionary size has to be reduced and the dictionary should be populated with unique words. Stemming and lemmatizing are techniques applied to words to reduce inflection. The idea is to reduce the words to a common root form. In stemming, the base form is known as the stem, while in lemmatizing, it is known as the lemma. Stems might not be actual words, whereas lemmas are actual words of the language. These two techniques help in achieving faster processing of text documents.

Task E (Character Filtering).
All words less than length 4 were omitted [3].
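The tokenization, normalization, and character-filtering tasks above can be illustrated with a small standard-library sketch. The study itself used NLTK; the regular expressions, function names, and sample sentence below are assumptions made for illustration, and this sketch simply drops digits rather than converting them to words.

```python
import re

MIN_LENGTH = 4  # Task E: words shorter than four characters are omitted


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic runs only.

    This removes punctuation and digits in one pass (a simplification).
    """
    return re.findall(r"[a-z]+", sentence.lower())


def preprocess(text, min_length=MIN_LENGTH):
    """Sentence split, word tokenize, normalize, then filter short words."""
    tokens = []
    for sentence in split_sentences(text):
        tokens.extend(w for w in tokenize(sentence) if len(w) >= min_length)
    return tokens


sample = "Android malware is spreading fast. Static analysis can flag 3 risky apps!"
tokens = preprocess(sample)
print(tokens)
```

Stop-word removal and stemming or lemmatizing would slot in between tokenization and the length filter; they are left out here to keep the sketch free of external dependencies.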
Note that the initial dataset contained 60,184 tokens, which represents the size of the vocabulary over the entire corpus. Before the dataset is fed to the subsequent computational steps, it has to be nonredundant and free from noise. After applying the preprocessing procedures discussed previously, the word list retained 1944 tokens. In this study, the 444 documents and the resulting word list of 1944 tokens represent columns and rows, respectively. A term frequency matrix is created in which each term maps to its count of occurrences in each document. Furthermore, this matrix is transformed into a weighted matrix using the TF-IDF weighting scheme.

Step 3: Data Analysis Using Information Modeling and Machine Learning Techniques.
This work makes use of an information modeling technique to expedite the data analysis process over the corpus. By combining information modeling and machine learning techniques, human-interpretable topics can be extracted from a document corpus. Machine learning approaches enhance the ability of information modeling techniques by allowing researchers to intelligently extract and manage crucial information to make smart decisions. Deploying Latent Semantic Analysis (LSA) as the information modeling technique can automatically identify topics and unveil hidden patterns in a vast corpus of data. LSA uses the matrix method called Singular Value Decomposition (SVD) to construct a low-rank approximation of a large data matrix. SVD is the major strength of LSA and one of the basic machine learning algorithms. It reduces the dimensions of the data without losing a significant amount of information. The main idea is to apply LSA to a document set and then apply an unsupervised machine learning approach to the reduced-dimension set to group similar documents according to their topic areas. K-means, an unsupervised machine learning approach, is fitted on the LSA output to uncover the latent structure of the corpus.

Task A: From Documents to Matrices-TF-IDF (Term Frequency Inverse Document Frequency).
In this study, a mapping needs to be established from the documents to the latent topics that they relate to. For that, the most important words have to be identified, which can later lead to the discovery of latent topics. TRENDMINER leverages the technique called Term Frequency Inverse Document Frequency (TF-IDF). Other weighting methods are also available for the analysis. The most common weighting schemes are TF-IDF and log-entropy. As per the study in [3], a potential weakness of log-entropy was discovered: it proved to be biased towards high-frequency terms in the dataset. For instance, log-entropy produces better results with article titles or documents with short text. TF-IDF performs better in discovering patterns in large semantic spaces with larger groups of terms. Motivated by this finding, we utilized TF-IDF as the weighting method in the study. The Latent Semantic Analysis (LSA) topic model algorithm requires a document-term matrix as its main input. TF-IDF helped in maintaining a document-term matrix that describes the frequency of the terms that occur in a collection of documents. The documents and words in the matrix correspond to columns and rows, respectively. TF-IDF has been widely used for better topic analysis. The resulting document-term matrix of the example is presented in Table 3.
(1) TF (Term Frequency). It computes the normalized Term Frequency (TF), which is the number of times a term appears in a document divided by the total number of terms in that document; refer to equation (1). The TF matrix is shown in Table 4:

$$\mathrm{TF}(t, d) = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}. \qquad (1)$$

(2) IDF (Inverse Document Frequency). It estimates how significant a term is. IDF is computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the particular term appears. Certain terms, for example, "is," "of," and "that," or domain-specific words, may appear many times but carry little significance. There is therefore a need to weigh down the frequent terms while scaling up the rare ones, by computing equation (2). The IDF matrix is presented in Table 5:

$$\mathrm{IDF}(t) = \log \frac{N}{\text{number of documents containing term } t}. \qquad (2)$$

Equation (3) presents the TF-IDF scores:

$$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t) = \mathrm{TF}(t, d) \times \log \frac{N}{\text{number of documents containing term } t}. \qquad (3)$$

In equation (3), t denotes the terms, d denotes each document, and N denotes the total number of documents. Consider Table 6, which presents the document-term matrix with TF-IDF scores for the previously stated example. A term has a large weight when it occurs frequently within a document but infrequently across the corpus. The word malware may appear often in a document, but since it is probably quite common in the remainder of the corpus, it receives a comparatively low weight. To reveal the relationship between the words and documents and to capture the latent themes inside the Android security dataset, dimensionality reduction must be performed, as examined in the following section.
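As a worked illustration of the TF, IDF, and TF-IDF definitions above, the following sketch computes the weights for a toy corpus. The three token lists are invented for illustration, the natural logarithm is an assumption (the text does not state the base), and the functions are not the TRENDMINER implementation.

```python
import math

# Toy corpus of three already-preprocessed documents (illustrative only).
docs = [
    ["malware", "detection", "android", "malware"],
    ["android", "permission", "analysis"],
    ["static", "analysis", "android"],
]


def term_frequency(term, doc):
    """Occurrences of the term in the document over the document length."""
    return doc.count(term) / len(doc)


def inverse_document_frequency(term, corpus):
    """log(N / number of documents containing the term); natural log assumed.

    The term must occur in at least one document, otherwise this divides by zero.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tf_idf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)


# Rows are terms and columns are documents, matching the layout in the text.
vocabulary = sorted({term for doc in docs for term in doc})
matrix = {term: [tf_idf(term, doc, docs) for doc in docs] for term in vocabulary}

for term, weights in matrix.items():
    print(f"{term:<12}", [round(w, 3) for w in weights])
```

In this toy corpus "android" occurs in every document, so its IDF is log(3/3) = 0 and its weight vanishes everywhere, whereas "malware" is concentrated in a single document and receives the largest weight there.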

Task B: Learning Latent Relationships between Documents Using SVD (LSA).
Utilizing SVD, two sets of loading matrices were produced as the output of LSA. One is a document-to-topic matrix and the other is a term-to-topic matrix. The topic solutions correspond to the number of research themes in the literature dataset. A high term or document loading in a matrix cell indicates that the specific term or document is more inclined towards a particular topic solution. The researcher can adjust the level of detail through the number of topic solutions when identifying research areas and trends. Smaller values of the topic solution represent common core research areas, and higher values of the topic solution represent principal research trends [51].
Truncated SVD is a matrix algebra technique that decomposes the TF-IDF matrix into a product of three matrices: U, Σ, and V. The SVD decomposition is shown in equation (4):

$$A = U \Sigma V^{T}. \qquad (4)$$

Here, A denotes the TF-IDF matrix, U denotes the document-to-topic matrix describing the relationship between documents and the different concepts, V denotes the term-to-topic matrix describing the relationship between concepts and terms, and Σ is composed of nonnegative numbers (the singular values).
Suppose d is the number of documents, t is the number of terms in the documents, and k is the hyperparameter indicating the number of topics to be extracted from the corpus. A_k is the low-rank approximation of matrix A and can be produced using truncated SVD as in equation (5):

$$A_k = U_k \Sigma_k V_k^{T}, \qquad (5)$$

where U_k is the document-to-topic matrix (d × k), V_k is the term-to-topic matrix (t × k), and Σ_k is the topic-to-topic matrix (k × k). Table 6 shows the transformed term frequencies after applying TF-IDF. The SVD procedure is applied to the TF-IDF matrix presented in Table 6. Tables 7 and 8 contain the factor loading values, which are arbitrarily positive and negative.
The set of terms and documents needs to be mapped to the latent topics. To interpret the meaning of the loading values, the technique known as varimax rotation was applied to the term and document loading matrices. The varimax rotation helps to uncover the best correlation of terms with the latent topics. The rotation magnifies the association of terms and documents with the latent topics. Furthermore, a threshold value needs to be selected to discover the significant terms, as discussed in [3,5]. An empirical probability distribution was utilized to select the threshold values for the different factor solutions. The loading values are transformed into a vector and sorted in descending order, and the threshold is defined so as to retain 1/n of the loadings, where n is the factor solution, as explained in [5,6]. For each factor solution, loading values are grouped by their absolute values to unveil latent topics. The application of LSA followed by an unsupervised machine learning approach, discussed further below, helps to identify topic solutions.
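The thresholding rule described above (sort the absolute loadings in descending order and retain the top 1/n for an n-factor solution) can be sketched as follows. The loading matrix is invented for illustration, the function names are hypothetical, and the handling of ties and rounding is an assumption rather than the procedure of [5,6].

```python
def loading_threshold(loadings, n_factors):
    """Cutoff such that roughly the top 1/n of absolute loadings are retained."""
    flat = sorted((abs(v) for row in loadings for v in row), reverse=True)
    keep = max(1, len(flat) // n_factors)  # integer share of loadings to keep
    return flat[keep - 1]


def significant_cells(loadings, n_factors):
    """(row, column) positions whose absolute loading reaches the cutoff.

    Ties at the cutoff value are all kept, so slightly more than 1/n may survive.
    """
    cutoff = loading_threshold(loadings, n_factors)
    return [
        (i, j)
        for i, row in enumerate(loadings)
        for j, value in enumerate(row)
        if abs(value) >= cutoff
    ]


# Hypothetical rotated loadings: 4 terms (rows) by 3 topics (columns).
loadings = [
    [0.90, -0.10, 0.05],
    [0.20, 0.80, -0.15],
    [-0.05, 0.10, -0.70],
    [0.30, -0.25, 0.60],
]

cutoff = loading_threshold(loadings, n_factors=3)
cells = significant_cells(loadings, n_factors=3)
print(cutoff, cells)
```

With 12 loadings and a three-factor solution, four loadings are retained, one or two per topic in this example; negative loadings are treated by magnitude, in line with the grouping by absolute value mentioned above.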
TRENDMINER is used to identify the core research areas and significant research trends in Android security, and an optimal value for k topic solutions has to be determined. Choosing an optimal value for k is always a challenge, because the larger the number of dimensions k, the greater the risk of introducing noise into the data [58,71]. At the same time, selecting a smaller value of k leads to losing important semantics. It is good practice to include a bigger k as an approach to deduce more trends or to classify many trends into a single category [72]. A k-iterative process has been applied to uncover the core research areas and the subclassification of their related trends. SVD provides the matrix of singular values, which are defined as the square roots of the eigenvalues. These values indicate the concept strength and are arranged in descending order. The k singular values are selected using a scree plot, as depicted in Figure 9. As illustrated in the study [24], the number of high-level topics must be chosen using an empirical approach that involves multiple trials of LSA. The number of factors in individual trials ranged from 2 to 10. After reviewing the high-loading terms and documents for each factor solution, experts decided to set three as the number of core high-level research areas. It should be noted that this choice also depends on the semantic space chosen for the experimentation.
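The relationship between singular values, eigenvalues, and the rank-k approximation can be checked numerically with NumPy. This is a sketch under stated assumptions: the small matrix is invented, k = 2 is arbitrary, and the cumulative share of squared singular values is only one common numerical companion to a scree plot, not the selection procedure used in the study.

```python
import numpy as np

# Toy weighted matrix A with d = 4 documents (rows) and t = 5 terms (columns).
A = np.array([
    [0.5, 0.0, 0.3, 0.0, 0.0],
    [0.4, 0.1, 0.0, 0.0, 0.2],
    [0.0, 0.6, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.1, 0.4, 0.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, with s returned in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values are the square roots of the largest eigenvalues of A^T A.
eigenvalues = np.linalg.eigvalsh(A.T @ A)[::-1][: len(s)]
roots = np.sqrt(np.clip(eigenvalues, 0.0, None))  # clip tiny negative round-off

# Truncate to k topics.
k = 2
U_k = U[:, :k]            # document-to-topic matrix, shape (d, k)
Sigma_k = np.diag(s[:k])  # topic-to-topic matrix, shape (k, k)
V_k = Vt[:k, :].T         # term-to-topic matrix, shape (t, k)
A_k = U_k @ Sigma_k @ V_k.T

# Share of total squared singular values captured by the first j topics.
cumulative_share = np.cumsum(s ** 2) / np.sum(s ** 2)

print("singular values:", np.round(s, 4))
print("cumulative share:", np.round(cumulative_share, 4))
print("approximation error:", round(float(np.linalg.norm(A - A_k)), 4))
```

The Frobenius norm of A minus A_k equals the square root of the sum of the discarded squared singular values, which is why a sharp drop in the scree plot marks a natural place to cut k.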

Task C: Topic Clustering.
As stated in [3], clustering and factor analysis are the two analytic steps involved in post-LSA procedures. The authors discussed the main considerations that would let practitioners, decision-makers, and researchers deploy these analytic steps as per their requirements. They focused on the fact that LSA has been used for both clustering and factor analysis purposes. Based on the semantic space created in this study, the domain experts decided to pursue the clustering technique. The clustering approach was implemented through the K-means algorithm. Machine learning can be employed on top of the results obtained from Latent Semantic Analysis to significantly reduce the manual effort required from a domain expert in assigning each document to its closest topic. K-means is an unsupervised machine learning technique generally used when the data points have no labels; it learns the groups based on the relative positions of the points in a vector space. The centroid feature weights may be used to identify the nature of each cluster while defining the groups, which may in turn be used to label new data [75,76]. K-means is easy to implement and can process extremely large samples [77]. Usually, the inputs to K-means are first passed through a dimensionality reduction algorithm. LSA and K-means are applied in sequence for the interpretation of the results, to find similar documents and their associations with the terms contained in the textual corpus [78][79][80]; this is done to recommend research papers corresponding to a particular topic label. The interpretation of the results obtained is domain-specific. For instance, if the data points are research articles on Android security in an extensive literature, K-means will segregate the documents into k subgroups. The research trends in the domain of Android security that are part of each subgroup or cluster have some common features, which are used for further analysis.
The number of clusters was chosen to be three, with the selection done iteratively. Note that choosing too few clusters may not reveal the actual underlying relationships, while too many clusters may capture noise, which would not be useful for any further analysis of the outputs. The output, in the form of a multidimensional array, is composed of the titles of all documents labeled with their respective cluster numbers. Taking the dot product of the components obtained from LSA with the cluster centroids, the results are sorted to show only the top topics corresponding to each cluster, which require sensible topic labeling as discussed in the next section.
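The clustering step above can be illustrated with a plain-Python version of Lloyd's algorithm for K-means. The two-dimensional document coordinates and the choice of initial centroids are hypothetical; in the study, the inputs would be the reduced LSA vectors, and the actual implementation may have used a library routine with a different initialization.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    moved = []
    for index, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == index]
        if not members:
            moved.append(old)  # an empty cluster keeps its previous centroid
            continue
        moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return moved


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = list(initial_centroids)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical document coordinates in a two-dimensional reduced space.
doc_vectors = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8), (-0.7, -0.6), (-0.8, -0.5)]
initial = [doc_vectors[0], doc_vectors[2], doc_vectors[4]]
labels, centroids = k_means(doc_vectors, initial)
print(labels)
print(centroids)
```

Fixing the initial centroids makes the sketch deterministic; with random initialization, K-means is usually run several times and the best partition kept.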

Task D: Topic Labeling.
The term-to-topic and document-to-topic matrices contain the values needed to uncover topics. Each cell in both matrices holds a loading value, and the loadings were sorted in descending order. The results obtained from the previous steps of TRENDMINER become the input for topic labeling. High-loading terms and documents were examined together, and sensible labels were assigned for the three- and twenty-seven-topic solutions, as shown in Figures 10 and 11. We implemented the Delphi method [81] to perform the topic labeling process. A graphical representation of the Delphi method is shown in Figure 12. Topic labeling is a collective intelligence task that draws on the most reliable opinions of a group of experts. The Delphi method is an iterative method that works under controlled monitoring and feedback mechanisms to build a robust consensus.

3.4. Step 4: Results and Findings.
As a result, three topic solutions representing the major core research areas were discovered, as shown in Figure 13: "Application Structure Analysis" (T3.1), "Static Level Monitoring" (T3.2), and "Automatic Malware Analysis" (T3.3). The word cloud for the three topic solutions is shown in Figure 14. These articles emphasized imperative techniques to analyze, detect, and assess Android malware. The outcomes showed that various high-loading publications converged on one research area, i.e., "Static Level Monitoring" (T3.2), in the three-topic solution. Static analysis is the most used analysis strategy for [...] technique was first explored by the researchers [83,84]. The former investigated the data flows in applications that violate the security policies stored in an application's configurations. The latter identified data leakage from sensitive sources of an application. In addition to static and dynamic methodologies, there exist a few hybrid approaches that take advantage of the strengths of both static and dynamic analysis. These techniques typically first apply static analysis to identify potential security threats in an Android system and then perform dynamic procedures to enhance their accuracy by eliminating false alerts. For instance, in [85], the authors first used static analysis to identify possibly vulnerable applications.

Task B: Identification of Android Security Research Trends and Task C: Core Research Areas and Trend Mapping.
TRENDMINER uncovered 27 core research trends, as displayed in Figures 15(a) and 15(b). Figures 10 and 11 show the relationship of core areas with the research themes. The mapping is based on similarity scores. Documents were first clustered into a smaller number of topic solutions, while the higher value was chosen later. The topics corresponding to the latter were to some degree related to the former, and this was checked using similarity scores. The similarity scores were computed by string matching, with the string similarities indicating the closeness of the low-value and high-value topic solutions. This was done to verify the understanding that the result obtained with a lower number of topic solutions would correspond to some extent to the result obtained with a comparatively higher number. The similarity scores present a clear connection between the core areas and their connected research trends.

(1) Application Structure Analysis (T3.1). The trends Metadata-Based Study (T27.4) and App Level Features (T27.2) revealed the use of metadata. This trend was found in the system named WHYPER [86], in which the researchers took the permissions requested by application developers and used natural language processing (NLP) algorithms to search for sentences in the application description that justify the need for the requested permissions. Similarly, in another work, the study of metadata was extended by accounting for additional information such as the number of an application's screenshots, price, category, title, developer ID, website, and promotional videos. Furthermore, the analysis of application metadata was performed using machine learning algorithms. The trend App Level Features (T27.2) reveals the use of CPU and memory usage to track malicious applications.
In the project named MADAM, running processes, CPU utilization, memory state, Wi-Fi, and Bluetooth of the device were considered to train the k-nearest neighbor algorithm for effective detection [87]. In the topic solution Permission-Based Analysis (T27.21), permissions are an indispensable component for the analysis of malicious applications, as most actions require explicit permissions in order to be performed [88]. Permissions are declared in the manifest file and are therefore easy to obtain. Numerous systems, developed in studies [86,89,90], use static analysis to evaluate the risks of the Android permission system and of individual applications. Another significant research trend, Analysis Based on Network Addresses (T27.1), focused on network addresses. Malware authors make use of network addresses to communicate with a command and control (C&C) server and send it the user's confidential information. Researchers identified IP addresses as one of the key static features for analysis [91][92][93].
Another research trend that emerged in this space is the Dex File Study (T27.7), which plays a vital role in understanding dex files, which are usually cumbersome for humans to interpret. To identify malicious code sections, researchers first decompile the dex code into more tractable formats such as assembly, Smali, Dalvik bytecode, source code, Jar, Jimple, or Java bytecode [94]. This trend can be further related to numerous articles and tools deployed by researchers for successful translation, such as dexdump [95], Pegasus [96], ded [97], SAAF [98], PScout [89], AppSealer [99], ded/DARE [100], dedexer [90], dex2jar [101], and FlowDroid [102]. The core research area revealed interesting research trends such as Data Flow Tracking (T27.6), Interprocedural Control Flow Graph (T27.16), and Graph-Based Analysis (T27.11). All of these trends relate to an interesting and pivotal branch in the field of static security mechanisms that identifies hijacking vulnerabilities in the Android ecosystem. Data Flow Tracking (T27.6), which deals with tracing the flow of sensitive information from the device to outside entities at the time of application execution [103][104][105][106][107], came out as an important and consistent topic. Data flow analysis and control flow analysis help in understanding dangerous functionality such as privacy leakage and communication service abuse [95,108,109] by tracking the flow of information across different points of execution.
Bytecode control flow graph analysis identifies all possible paths that an application can take while it is executed.
These deduced trends helped foster advanced analysis by creating a bytecode control flow graph (CFG) for intraprocedural analysis or interprocedural analysis (spanning multiple methods). The authors in [110] formalized the Dalvik bytecode to derive control flow analysis-based semantic signatures for detecting malware applications. The studies [89,95,96,102,104,108,111] leverage the trends Data Flow Tracking (T27.6), Interprocedural Control Flow Graph (T27.16), and Graph-Based Analysis (T27.11). The trend Intent Monitoring (T27.15) relates to the concept that intents declared in the application's manifest file can be used to leak data to C&C servers. Intents are the objects used to move from one activity to another by making use of widgets in an Android application. Starting an activity, starting a service, and delivering a broadcast are the three fundamental use cases of intents, which help establish communication between components in several ways.
This trend was found in popular studies [91,112]. The former employed numerous machine learning algorithms such as K-means, k-nearest neighbor, and naïve Bayes to analyze the intents, permissions, components, and APIs that were extracted from the manifest file. The latter employed support vector machines to detect malware and achieved a detection rate of 94%. Another trend, Hardware Component-Based Inspection (T27.12), reflects the analysis of hardware components listed in an application for static analysis. Researchers in [91] made use of the components declared in the manifest file for analysis. This can be effective, as malicious applications with a specific end goal tend to request access to hardware such as the camera, GPS, and microphone.
Estimation over String Matching (T27.8) was found to be another significant trend in this area, which covers the analysis of the various strings available in an Android application. The researchers in [113] stated that analyzing the strings available in Android files is one of the most widely used strategies for identifying malware. They used the Vector Space Model (VSM) [114] and represented the strings as vectors in a multidimensional space. In addition, they used distance measures such as Manhattan distance, Euclidean distance, and cosine similarity to learn irregularities in the data. The researchers evaluated the approach on 666 samples of Android applications and achieved 83.51% accuracy in their tests.
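The three measures named above can be illustrated with a short sketch over string-count vectors. The example strings, the vocabulary construction, and the helper names (`to_vector`, `manhattan`, `euclidean`, `cosine_similarity`) are assumptions chosen for illustration; the exact feature construction used in [113] is not described here.

```python
import math
from collections import Counter


def to_vector(strings, vocabulary):
    """Represent an app's extracted strings as a count vector over a shared vocabulary."""
    counts = Counter(strings)
    return [counts[token] for token in vocabulary]


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical strings extracted from two applications.
app_a = ["sendTextMessage", "getDeviceId", "http://example.com"]
app_b = ["getDeviceId", "getDeviceId", "openCamera"]

vocabulary = sorted(set(app_a) | set(app_b))
vec_a = to_vector(app_a, vocabulary)
vec_b = to_vector(app_b, vocabulary)

print(manhattan(vec_a, vec_b))
print(euclidean(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_b))
```

Manhattan and Euclidean distance grow with the raw count differences, whereas cosine similarity depends only on the direction of the vectors, so it is less sensitive to the total number of strings in an application.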
(3) Automatic Malware Analysis (T3.3). Figure 11 shows the research trends under T3.3. This core research area covers the research trends Pattern Assessment (T27.20), Input Matching (T27.14), Repackaged App Identification (T27.23), Formal Analysis (T27.10), and Machine Learning Approach (T27.17), which relate to automation in identifying Android malware. To gather a predefined set of application features, researchers first analyze the application statically or dynamically. They then build a detection model capable of distinguishing malware from benign applications based on the training dataset. The trend proved to be well explored and promising, as researchers used numerous combinations of different features such as API call sequences, permission requests, package information, hardware components, application categories, and network activity to build detection models, as reported in studies [91][115][116][117][118]. Another research trend that emerged was Repackaged App Identification (T27.23). Many articles related to this trend, such as [119], were published in recent years. DroidMoss [88], Droidsim [120], DNADroid [121], View-Droid [122], ResDroid [123], and AnDarwin [124] have attempted to tame the problem of repackaging. The trend Pattern Assessment (T27.20) uncovered the fact that an attacker can deduce sensitive information of the user by observing the behavioral pattern of shared resources. The impact of this trend has been seen in a variety of articles [125][126][127][128][129] where side channel communication was compromised to infer confidential input patterns such as PINs, passwords, or screen taps.

Discussion and Potential Future Directions
This section shows that the results obtained from TRENDMINER can be used to answer the research questions stated in Section 1. Figures 7 and 8 present the top journals and leading researchers in the Android security field. The top journals include Computers & Security, IEEE Transactions on Information Forensics and Security, Future Generation Computer Systems, Journal of Information Security and Applications, and Journal of Network and Computer Applications. Suarez-Tangil has made a major contribution to the research community, having devised a variety of antimalware techniques such as Alterdroid [130], Dendroid [131], and Droidsieve [132]. A fully automated malware identification mechanism with an appreciable accuracy of 82.93% has been devised by Wang et al. [133]. Enck et al., who proposed a project named Taintdroid [83], are among the leading researchers in this field. They developed an effective model for tracking sensitive information leakage in third-party applications. On top of this, many other dynamic analysis tools such as Andrubis [134] and Droidbox [135] were developed. They were also the first to perform on-device malware assessment, defining a set of rules to identify dangerous permissions before an application is installed, through the security service known as Kirin [136]. To detect kernel-level attacks, Yan and Yin presented a project named Droidscope [137], a unique method of dynamic analysis that keeps its process outside the emulator and was able to achieve promising results. Faruki et al. [138] proposed a methodology called Androsimilar, which generates signatures by extracting statistically robust features to identify malicious Android applications. The proposed method was effective against code obfuscation and repackaging techniques, which tend to propagate unseen variants of known malware by evading AV signatures.

RQ2: Are Those Frameworks Robust Enough to Determine the Most Investigated Research Areas?
The results of the analysis showed that Static Level Monitoring (T3.2) proved to be the most widely investigated topic in Android malware analysis and detection. The techniques used under Static Level Monitoring (T3.2) analyze the code without running the application on an Android emulator or device. The advantage of static analysis is that it has a low computation cost, is less tedious, and uses few resources. Figure 16 [...] Nine such trends showed a decline in time frame 2. The analysis in this work revealed that studies related to static level monitoring mainly focus on network addresses, data flow, control flow, string matching, permissions, dex files, context, and intents.
Static level monitoring emerged as an important technique for addressing various security concerns such as detecting private data leaks, detecting component hijacking or intent injection, building frameworks for intercomponent vulnerabilities and content provider-based vulnerabilities, identifying dangerous permissions used by malicious applications, addressing energy consumption concerns in Android applications, comparing Android applications for clone detection, automating testing by generating test cases, and checking the correctness of Android applications through code verification. On further investigation, it was found that various tools are available for static monitoring, such as Soot, Dex2jar, Dexdump, Dedexer, Ded, Dare, and WALA. Soot is the most adopted support tool for static monitoring, and Jimple is the most widely used intermediate representation (IR) format for the further analysis of Android applications. The trend line in Figure 16 illustrates that specific research trends are oriented towards sensitivities. Sensitivities maximize the precision and recall of static monitoring. The research trends Field Sensitivity (T27.19), Context Sensitivity (T27.22), and Flow Sensitivity (T27.24) are primarily taken into account by the Android research community. Other research trends, such as Path Sensitivity (T27.13) and Object Sensitivity (T27.3), have not gained much attention from researchers. The trend line also revealed that the trend Taint Analysis (T27.27), widely used in data tracking, emerged as the most applied technique in static monitoring.

Table 9 revealed that the trend "Program Slicing" (T27.25) gained momentum during 2015-2019. The trend "Program Slicing" (T27.25) refers to the technique of simplifying programs by focusing on selected aspects of their semantics. Slicing discards those parts of the program that could not have caused the malicious behavior and instead focuses attention on only those parts that may contain it.
This technique tends to reduce the set of program behaviors to be analyzed and hence became a trending topic during 2015-2019.
(l) The trend "Field Sensitivity" (T27.19) appears to be the most considered among all the sensitivities, as depicted in Table 9. This may be because Android apps are written in Java, an object-oriented language in which object fields are pervasively used to hold data. Research trends such as "Context Sensitivity" (T27.22) and "Flow Sensitivity" (T27.24) are also largely taken into account. The least considered sensitivities are "Path Sensitivity" (T27.13) and "Object Sensitivity" (T27.3), probably because of the scalability issues that they raise.
(m) The trend "Type and Model Checking-Based Analysis" (T27.5) showed a sudden fall during 2015-2019. When an Android application is developed for some task, it is common to define a certain set of properties that the application must satisfy. Model checking helps to ensure that the given system meets its specification or correctness properties. Type checking ensures that the given program is type-safe by keeping the possibility of type errors (e.g., applying integer operations on float numbers) to a minimum.

(a) Mapping of API usage with permissions to achieve more fine-grained results: API calls are used to communicate and transfer sensitive information over the network. Malware families such as Fakeinst, Opfake, and Smsreg make use of API calls such as sendSMS() and readSMS(), which implies that collected information may be sent by SMS. There is an urgent need to analyze in depth the API calling patterns and the permissions these APIs demand [139].
(b) Complications in static analysis: static analysis techniques are ineffective when applications are built using camouflage techniques [39][139][140][141][142][143]. Static analysis also leads to a large number of false positives [7,144].
(c) Evolution of intelligent malware: applications tend to use techniques such as rooting, antidebugging, code obfuscation, and kernel-level features to dodge the detection process [145,146]. Despite this, most of the approaches still rely on emulators. Limited efforts have been made to curtail remote triggering, which enhances the stealthiness of malware by allowing malware authors to trigger and execute malware whenever they want [147].
(d) Development of nonintuitive features for robust malware analysis and detection: static and dynamic features need to be explored further to better characterize the behavior of an application [146]. Attackers repackage a legitimate app to insert a malicious snippet and distribute it via stores [88].
(e) Need for automation in malware classification: semisupervised approaches to detect malicious applications [146,148] and faster detection and classification of malware families are required [141]. Also, the features and characteristics that can be used to assign malware to a particular family have been little discussed among the research communities [7].
(f) Hindered effectiveness of dynamic analysis: computation time and resource constraints are the major reasons for the hindered performance of dynamic analysis [7,39,140,143]. Ensuring that an application has triggered all its malicious behavior (all execution paths traversed) during dynamic analysis is a matter of concern [141,142,144].
(g) Limited availability of datasets: the limited availability of ransomware datasets and the lack of understanding of smart tactics limit the efficacy of detection mechanisms [149]. Generally, researchers download samples from VirusTotal [150].
(h) Low-precision prediction mechanisms: the biggest challenge faced by researchers is the high rate of false alarms in predicting ransomware.
Most of the present techniques produce a large number of false positive and false negative alarms, which affects the accuracy of detection mechanisms. There is a need for a cutting-edge methodology that produces fewer false alarms [149]. The study revealed that approaches to analyzing malware include static analysis, dynamic analysis, or a combination of both. Static analysis essentially focuses on disassembling the code, followed by manual analysis to look for malicious patterns in the code. On the other hand, dynamic analysis executes the code in a virtual environment and analyzes its execution trace to observe the malicious behavior of an application. Static analysis can follow all execution paths and therefore provides complete code coverage; however, it suffers from code obfuscation. The application must be decrypted first to perform static analysis, and issues of intractable complexity hinder the analysis. Dynamic analysis is more efficient and does not need the executable to be unpacked or decrypted. The suspicious application is monitored in a controlled environment. This process is time- and resource-consuming. It also raises scalability issues. Moreover, some malicious behavior may go unobserved because the environment does not satisfy the triggering conditions. Furthermore, malware authors use automation technology to generate a huge number of new malware variants, thereby posing a major challenge to malware analysts. The current state of the art demands the combination of existing primitive techniques with complementary methods to achieve a robust solution. The output of TRENDMINER suggests that complementary techniques should be used to support the classification of rapidly evolving Android malware families. Complementary methods can prove effective in determining unknown malicious behavior or security vulnerabilities.
Based on the body of knowledge gathered by this study, a plan for engineering a next-generation framework has been envisioned for the classification of Android malware families, as discussed in the next section.

Towards Engineering a Visualization-Based Solution
Malware is evolving rapidly because malware authors can change small parts of the original source code to produce new malware variants. A malware variant can be visualized as a grayscale image, and an image can capture even small changes. Thus, in the current work, a visualization framework is proposed to reduce the effect of obfuscation by transforming the malware's nonintuitive features into fingerprint images, followed by the classification of Android malware families. The proposed approach, known as the SWAYAM (Stop WAY for Android Malware) framework, is shown in Figure 17.

Module I.
This module deals with converting the malware samples into digital images. The malware binaries are first converted into 8-bit vectors and then converted into grayscale images. The overall structure of a grayscale image is composed of various sections. Each section has a fixed width, but its height varies according to the file size. In a nutshell, malware samples can be represented as images, and there is a strong tendency for malware variants from the same family to show similar structural and visual patterns [151]. On the other hand, malware samples from different families show dissimilar structural and visual patterns.
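A minimal sketch of the binary-to-image conversion described above is given below. The default width of 256 pixels, the zero padding of the last row, and the choice of the binary PGM format for output are assumptions made for illustration; the source only states that the width is fixed and the height depends on the file size.

```python
def bytes_to_grayscale(data, width=256):
    """Reshape a byte string into rows of `width` pixels, one byte (0-255) per pixel.

    The height follows from the file size; the last row is zero-padded
    (an assumption, since the padding policy is not specified).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row.extend([0] * (width - len(row)))
        rows.append(row)
    return rows


def to_pgm(rows):
    """Serialize the pixel rows as a binary PGM (P5) grayscale image."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + b"".join(bytes(row) for row in rows)


# Hypothetical 10-byte "binary" rendered at a width of 4 pixels.
sample = bytes(range(10))
image = bytes_to_grayscale(sample, width=4)
pgm = to_pgm(image)
print(image)
```

For a real sample, `data` would be the raw bytes of the file (for example, read with `open(path, "rb").read()`), and the returned PGM bytes could be written to disk and opened in any image viewer that supports the format.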

Module II.
Once the samples are converted into digital images, the next step is to extract features from the images. Features play a vital role in classifying malware samples into a particular family. Various image descriptors such as Global Image deScripTors (GIST), Gray Level Cooccurrence Matrix (GLCM), and Local Binary Pattern (LBP) are available to extract features from the images and thus form a feature vector. Texture patterns, intensity, color patterns, and frequencies in images constitute the image features of the samples. Euclidean distance or standard deviation can be used to measure the distance in feature space [152].
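As one illustration of the descriptors listed above, the sketch below computes a basic 3x3 Local Binary Pattern histogram and compares two histograms with Euclidean distance. It is a simplified variant written for this example (fixed 8-neighbor ring, no rotation invariance or uniform-pattern mapping, border pixels skipped); the function names and the tiny test images are assumptions, not the descriptor configuration of the proposed framework.

```python
import math


def lbp_histogram(image):
    """Normalized 256-bin histogram of basic 3x3 LBP codes.

    image: list of rows of grayscale values. Border pixels are skipped.
    A neighbor contributes a 1 bit when its value is >= the center value.
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    height, width = len(image), len(image[0])
    histogram = [0.0] * 256
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            center = image[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if image[y + dy][x + dx] >= center:
                    code |= 1 << bit
            histogram[code] += 1.0
    total = sum(histogram)
    if total == 0:
        return histogram  # image too small to contain an interior pixel
    return [count / total for count in histogram]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two hypothetical 3x3 images: a flat patch and a single bright peak.
flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
peak = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]

features_flat = lbp_histogram(flat)
features_peak = lbp_histogram(peak)
print(euclidean(features_flat, features_peak))
```

The resulting 256-dimensional histogram can serve as the feature vector of a sample, and the distance between two histograms gives a rough measure of how different the textures of two malware images are.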

Module III.
Further, machine learning algorithms or neural networks are applied to the feature vectors to identify the family of a sample. For instance, in the KNN approach, a sample is assigned to family f1 if the majority of its k nearest neighbors belong to family f1. It is to be noted that many solutions leveraging machine learning and big data techniques are appearing for developing malware detection models [153][154][155]. Computer vision techniques have become popular among the research communities for detecting and classifying malware applications [156,157].

[...] security. It depends upon certain factors, for example, the type of queries and sources used while preparing the literature dataset. To discover the appropriate publications, the articles were selected using "malware" OR "vulnerability" OR "security" OR "privacy" OR "monitoring" OR "application" OR "smartphone" OR "android" OR "virus" OR "static" OR "dynamic" OR "detection" OR "data flow" as search keywords. Prominent databases that were left out during the automated search were also browsed to find influential publications in the area. Relevant papers were filtered by applying inclusion and exclusion criteria to the search results to match the purpose of the current study. Nonetheless, it is possible that a few significant publications were missed during the process.

Limitation of the Study
TRENDMINER is backed by the strengths of the Latent Semantic Analysis (LSA) technique. LSA, being an unsupervised way of uncovering synonyms, improves on the vector space model. However, the number of topic solutions cannot be decided statistically. To alleviate this situation, the optimal number of topic solutions was decided after intensive discussions with an expert. Moreover, the process of topic labeling was purely based on human judgment, which may lead to subjective bias as well.
There might be limitations related to the generalization of the results. A stepwise procedure was followed to infer the core research areas and research trends. The procedure included literature collection, preprocessing of the dataset, generation of the TF-IDF matrix, truncated SVD, and topic labeling. Every step in the algorithm tends to influence the results. For instance, the outcomes will be influenced if the dataset used in this study is changed to consist of only titles or of full-length articles.
Once an LSA representation of a set of documents has been computed, a new document cannot simply be added to the collection. A new document can only be added incrementally, which fails to capture the elements introduced by the new document. Hence, the performance of LSA degrades as new documents are added, requiring recomputation.

Conclusion
One of the key motivations of this work was that conventional manual literature reviews are often unable to exploit large bodies of literature because of human limitations in time and cognition. Hence, this study proposed another literature review method to deal with this challenge. This study presented a framework called the SEAR framework, which can perform qualitative and quantitative analysis over large bodies of literature. It is a flexible and scalable framework for data-driven analysis that conceptualizes the evolution of trending research dimensions in any field of literature. The SEAR framework uses a combination of information modeling techniques, i.e., LSA followed by the K-means clustering algorithm, which enables connections and groupings to be recognized that are usually missed by manual techniques relying on human interpretation. Machine learning techniques have greatly reduced the manual effort needed to assign a document to its closest topic.
TRENDMINER is designed as the use case of the SEAR framework. To demonstrate the utility and use of TRENDMINER, a wide body of literature on the Android security field was used as the case study. The framework takes as input 444 abstracts of research articles published during the period 2010-2019. This study identifies three core research areas and twenty-seven research trends as outcomes. Results demonstrated that specific research trends have stayed consistent over the examined time frame. A taxonomy and future research directions in the field of Android security have been provided in this study. Time trend plots for each factor solution have been discussed. Some research trends have grown while a few have declined. TRENDMINER extends its utility and contribution by proposing potential future research directions in a way that mitigates human biases. This study also answers the research questions framed with respect to the technique employed and the dataset chosen. This paper additionally offers general suggestions to help new researchers understand the nature of Android security research and assess their areas of interest for potential research along with the related research trends.
This study also establishes an objective and empirical foundation for future directions regarding the structure and analytical decomposition of Android security research. The particular research areas and trends uncovered in this work can inform future research dimensions, which can be used by research scientists and industry. Furthermore, researchers can pick one or more research areas and conduct another study with the same or a different approach. Nonetheless, other statistical factor analysis techniques can be applied to this research. For future work, researchers can apply a similar technique to a different comparable dataset to examine the affinity and diversity of core research areas and trends within related articles. To increase the application areas of this research, the SEAR framework can be enhanced by building a dynamic query system on the same or a different corpus by applying deep learning models.

Practical Implications and Future Research Directions
This manuscript presents a panoramic view of the Android security field. The study has certain interesting practical implications. First, the research areas and trends uncovered in this work can inform future research dimensions, which can be used by new research scientists and industry. The analysis obtained from the study can help them understand the diversity and depth of the Android security field. Second, universities can enhance their teaching content and students' motivation by revising the curriculum to focus more on research activities related to the Android security field.
Third, perspectives drawn from the research will help the editors of esteemed journals to plan special sessions on Android malware research topics such as static analysis of Android applications, security and privacy for IoT and multimedia devices, application-focused threats, new frontiers in Android malware analysis and detection, cryptojacking, component-based Android malware analysis, deep learning for Android malware classification, deep learning for digital forensics, and cybersecurity. Avenues for future research are discussed as follows.

Ranking Permissions for Android Malware Analysis and Detection.
Using too many features for Android malware analysis and detection is a cumbersome task. Permissions, a special feature of the Android ecosystem, are declared in the AndroidManifest.xml file of the Android file structure. Permissions are needed to perform application-sensitive operations. They are embedded in the AndroidManifest.xml file in the form of text. They play a vital role in detecting a suspicious application running on an Android device. Some permissions which malware authors use to exploit sensitive information on the device are access_coarse_location, access_fine_location, access_network_state, access_wifi_state, battery_stats, answer_phone_calls, bind_carrier_messaging_service, read_contacts, read_call_log, read_phone_state, read_external_storage, read_sms, record_audio, request_install_packages, read_calendar, bluetooth_privileged, read_history_bookmarks, and many more. The most important permissions in the malicious dataset can be identified using a technique called Term Frequency Inverse Document Frequency (TF-IDF), which can later lead to the discovery of malicious applications. It would help in maintaining an application-permission matrix that describes the frequency of permissions that occur in the collection of malicious applications. TF-IDF assigns a weight to each permission and calculates the sensitivity value of each application by utilizing its weighting formula as discussed in this work. Furthermore, machine learning algorithms may be deployed to perform the detection or classification of Android malware applications.
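The permission-ranking idea above can be sketched as follows, treating each application as a document and each declared permission as a term. The weighting used here is the common textbook form, tf multiplied by log(N/df), which is an assumption and may differ from the weighting formula used elsewhere in this work; the application names, permission lists, and function names are hypothetical.

```python
import math
from collections import Counter


def permission_tfidf(apps):
    """Compute TF-IDF weights per application.

    apps: dict mapping an application name to its list of declared permissions.
    Returns a dict: application name -> {permission: weight}.
    """
    n_apps = len(apps)
    document_frequency = Counter()
    for permissions in apps.values():
        document_frequency.update(set(permissions))

    weights = {}
    for name, permissions in apps.items():
        counts = Counter(permissions)
        total = len(permissions)
        weights[name] = {
            perm: (count / total) * math.log(n_apps / document_frequency[perm])
            for perm, count in counts.items()
        }
    return weights


def sensitivity_scores(weights):
    """Sum the permission weights of each application into a single score."""
    return {name: sum(perm_weights.values()) for name, perm_weights in weights.items()}


# Hypothetical application-permission data.
apps = {
    "app_a": ["read_sms", "access_network_state"],
    "app_b": ["access_network_state"],
    "app_c": ["read_sms", "record_audio", "access_network_state"],
}
weights = permission_tfidf(apps)
scores = sensitivity_scores(weights)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With this weighting, a permission requested by every application (here `access_network_state`) receives a weight of zero, while rarer permissions dominate the score. Whether that behavior is desirable depends on how the corpus is assembled, for example malicious samples only versus a mix of malicious and benign applications.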

Crowdsourced User Reviews at Application Stores.
A suspicious application can also be identified by evaluating the user reviews at the application stores. User feedback is vital, as users tend to write reviews about a particular application based on their real-time usage and experience. Security firms cannot ignore the reviews, whether they are positive or negative. User reviews address various aspects of an application such as functionality, UI (user interface)/design, battery consumption, and security issues. Furthermore, the security issues in an application are broadly classified into four categories: malware code injected into the application for monetary benefit, spamming, information leakage, and use of overprivileged permissions in the application. Latent Semantic Analysis can be applied to crowdsourced user reviews to discover security-related issues of the application. As an initial step, relevant reviews can be filtered out from the noisy crowdsourced reviews by applying the preprocessing techniques employed in this manuscript. The relevant terms in the reviews may then be mapped to the Android API documentation to form clusters based on the components addressed in the review.
Consider the following user review for a cricket game application: "Whenever I open this CRC League application, it automatically clicks my photograph and also deducts one dollar from my account. I also received the message that says Thank you for subscribing to IOIO service." After reading this review, one would undoubtedly think that this is a malicious application. There may be hundreds of reviews related to this context. The data-driven analysis here can understand the text structure, the words, and the topic discussed in the review. This review indicates that the application accesses the camera, sends SMS messages, and deducts money from the user's account. One would not expect a cricket game to perform these types of sensitive operations. This scenario depicts only the security issues of an application. Therefore, the semantics of the review can be discovered using LSA to flag such applications as suspicious.

Preserving the Proprietary Rights of the Android Developers.
Repackaging is an open issue in the Android malware detection and analysis field. Using this technique, malware authors first download a legitimate application from the application stores and then extract all files and folders of the application. After the extraction process, they inject malicious code into the application and upload it to other application stores. They also entice users to download the malicious application through social engineering. Innocent users who are not aware of this fact get trapped and download the malicious version of the legitimate application. In this way, the malware penetrates the phone and the device gets compromised. Repackaging thus opens further avenues for malware authors to generate malicious clones or plagiarized versions of legitimate applications. In a nutshell, the proprietary rights of developers are widely exploited and abused by malware authors to create cloned Android malware variants of legitimate applications. Furthermore, they also deploy evasion techniques to dodge the detection process. In this scenario, LSA can be used to infer the semantics from the corpus of source code files. The degree of similarity can be measured by comparing the code segments of the source code files.
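As a rough illustration of measuring the degree of similarity between code segments, the sketch below compares two snippets by the Jaccard similarity of their token 3-gram sets. This is a deliberately simple stand-in and not LSA itself: the tokenizer, the shingle size, the function names, and the example snippets are all assumptions made for the example.

```python
import re


def tokenize(code):
    """Split source code into identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def shingles(tokens, k=3):
    """Return the set of overlapping k-token windows."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def code_similarity(first, second, k=3):
    return jaccard(shingles(tokenize(first), k), shingles(tokenize(second), k))


# Hypothetical code segments: an original, a repackaged variant with an
# injected call, and an unrelated snippet.
original = "int total = a + b; return total;"
repackaged = "int total = a + b; sendSms(total); return total;"
unrelated = "while (true) { x++; }"

print(code_similarity(original, repackaged))
print(code_similarity(original, unrelated))
```

A repackaged variant that keeps most of the original code yields a high score, while unrelated code yields a score near zero. An LSA-based comparison, as suggested above, would additionally be able to relate segments that use different but semantically associated tokens.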

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.