Using Latent Dirichlet Allocation and Text Mining Techniques for Understanding Medical Literature

Over the past few years, numerous studies and research articles have been published in the medical literature review domain. The topics covered by these studies include medical information retrieval, disease statistics, drug analysis, and many other fields and application domains. In this paper, we employ various text mining and data analysis techniques to discover trending topics and topic concordance in the healthcare literature and data mining field. The analysis applies bibliometric methods and word association rules to 1945 research articles published between 2006 and 2019. Our aim is to help save the time and effort required to manually summarize large amounts of information in such a broad and multidisciplinary domain. To this end, we employ topic modeling through Latent Dirichlet Allocation (LDA), in addition to various document and word embedding and clustering approaches. The findings reveal that interest in healthcare big data analysis has increased significantly since 2010, as demonstrated by the five most commonly used topics in this domain.


I. INTRODUCTION
Over the past few years, research data in the medical domain has increased significantly [1][2][3]. Handling such an enormous amount of data is one of the crucial challenges for various stakeholders in this domain. For instance, extracting appropriate knowledge to support the decisions of medical organizations is one of the most important factors that can assist in analyzing past, current and future risks [4]. Given the exponential growth of the published literature in the medical domain, the challenge of identifying and extracting important and meaningful medical data and associated literature-based trends becomes even more pronounced. This challenge is amplified by the fact that 80% of online data is in an unstructured format, which makes it difficult for computers to process and analyze [5,6].
Therefore, the need for new techniques that can seamlessly make medical literature information more accessible and useful is more critical than ever. Recently, more studies have started to focus on medical bibliometric and literature analysis for a better understanding of the research trends in this domain. Also, with the emergence of text mining tools and techniques, the number of studies that use text mining applications has grown tremendously [7]. As reported in [8], such tools and methods have advanced information and knowledge extraction across multiple application domains. We therefore adopt text mining as a potential solution to the problem of deriving important knowledge from historical medical data. Since this domain is interdisciplinary and attracts researchers from information retrieval, machine learning, statistics, computational linguistics and especially data mining, as reported in [9] and [10], the purpose of this study is to explore several text mining methods to better understand and analyze earlier research in this domain. Accordingly, our study presents a literature and bibliometric review-based analysis, combined with medical word embedding and several other techniques, applied to a dataset of 1945 research articles published between 2006 and 2019. The first part of the analysis is quantitative: we present an overview of the distribution of medical big data analysis papers over time, keyword frequency by year, merging keyword frequencies, and topic detection by keyword frequency statistics.
In the second part, we apply clustering and Latent Dirichlet Allocation (LDA) techniques to the abstracts of the articles to detect topics and to identify the most important research areas that discuss medical data analysis. By conducting this research, our main aim is to improve the understanding of important topics and trends in the medical domain, and to summarize the detected topics in order to identify their main characteristics across similar references.
The rest of this paper is organized as follows. First, we discuss state-of-the-art medical literature-based analysis approaches that have been proposed in this domain. In addition, we present our methodology for medical-related bibliometric and literature data analysis. Second, we introduce our analysis of medical research articles and highlight the main findings. Then, we discuss the results and draw conclusions. Finally, we present possible extensions of the current work.

II. RELATED WORK
Text mining is a knowledge-intensive process in which a user analyzes a collection of documents by applying a suite of analysis tools [11]. Conventionally, this process is applied to data sources in unstructured formats, from which useful information can be extracted through the identification and exploration of interesting data patterns [11,12]. As reported by Dang et al., five basic text mining steps are normally followed [5]:
a) Collecting information from unstructured data.
b) Converting this information into structured formats.
c) Identifying patterns from structured data.
d) Analyzing the patterns.
e) Extracting valuable information and storing it in the database.
In this context, text mining techniques transform the unstructured content of a collection of documents into a more manageable and understandable format that can be further processed and analyzed [11,13]. Text mining can discover hidden knowledge from large volumes of data across various areas such as bioinformatics, marketing and business, to name a few [14][15][16]. Text mining approaches rely on algorithmic and heuristic techniques to analyze distributions, frequencies and associations between the terms of the document collection, in an attempt to answer questions such as: What is the general trend of the topic over a period of time? What are the most frequent words? Are these maintained over time? How are the words related? What could these relationships mean?
Different analysis techniques are used to compare the frequencies of concepts across documents, including the discovery of ephemeral associations and the detection of deviations [11].
Over the past years, literature and bibliometric analysis has been conducted by reviewing the literature manually. The articles were collected, classified, and reviewed to identify important information that can be further used by various stakeholders of the domain of interest. In this context, a brief reading of the literature needs to be carried out, giving more attention to the abstracts of articles, as they provide summaries from which readers can decide their relevance to the domain under investigation [17]. For instance, DiMatteo conducted a quantitative review of 50 years of medical research on patient adherence to treatment. In this article, the author explained that an electronic database was used to search the literature by keywords, while the citations and abstracts were examined by the author and research assistants [18,19].
Similarly, the authors of [20,21] carried out a systematic literature review of factors affecting outcomes in older medical patients admitted to hospital. These studies report that the analysis was conducted using statistical methods, for which an independent statistician was consulted. Carrying out a literature analysis or a bibliometric study with manual methods has now become almost impossible due to the ever-increasing amount of published literature in the domain. For example, reviewing and analyzing a dataset of 1945 research articles in the medical domain would require a very long period of manual exploration and revision, and would call for many specialists. As such, given the exponential growth of literature published in the medical domain, exploiting existing data mining tools to carry out the review task has become indispensable. Nowadays, many researchers are turning to text mining tools for bibliometric analysis. In 2015, Hsu et al. studied digital archives, applying text mining techniques for co-word analysis and clustering to detect the most discussed topics [22]. Bach et al. carried out a qualitative analysis of literature in the financial sector using a systematic literature review, citation and co-citation analysis, and text mining [23]. In a similar line of research, Amado et al. presented a research literature analysis based on a semi-automated text mining approach, with the goal of identifying the main trends in marketing using LDA [4]. Similarly, Moro et al. used LDA to analyze recent literature in search of trends in business intelligence applications for the banking industry [24]. In 2018, Liao et al. developed a bibliometric analysis of medical big data research in which they analyzed the types of documents and their frequencies, as well as the frequencies of keywords, the most mentioned journals, and the relationships between the extracted keywords [1].
VOLUME 20(4), 2021

III. MATHEMATICAL FORMULATION
Latent Dirichlet allocation (LDA) is a widely used generative probabilistic model for identifying topics in collections of discrete data, among which textual data are the most common. It is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities, as described in [25]. In this context, the topic probabilities are employed to form an explicit representation of a document. Accordingly, each topic is represented by a set of words to which all the documents are mapped, under the hypothesis that words that frequently appear together are likely to be close in meaning.
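For reference, and assuming that [25] follows the standard formulation of LDA, the joint distribution of a topic mixture, the topic assignments and the words of a single document is commonly written as shown below. The symbols are introduced here for illustration only and are not taken from the text above: $\alpha$ and $\beta$ are the Dirichlet and topic-word parameters, $\theta$ is the per-document topic mixture, $z_n$ is the topic assigned to the $n$-th word $w_n$, and $N$ is the number of words in the document.

```latex
\[
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
\]
```

Under this reading, the "explicit representation of a document" mentioned above corresponds to the inferred mixture $\theta$ for that document.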

IV. DATA ANALYSIS AND INFORMATION EXTRACTION
In this study, we analyze and visualize medical literature data using RapidMiner, one of the most popular data mining tools, which provides various operators for the analysis of large amounts of data. The primary Natural Language Processing (NLP) steps of the text mining procedure are: Tokenize, Transform Cases, Filter Stopwords, Generate n-Grams and Filter Tokens. These steps are depicted in Fig. 1.

Figure 1. Five basic NLP operators
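As a minimal sketch of the five operators named above, the pipeline could be approximated in plain Python as follows. This is not the RapidMiner implementation: the function name `preprocess`, the tiny stopword list and the underscore used to join n-grams are assumptions made for illustration, and the default limits (n-grams of at most three terms, words of at least three letters) mirror the settings reported in Section IV.A.

```python
import re

# Tiny illustrative stopword list (assumption; RapidMiner ships its own list).
STOPWORDS = {"the", "of", "and", "in", "for", "a", "an", "to", "is", "on", "with"}


def preprocess(text, max_n=3, min_len=3):
    """Approximate the five operators: Tokenize, Transform Cases,
    Filter Stopwords, Generate n-Grams, Filter Tokens (by length)."""
    # 1. Tokenize: keep runs of letters only.
    tokens = re.findall(r"[A-Za-z]+", text)
    # 2. Transform Cases: lowercase everything.
    tokens = [t.lower() for t in tokens]
    # 3. Filter Stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Generate n-Grams of 1..max_n consecutive tokens.
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append("_".join(tokens[i:i + n]))
    # 5. Filter Tokens: keep n-grams whose words all have at least min_len letters.
    return [g for g in ngrams if all(len(w) >= min_len for w in g.split("_"))]


terms = preprocess("Big Data Analytics in Healthcare")
```

Because stopwords are removed before n-grams are generated, an n-gram may bridge a removed stopword (here, "analytics_healthcare"); this follows from the operator order listed above.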
We used the VOSviewer tool to create the network and density maps in order to visualize the main topics and the co-occurrence of words.
First, we analyzed the frequency of published papers by year to track the growth of the text mining literature in the medical domain, as shown in Figure 2. We can see increasing growth in research publications, mainly since 2012. For the frequency analysis of keywords, we extracted the keywords per year to observe the trend in the topics discussed. In this analysis, we used the Process Document operator of RapidMiner, with which morphological analysis steps such as tokenization, conversion to lowercase and stop-word removal were applied. Figure 3 shows the most frequent terms over recent years.
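The two frequency analyses described above (papers per year and keywords per year) amount to simple counting. The sketch below illustrates the idea in Python; the `records` list is hypothetical toy data, not taken from the studied dataset, and the function names are assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (publication year, author keywords).
records = [
    (2012, ["data analysis", "healthcare"]),
    (2013, ["big data", "healthcare"]),
    (2013, ["big data", "data mining"]),
    (2017, ["machine learning", "big data"]),
]


def papers_per_year(records):
    """Number of papers published in each year."""
    return Counter(year for year, _ in records)


def keywords_per_year(records):
    """For each year, a Counter of lowercased keyword frequencies."""
    freq = defaultdict(Counter)
    for year, keywords in records:
        freq[year].update(k.lower() for k in keywords)
    return freq


yearly_counts = papers_per_year(records)
yearly_keywords = keywords_per_year(records)
```

Calling `yearly_keywords[year].most_common(n)` then yields the most frequent keywords for a given year, which is the kind of information summarized in Figure 3.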

A. DETECTING TOPICS BY MERGING TOKENS AMONG KEYWORD SECTIONS
The RapidMiner "Process Document" operator was used for the analysis during this phase. The objective of this phase is to detect the merging words in order to identify the topics of the references by year. The number of related terms (n-grams) was configured not to exceed three, and the process was also configured to keep only words (tokens) of at least three letters.
After configuring the settings, we executed the process on several datasets, one for each year. Our goal in this context is to obtain the occurrence of the merging keywords. That is, to define the topic of each dataset (one per year), we take the most frequent words into account because these are the ones that represent the dataset. For instance, Table 1 shows the merging keywords that were detected as the most frequent during 2019. From the set of extracted merging keywords, it is possible to conclude that the topics were mainly about Healthcare, Big Data Analytics and Machine Learning. As shown in Table 1, the topics for this year (2019) are medical analysis, medical data and machine learning. The same procedure was applied to each year from 2006 to 2019, yielding the topics listed in Table 2.

Clustering by abstract similarities
To analyze the 1929 articles, complete abstract references were selected. The K-Medoids algorithm was applied to the references, grouping them by similarity to detect the most discussed topics. This algorithm selects k data items randomly as initial medoids to represent the k clusters, and each remaining item is included in the cluster whose medoid is closest to it. K-Medoids applies the mathematical equations given in [26], where $d_{ij}$ is the distance between object i and object j. When the pre-determined number of clusters is k, the k objects with the smallest values in the resulting ordering are selected as the initial medoids. To visualize the clusters, VOSviewer was used. Fig. 4 shows the topics that were most frequently discussed in the cluster with the largest number of papers. As depicted in Fig. 4, the most relevant topics were: cloud, prediction, internet, cost, etc.
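Since the equations from [26] are not reproduced here, the following Python sketch is only one plausible reading of the procedure: it assumes that each object j receives a score v_j (the sum over i of d_ij divided by the total distance from object i to all objects), that the k objects with the smallest scores become the initial medoids, and that assignment and medoid update then alternate until the medoids stop changing. The score v_j, the function name `k_medoids` and the one-dimensional toy data are assumptions for illustration; the study itself clusters abstracts in RapidMiner.

```python
def k_medoids(points, k, dist, max_iter=100):
    """K-Medoids sketch with a score-based (non-random) initialisation."""
    n = len(points)
    # Pairwise distance matrix d[i][j].
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    row_sums = [sum(row) for row in d]
    # Assumed score v[j]: distances to j, each normalised by row i's total.
    v = [sum(d[i][j] / row_sums[i] for i in range(n) if row_sums[i] > 0)
         for j in range(n)]
    # The k objects with the smallest scores are the initial medoids.
    medoids = sorted(range(n), key=lambda j: v[j])[:k]

    for _ in range(max_iter):
        # Assignment step: each object joins the cluster of its nearest medoid.
        labels = [min(medoids, key=lambda m: d[i][m]) for i in range(n)]
        # Update step: the new medoid minimises total distance within its cluster.
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if labels[i] == m]
            if not members:  # guard against an empty cluster (duplicate points)
                new_medoids.append(m)
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(d[c][i] for i in members)))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids

    labels = [min(medoids, key=lambda m: d[i][m]) for i in range(n)]
    return medoids, labels


# Toy one-dimensional data with two obvious groups (hypothetical values).
toy = [1.0, 1.2, 0.8, 10.0, 10.3, 9.9]
medoids, labels = k_medoids(toy, k=2, dist=lambda a, b: abs(a - b))
```

For document clustering, `points` would be vector representations of the abstracts and `dist` a suitable dissimilarity such as cosine distance; both choices are outside what the text above specifies.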
Figure 4. Most frequent words
Fig. 5 shows the network graph that exposes the co-occurrence of words. References were clustered into four groups, where each colour represents a cluster. The size of each bubble indicates how relevant the corresponding topic is. Some of the most pertinent terms are prediction, cost, accuracy, cloud, internet and treatment.
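The raw input to such a co-occurrence network is a count of how often two terms appear in the same document. The sketch below shows that counting step only; VOSviewer applies its own normalization and layout, which are not reproduced here. The function name and the toy term lists are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count, for each unordered pair of terms, the documents containing both."""
    pairs = Counter()
    for terms in docs:
        # Use the set of terms so a pair is counted once per document.
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical term lists, one per abstract.
docs = [
    ["cloud", "prediction", "cost"],
    ["prediction", "cost"],
    ["cloud", "internet"],
]
pair_counts = cooccurrence(docs)
```

Each pair with a non-zero count corresponds to an edge in the network graph, and the count can serve as the edge weight.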

Topics Detected by Clustering
The analysis applied during this phase is mainly concerned with the use of clustering techniques to group the references by term similarities. To do this, we applied the RapidMiner operators that are depicted in Figure 3. Based on the results obtained during this phase, it is possible to identify the topic of each cluster for a collection of 142 references grouped by similarities among their abstracts, as shown in Table 3. After analyzing the dataset and identifying topic clusters, it is possible to conclude that the focus has been on Data, Patients and Healthcare topics. This is indeed what the results in Table 4 confirm.

Merging Words        Frequency
Health care          36
Health analytics     34
Healthcare costs     26
Data analytics       23
Big data             20
Data analysis        17

Using Latent Dirichlet Allocation for Topic Modeling
The Latent Dirichlet Allocation algorithm allows for topic modeling through the inference of the latent structure behind a collection of documents. The main goal in this context is to deliver a "thematic summary" of a set of documents, which makes it possible to answer the question: What themes are these documents discussing? [27]. As shown in Figure 6, we used the LDA technique as offered by RapidMiner and carried out several tests on the dataset. Figure 6 also shows the configuration that we set for the LDA algorithm, which we initialized with two topics and 50 words per topic. Table 5 shows the results obtained from this process.
After analyzing the words of each topic in Table 5, it is possible to answer the question: What themes are these documents discussing? We can say that Topic 1, in general, is about Systems Management and Data Analytics, while Topic 2 is about Big Data, Data Analysis and Healthcare.
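To make the idea of inferring topics from word co-occurrence concrete, the sketch below implements LDA with collapsed Gibbs sampling in plain Python. It is an illustration only and not the RapidMiner operator used in this study: the inference method, the hyperparameter values `alpha` and `beta`, the function names and the toy documents are all assumptions. The default of two topics mirrors the configuration described above.

```python
import random


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]           # words per (doc, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                         # words per topic

    def update(d, w, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Random initial topic assignment for every word occurrence.
    assign = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            update(d, w, z, +1)
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample the topic.
                update(d, w, assign[d][i], -1)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + len(vocab) * beta)
                    for t in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                update(d, w, z, +1)
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """The n most frequently assigned words of each topic."""
    return [sorted(tw, key=tw.get, reverse=True)[:n] for tw in topic_word]


# Hypothetical tokenised abstracts.
docs = [
    ["data", "health", "data"],
    ["cloud", "cost"],
    ["health", "patient", "data", "cost"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, iters=20, seed=1)
summary = top_words(topic_word, n=2)
```

Reading off the top words of each topic, as `top_words` does, is the step that supports a manual interpretation such as the one given for Topic 1 and Topic 2 above.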

V. DISCUSSION AND ANALYSIS
With the constant growth in the number of publications in the medical domain, it has become more challenging to manually carry out a comprehensive literature review covering a wide range of articles over a long period. That is why we must resort to new technologies that allow the automatic treatment of the large volume of articles published in any domain, including the medical domain that we address in this work.
Recent text mining approaches incorporate various techniques that can assist in addressing large-scale and heterogeneous data analysis. This study showed that text mining techniques can be successfully applied to better understand the trending topics that researchers covered over the period from 2006 to 2019. It demonstrated that the most frequent keywords, the relationships between keywords and the topics of thousands of publications in the medical literature can be automatically processed and analyzed with high precision. Although several text mining investigations already exist, few are associated with the medical literature and healthcare domain. Accordingly, this research contributes to the knowledge of text mining techniques that can be used as part of automated literature data analysis, specifically in the area of healthcare.
Regarding the analysis, the findings demonstrated that the text mining process applied in RapidMiner makes it possible to determine keyword frequencies and to perform topic modeling and analysis over time. For example, it was possible to find out that the term Big Data has been used in the Healthcare domain since 2013. Additional operators and configurations of the employed techniques made it possible to better explore and identify term co-occurrence statistics and to automatically detect the keywords most frequently used across a huge number of publications.
The analysis based on clustering techniques makes it possible to create several groups according to data similarities. In this investigation, it was possible to observe that this technique adapts well to the study area. Visualizations made in VOSviewer verified that particular characteristics define each group, which helped in explaining the topic of each group. The LDA technique, on the other hand, helped to detect the topics related to the studied articles, and these could be confirmed by comparison with the other techniques applied. Accordingly, this study showed the basic steps that must be followed in RapidMiner to start a text analysis.
From the present study, it is possible to carry out further research tasks such as:
▪ Semantically-enhanced text mining roles and challenges in the analysis of medical literature.
▪ Comparative analysis of existing text mining techniques that have been explored for medical literature related data.

Figure 2. Published Papers in the Medical Domain by Year

Figure 3. Medical Keyword Frequencies
Since 2010, researchers have been addressing challenges related to data analysis in the Healthcare domain. In 2013, the term Big Data was increasingly used, and in 2017 researchers started to pay greater attention to the utilization of machine learning techniques. So, we can clearly notice that the