Application of text mining in the biomedical domain
Introduction
The scientific literature provides a wealth of information to researchers. It may serve as a starting point for assessing the state of the art in a particular field, or as a source of information for building research hypotheses that can subsequently be validated experimentally. Additionally, this knowledge base may serve as a source for the interpretation of experimental results.
A large number of bibliographic databases are available in the life sciences domain; they have been reviewed by Masic and Milinovic [1]. One of the most important entry points to the scientific literature for biomedical research is PubMed, which gives access to more than 24 million citations from MEDLINE, life science journals, and online books [2].
The number of articles added to the literature databases is growing fast. Fig. 1 shows the results of a PubMed search using terms that describe diseases, drugs and model organisms. In all cases, the number of papers published on these subjects has increased exponentially. In addition to the exponential growth of the literature databases, the rate at which experimental data are produced has increased as well. For example, in high-throughput gene expression profiling or proteomics experiments, the regulation of hundreds or thousands of genes and proteins is measured under multiple experimental conditions.
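A yearly publication count of the kind described above could be obtained along the following lines. This is a minimal sketch, not the procedure used for Fig. 1: it assumes the public NCBI E-utilities ESearch endpoint, and the function names (`build_count_url`, `fetch_count`, `growth_rate`) and the search term in the usage note are illustrative choices that do not come from the article. The growth rate is estimated by a simple least-squares fit of the logarithm of the count against the year.

```python
import json
import math
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities ESearch endpoint (public service; usage limits apply).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(term, year):
    """Build an ESearch URL asking only for the number of PubMed hits
    for `term` with a publication date in the given year."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",    # filter on publication date
        "mindate": str(year),
        "maxdate": str(year),
        "rettype": "count",    # return the hit count, not the ID list
        "retmode": "json",
    }
    return ESEARCH + "?" + urlencode(params)


def fetch_count(term, year):
    """Query PubMed over the network and return the hit count as an int."""
    with urlopen(build_count_url(term, year)) as resp:
        payload = json.load(resp)
    return int(payload["esearchresult"]["count"])


def growth_rate(counts_by_year):
    """Least-squares slope of ln(count) against year.

    For purely exponential growth, count ~ exp(r * year), the slope is r.
    Years with a zero count are skipped because ln(0) is undefined.
    """
    points = [(y, math.log(c)) for y, c in sorted(counts_by_year.items()) if c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

A possible use would be `growth_rate({year: fetch_count("zebrafish", year) for year in range(1990, 2015)})`, where the search term is only an example. Requests should be spaced out to stay within the NCBI usage limits.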
Retrieving relevant information from literature databases and combining it with experimental output requires careful selection of keywords and drafting of queries. This is often a biased and time-consuming process that yields incomplete search results and prevents researchers from realizing the full potential of these databases [3].
Automated processing and analysis of text, referred to as text mining (TM), can assist researchers in evaluating the scientific literature. Nowadays TM is applied to answer many different research questions, ranging from the discovery of drug targets and biomarkers from high-throughput experiments [4], [5], [6], [7], [8], [9] to drug repositioning, the creation of a state-of-the-art overview of a certain disease or therapeutic area, and the creation of domain-specific databases [10], [11], [12], [13], [14], [15].
Due to the heterogeneous nature of written resources, the automated extraction of relevant biological knowledge is not trivial. As a consequence, TM has evolved into a sophisticated and specialized field in the biomedical sciences, in which text processing and machine learning techniques are combined with the mining of biological pathways and gene expression databases.
A number of reviews of TM in the biomedical domain exist [16], [17], [18], [19]. They often emphasize the technical aspects of TM and the available tools, or focus on gene- and protein-oriented information, and pay less attention to applications and real-life research questions that go beyond gene and protein research.
Here we give a state-of-the-art overview of the use of TM in the biomedical domain and in drug discovery. First, we give a general description of TM, the different steps involved and the types of techniques that are used, and we describe some publicly available TM systems. Subsequently, we discuss a number of examples in which TM approaches have been applied to solve actual research questions. Finally, we present an outlook in which we highlight the opportunities that TM can offer in the near future and the challenges that need to be addressed.
Text mining
A widely accepted definition of text mining has been provided by Marti Hearst: “the discovery by computer of new, previously unknown information, by automatically extracting and relating information from different written resources, to reveal otherwise ‘hidden’ meanings” [20]. New hypotheses or facts that result from TM can subsequently be validated by experiments.
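The idea of relating information from different written resources can be illustrated with a toy sketch. It is not one of the systems discussed in this review. The three "documents" and the term dictionary are invented placeholders, term recognition is reduced to plain substring matching, and all function names are hypothetical. The sketch records which terms co-occur in a document and then lists pairs of terms that never appear together but share an intermediate term, which is one simple way in which a candidate hypothesis could surface.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical mini-corpus and term dictionary, invented for illustration only.
DOCS = {
    "doc1": "Compound A inhibits protein B in cultured cells.",
    "doc2": "Protein B is overexpressed in disease C.",
    "doc3": "Disease C is frequent in older patients.",
}
TERMS = ["compound a", "protein b", "disease c"]


def find_terms(text, terms):
    """Return the dictionary terms found in the text (case-insensitive
    substring match; real systems use far more robust entity recognition)."""
    lowered = text.lower()
    return {t for t in terms if t in lowered}


def cooccurrences(docs, terms):
    """Map each alphabetically ordered term pair to the set of document
    IDs in which both terms occur."""
    pairs = defaultdict(set)
    for doc_id, text in docs.items():
        for a, b in combinations(sorted(find_terms(text, terms)), 2):
            pairs[(a, b)].add(doc_id)
    return pairs


def indirect_links(pairs):
    """Return triples (a, b, c) where a and c never co-occur in a document
    but each co-occurs with the intermediate term b."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    links = set()
    for b, ns in neighbours.items():
        for a, c in combinations(sorted(ns), 2):
            if c not in neighbours[a]:
                links.add((a, b, c))
    return links
```

Calling `indirect_links(cooccurrences(DOCS, TERMS))` on the placeholder corpus links "compound a" to "disease c" through "protein b", although no single document mentions both. Such a link is only a candidate that would still need experimental validation.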
TM analysis typically involves a number of distinct phases, reviewed among others in [17], [18], [21], [22], which are
Application of TM to biomedical problems
The above section describes the steps in a typical TM workflow. Below we discuss a number of examples in which one or more steps from TM workflows have been used to address biomedical questions.
Outlook
TM pipelines have evolved to a stage where they can be used to efficiently retrieve and analyze the increasing number of research papers in the biomedical field in order to annotate results from high-throughput -omics technologies and to support the development of new research hypotheses. However, a number of challenges remain.
References (136)
et al., Trends Biotechnol. (2006)
et al., Trends Biochem. Sci. (2001)
J. Biomed. Inf. (2013)
J. Biomed. Inf. (2004)
et al., Comput. Biol. Chem. (2004)
J. Biomed. Inf. (2014)
et al., Comput. Biol. Chem. (2009)
et al., J. Biomed. Inf. (2010)
Int. J. Med. Inf. (2008)
Drug Discov. Today (2013)
Drug Discov. Today
J. Biol. Chem.
Int. J. Med. Inf.
Acta Inf. Med.
Nat. Rev. Genet.
Nucleic Acids Res.
BMC Bioinf.
Proteomics Clin. Appl.
Nucleic Acids Res.
J. Proteome Res.
Nucleic Acids Res.
J. Chem. Inf. Model.
Comput. Methods Programs Biomed.
Comput. Math. Methods Med.
PLoS Comput. Biol.
Drug Discov. Today
Towards semi-automated curation: using text mining to recreate the HIV-1, human protein interaction database
Database (Oxford)
Brief. Bioinf.
Brief. Bioinf.
Genome Biol.
J. Comput. Biol.
Proc. Assoc. Comput. Linguist.
Database (Oxford)
BMC Med. Inf. Decis. Mak.
Bioinformatics
Nucleic Acids Res.
Bioinformatics
Nucleic Acids Res.
BMC Bioinf.
BMC Bioinf.
Genome Biol.
Bioinformatics
Physiol. Genomics
J. Biomed. Discov. Collab.
BMC Bioinf.
Database (Oxford)
Br. J. Surg.
Comput. Syst. Bioinf. Conf.
Eur. J. Phys. Rehabil. Med.