Elsevier

Methods

Volume 74, 1 March 2015, Pages 97-106

Application of text mining in the biomedical domain

https://doi.org/10.1016/j.ymeth.2015.01.015

Highlights

  • The volume of biomedical data and the number of research papers grow exponentially.

  • We give an overview of how text mining is used to extract meaningful facts from these data.

  • Text mining workflows consist of information retrieval, information extraction and knowledge discovery.

  • The underlying principles of text mining are explained.

  • Application to biomedical use cases is discussed.

Abstract

In recent years the amount of experimental data produced in biomedical research and the number of papers published in this field have grown rapidly. To keep up to date with developments in their field of interest and to interpret the outcome of experiments in light of all available literature, researchers increasingly turn to automated literature mining. As a consequence, text mining tools have evolved considerably in number and quality and can now be used to address a variety of research questions, ranging from de novo drug target discovery to enhanced biological interpretation of the results of high-throughput experiments. In this paper we introduce the most important techniques used for text mining and give an overview of the text mining tools currently in use and the types of problems to which they are typically applied.

Introduction

The scientific literature provides a wealth of information to researchers. It may serve as a starting point for assessing the state of the art in a particular field, or as a source of information for building research hypotheses that can subsequently be validated experimentally. Additionally, this knowledge base may serve as a source for interpreting experimental results.

A large number of bibliographic databases are available in the life sciences domain; they have been reviewed by Masic and Milinovic [1]. One of the most important entry points to scientific literature sources for biomedical research is PubMed, which gives access to more than 24 million scientific literature citations from MEDLINE, life science journals, and online books [2].

The number of articles added to the literature databases is growing fast. Fig. 1 shows the results of a PubMed search using terms that describe diseases, drugs and model organisms. In all cases, the number of papers published on these subjects has increased exponentially. In addition to the exponential growth of the literature databases, the rate at which experimental data are produced has increased as well. For example, in high-throughput gene expression profiling or proteomics experiments, the regulation of hundreds or thousands of genes and proteins is measured under multiple experimental conditions.
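Whether a series of yearly counts grows exponentially can be checked by fitting a straight line to the logarithm of the counts. The sketch below is illustrative only: the counts are synthetic, generated from an assumed growth rate rather than taken from PubMed or Fig. 1, and the function name `fit_exponential_growth` is hypothetical.

```python
import math


def fit_exponential_growth(years, counts):
    """Least-squares fit of log(count) = a + r * year.

    Returns the yearly growth rate r and the doubling time ln(2) / r.
    """
    n = len(years)
    logs = [math.log(c) for c in counts]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    rate = sxy / sxx
    return rate, math.log(2) / rate


# Synthetic counts (NOT real PubMed data), generated with an assumed
# growth rate of 0.07 per year purely to exercise the fit.
years = list(range(1990, 2015))
counts = [1000.0 * math.exp(0.07 * (y - 1990)) for y in years]

rate, doubling_time = fit_exponential_growth(years, counts)
print(f"growth rate per year: {rate:.3f}")
print(f"doubling time (years): {doubling_time:.1f}")
```

With real yearly hit counts in place of the synthetic series, a roughly linear log plot and a stable fitted rate would support the description of the growth as exponential.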

Retrieving relevant information from literature databases and combining it with experimental output is time consuming and requires careful selection of keywords and drafting of queries. The process is also prone to bias and often yields incomplete search results, which prevents researchers from realizing the full potential of these databases [3].

Automated processing and analysis of text, referred to as text mining (TM), can assist researchers in evaluating the scientific literature. Nowadays TM is applied to answer many different research questions, ranging from the discovery of drug targets and biomarkers from high-throughput experiments [4], [5], [6], [7], [8], [9] to drug repositioning, the creation of a state-of-the-art overview of a certain disease or therapeutic area, and the creation of domain-specific databases [10], [11], [12], [13], [14], [15].

Due to the heterogeneous nature of written resources, the automated extraction of relevant biological knowledge is not trivial. As a consequence TM has evolved into a sophisticated and specialized field in the biomedical sciences where text processing and machine learning techniques are combined with mining of biological pathways and gene expression databases.

A number of reviews of TM in the biomedical domain exist. These often emphasize the technical aspects of TM and the available tools, or focus on gene- and protein-oriented information, and pay less attention to applications and real-life research questions that go beyond gene and protein research [16], [17], [18], [19].

Here we give a state-of-the-art overview of the use of TM in the biomedical domain and drug discovery. First, we give a general description of TM, the steps involved and the types of techniques used, and describe some publicly available TM systems. Subsequently, we discuss a number of examples in which TM approaches have been applied to solve actual research questions. Finally, we present an outlook in which we highlight the opportunities that TM can offer in the near future and the challenges that need to be addressed.

Section snippets

Text mining

A widely accepted definition of text mining has been provided by Marti Hearst, as “the discovery by computer of new, previously unknown information, by automatically extracting and relating information from different written resources, to reveal otherwise ‘hidden’ meanings” [20]. New hypotheses or facts that result from TM can subsequently be validated by experiments.
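One way to picture "extracting and relating information from different written resources" is a co-occurrence approach: recognize known terms in each text, then link terms that never appear together but share an intermediate term. The sketch below is a toy illustration under stated assumptions, not a method taken from this paper. The three sentences, the term dictionary and all function names are hypothetical, the entity names are placeholders rather than real findings, and real systems use far more robust term recognition than whitespace tokenization.

```python
from itertools import combinations

# Hypothetical mini-corpus with placeholder entity names (not real findings).
DOCUMENTS = [
    "compound_a inhibits gene_b in cultured cells",
    "gene_b is upregulated in disease_c patients",
    "compound_d binds gene_e",
]

# Hypothetical dictionary used for term recognition.
DICTIONARY = {"compound_a", "gene_b", "disease_c", "compound_d", "gene_e"}


def extract_terms(text):
    """Dictionary lookup: return the known terms mentioned in a text."""
    return {token for token in text.lower().split() if token in DICTIONARY}


def cooccurrences(documents):
    """Return the term pairs that appear together in at least one text."""
    pairs = set()
    for doc in documents:
        for a, b in combinations(sorted(extract_terms(doc)), 2):
            pairs.add((a, b))
    return pairs


def hidden_links(documents):
    """Return (a, c, b) where a and c never co-occur but both co-occur with b."""
    direct = cooccurrences(documents)
    neighbours = {}
    for a, b in direct:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    found = set()
    for b, linked in neighbours.items():
        for a, c in combinations(sorted(linked), 2):
            if (a, c) not in direct:
                found.add((a, c, b))
    return found


for a, c, via in sorted(hidden_links(DOCUMENTS)):
    print(f"{a} <-> {c} (via {via})")
```

In this toy corpus the only candidate link is between `compound_a` and `disease_c`, which are never mentioned together but both co-occur with `gene_b`. Such a link is a hypothesis to be tested, in line with the experimental validation mentioned above.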

TM analysis typically involves a number of distinct phases, reviewed among others in [17], [18], [21], [22], which are

Application of TM to biomedical problems

The above section describes the steps in a typical TM workflow. Below we discuss a number of examples in which one or more steps from TM workflows have been used to address biomedical questions.

Outlook

TM pipelines have evolved to a stage where they can be used to efficiently retrieve and analyze the increasing number of research papers in the biomedical field in order to annotate results from high-throughput -omics technologies and to support the development of new research hypotheses. However, a number of challenges remain.

References (136)

  • S. Ananiadou et al.

    Trends Biotechnol.

    (2006)
  • C. Perez-Iratxeta et al.

    Trends Biochem. Sci.

    (2001)
  • K.C. Huang

    J. Biomed. Inf.

    (2013)
  • J. Zhang

    J. Biomed. Inf.

    (2004)
  • L. Yeganova et al.

    Comput. Biol. Chem.

    (2004)
  • M. Skeppstedt

    J. Biomed. Inf.

    (2014)
  • L. Li et al.

    Comput. Biol. Chem.

    (2009)
  • N.C. Baker et al.

    J. Biomed. Inf.

    (2010)
  • R. Jelier

    Int. J. Med. Inf.

    (2008)
  • I. Harrow

    Drug Discov. Today

    (2013)
  • A.J. Williams

    Drug Discov. Today

    (2012)
  • T.B. Hakvoort

    J. Biol. Chem.

    (2011)
  • D. Hristovski

    Int. J. Med. Inf.

    (2005)
  • I. Masic et al.

    Acta Inf. Med.

    (2012)
  • PubMed. Available from:...
  • L.J. Jensen et al.

    Nat. Rev. Genet.

    (2006)
  • C. Plake

    Nucleic Acids Res.

    (2009)
  • Z.X. Huang

    BMC Bioinf.

    (2008)
  • A. Kentsis

    Proteomics Clin. Appl.

    (2009)
  • F. Al-Shahrour

    Nucleic Acids Res.

    (2007)
  • A.S. Haqqani

    J. Proteome Res.

    (2007)
  • W.W. Fleuren

    Nucleic Acids Res.

    (2011)
  • Y. Pan

    J. Chem. Inf. Model.

    (2014)
  • R.A. Abul Seoud et al.

    Comput. Methods Programs Biomed.

    (2013)
  • H. Li et al.

    Comput. Math. Methods Med.

    (2012)
  • K. Jensen et al.

    PLoS Comput. Biol.

    (2014)
  • D. Rebholz-Schuhmann

    Drug Discov. Today

    (2013)
  • D.G. Jamieson

    Towards semi-automated curation: using text mining to recreate the HIV-1, human protein interaction database

    Database (Oxford)

    (2012)
  • J.J. Kim et al.

    Brief. Bioinf.

    (2008)
  • P. Zweigenbaum

    Brief. Bioinf.

    (2007)
  • M. Krallinger et al.

    Genome Biol.

    (2005)
  • H. Shatkay et al.

    J. Comput. Biol.

    (2003)
  • M. Hearst

    Proc. Assoc. Comput. Linguist.

    (1999)
  • L. Hirschman

    Database (Oxford)

    (2012)
  • P. Fontelo et al.

    BMC Med. Inf. Decis. Mak.

    (2005)
  • J. Lewis

    Bioinformatics

    (2006)
  • J.F. Fontaine

    Nucleic Acids Res.

    (2009)
  • D.J. States

    Bioinformatics

    (2009)
  • K. Hokamp et al.

    Nucleic Acids Res.

    (2004)
  • M.V. Plikus et al.

    BMC Bioinf.

    (2006)
  • K.G. Becker

    BMC Bioinf.

    (2003)
  • S.M. Douglas et al.

    Genome Biol.

    (2005)
  • B. Brancotte

    Bioinformatics

    (2011)
  • S. De

    Physiol. Genomics

    (2010)
  • N.R. Smalheiser et al.

    J. Biomed. Discov. Collab.

    (2008)
  • H. Chen et al.

    BMC Bioinf.

    (2004)
  • C. Li

    Database (Oxford)

    (2013)
  • R.W. Glynn et al.

    Br. J. Surg.

    (2010)
  • W. Xuan

    Comput. Syst. Bioinf. Conf.

    (2007)
  • E. Giglia

    Eur. J. Phys. Rehabil. Med.

    (2011)