A Guide to Dictionary-Based Text Mining

Cook, Helen V.; Jensen, Lars Juhl

doi:10.1007/978-1-4939-9089-4_5

Part of the book series: Methods in Molecular Biology (MIMB, volume 1939)

Abstract

PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations it contains as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross-reference, and mine that information. This is increasingly useful as biological research becomes more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.
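
To make the steps named in the abstract concrete, the following is a minimal Python sketch of dictionary-based tagging: tokenization, longest-match named entity recognition against a dictionary, normalization to identifiers, and a simple precision/recall benchmark. It is not the JensenLab tagger or its input format; the dictionary entries, the identifiers, and the function names `tokenize`, `tag`, and `precision_recall` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy dictionary (made up for illustration):
# lowercase name -> normalized identifier.
DICTIONARY = {
    "tumor protein p53": "GENE:1",
    "p53": "GENE:1",
    "mdm2": "GENE:2",
}
# Longest dictionary name, measured in words.
MAX_WORDS = max(len(name.split()) for name in DICTIONARY)


def tokenize(text):
    """Split text into (token, start, end) triples on runs of letters and digits."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def tag(text):
    """Return (start, end, identifier) for the longest dictionary match at each position."""
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(t[0].lower() for t in tokens[i:i + n])
            if candidate in DICTIONARY:
                # Normalization: map the matched name to its identifier.
                matches.append((tokens[i][1], tokens[i + n - 1][2],
                                DICTIONARY[candidate]))
                i += n
                break
        else:
            i += 1  # no dictionary name starts at this token
    return matches


def precision_recall(predicted, gold):
    """Benchmark predicted annotations against a gold standard by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    print(tag("Tumor protein p53 binds MDM2."))
```

Matching the longest name first keeps a multi-word name such as "tumor protein p53" from being reported as the shorter synonym "p53". Event extraction is left out of this sketch.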

Author information

Authors and Affiliations

School of Clinical Medicine, University of Cambridge, Cambridge, UK
Helen V. Cook
Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
Helen V. Cook & Lars Juhl Jensen

Corresponding author

Correspondence to Lars Juhl Jensen.

Editor information

Editors and Affiliations

Department of Pathology, University of New Mexico, Albuquerque, NM, USA
Richard S. Larson
Department of Internal Medicine, University of New Mexico, Albuquerque, NM, USA
Tudor I. Oprea

Cite this protocol

Cook, H.V., Jensen, L.J. (2019). A Guide to Dictionary-Based Text Mining. In: Larson, R., Oprea, T. (eds) Bioinformatics and Drug Discovery. Methods in Molecular Biology, vol 1939. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-9089-4_5

DOI: https://doi.org/10.1007/978-1-4939-9089-4_5
Published: 09 March 2019
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-9088-7
Online ISBN: 978-1-4939-9089-4
eBook Packages: Springer Protocols