A Guide to Dictionary-Based Text Mining

  • Protocol
In: Bioinformatics and Drug Discovery

Part of the book series: Methods in Molecular Biology (MIMB, volume 1939)

Abstract

PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross-reference, and mine the data stored therein. This is increasingly useful as biological research is becoming more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.
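
To make the first three steps named above concrete (tokenization, named entity recognition, and normalization), the following is a minimal sketch of a generic dictionary-based matcher. It is not the JensenLab tagger and does not reflect its input format or matching rules. The dictionary entries, the identifiers (GENE:0001, GENE:0002), and the function names are hypothetical placeholders chosen for illustration, and the greedy longest-match strategy is an assumption.

```python
import re

# Hypothetical toy dictionary: a tuple of lowercased tokens -> normalized identifier.
# The identifiers are placeholders, not entries from any real database.
DICTIONARY = {
    ("tumor", "protein", "p53"): "GENE:0001",
    ("p53",): "GENE:0001",
    ("cyclin", "d1"): "GENE:0002",
}
MAX_LEN = max(len(key) for key in DICTIONARY)  # longest name, in tokens

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Tokenization: return (lowercased token, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def tag(text):
    """Recognize dictionary names (greedy longest match) and normalize them.

    Returns a list of (start, end, matched text, identifier) tuples.
    """
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then progressively shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in DICTIONARY:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                matches.append((start, end, text[start:end], DICTIONARY[key]))
                i += n  # skip past the matched span so matches do not overlap
                break
        else:
            i += 1  # no name starts at this token
    return matches


if __name__ == "__main__":
    for match in tag("Tumor protein p53 binds cyclin D1."):
        print(match)
```

Mapping every synonym to one identifier is what turns recognition into normalization here: "Tumor protein p53" and "p53" both resolve to the same placeholder identifier. Event extraction and benchmarking are outside the scope of this sketch.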

Author information

Corresponding author

Correspondence to Lars Juhl Jensen.

Copyright information

© 2019 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Cook, H.V., Jensen, L.J. (2019). A Guide to Dictionary-Based Text Mining. In: Larson, R., Oprea, T. (eds) Bioinformatics and Drug Discovery. Methods in Molecular Biology, vol 1939. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-9089-4_5

  • DOI: https://doi.org/10.1007/978-1-4939-9089-4_5

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-9088-7

  • Online ISBN: 978-1-4939-9089-4

  • eBook Packages: Springer Protocols
