DOI: 10.1145/2649387.2649444
short-paper

Challenges in adapting text mining for full text articles to assist pathway curation

Published: 20 September 2014

ABSTRACT

Annotation of biological pathway databases is largely driven by manual effort, with little assistance from text mining. Keeping pace with the ever-growing literature is a great challenge for pathway curators. Recent efforts have sought to fill this gap through text mining, by identifying relevant papers and the textual evidence pertaining to pathway information. In the current work, we evaluated the performance of a text mining system that extracts events describing molecular pathways from full text articles, and its potential role in assisting manual curation of pathway databases. We specifically investigated the merits of mining full text articles for extracting pathway events by comparing the performance of our system on both full text articles and biomedical abstracts. From the preliminary results, we observed that processing full text articles improved the performance of the system by nearly 22%, against a small drop of 5% in precision, in comparison with extractions from PubMed abstracts. Preliminary analysis of the text mining results for selected pathways from PharmGKB suggests that pathway curators use their biological knowledge to infer new information that goes beyond what is often expressed in either the full text articles or the abstracts. This study is an attempt to identify the magnitude of the gaps that exist between text mining deliverables and the demands of pathway curation.


Published in

BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
September 2014
851 pages
ISBN: 9781450328944
DOI: 10.1145/2649387
General Chairs: Pierre Baldi, Wei Wang

Copyright © 2014 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 254 of 885 submissions, 29%
