ABSTRACT
Annotation of biological pathway databases is largely driven by manual effort, with little assistance from text mining. Keeping pace with the ever-growing literature is a great challenge for pathway curators. Recent efforts have tried to fill this gap through text mining, by identifying the relevant papers and the textual evidence pertaining to pathway information. In the current work, we evaluated the performance of a text mining system that extracts events describing molecular pathways from full text articles, and its potential role in assisting manual curation of pathway databases. We specifically investigated the merits of mining full text articles for pathway events by comparing the performance of our system on full text articles and on biomedical abstracts. In preliminary results, processing full text articles improved the performance of the system by nearly 22%, against a small drop of 5% in precision, compared with extractions from PubMed abstracts. Preliminary analysis of the text mining results for selected pathways from PharmGKB suggests that pathway curators use their biological knowledge to infer new information that goes beyond what is expressed in either the full text articles or the abstracts. This study is an attempt to identify the magnitude of the gaps between what text mining delivers and what pathway curation demands.