research-article

Automatic rule refinement for information extraction

Published: 01 September 2010

Abstract

Rule-based information extraction from text is increasingly used to populate databases and to support structured queries on unstructured text. Specifying suitable information extraction rules requires considerable skill, and standard practice is to refine rules iteratively, with substantial effort. In this paper, we show that techniques developed in the context of data provenance, to determine the lineage of a tuple in a database, can be leveraged to assist in rule refinement. Specifically, given a set of extraction rules and examples of correct and incorrect extracted data, we have developed a technique that suggests a ranked list of rule modifications for an expert rule specifier to consider. We implemented our technique in the SystemT information extraction system developed at IBM Research -- Almaden and experimentally demonstrate its effectiveness.
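The idea in the abstract can be illustrated with a toy sketch. It is not the paper's algorithm and does not use SystemT or its rule language; every name in it (`rank_candidates`, `survives`, the rules `R1` to `R3`, and the sample tuples) is hypothetical. It assumes each extracted tuple carries its provenance as a list of alternative derivations, each derivation being the set of rules used, and it considers only one kind of modification: disabling a single rule. Candidates are ranked by how many incorrect tuples they eliminate minus how many correct tuples they lose.

```python
# Toy illustration only: provenance-guided ranking of candidate rule changes.
# All rule names, tuples, and the scoring function are assumptions made for
# this sketch; none of them come from the paper or from SystemT.

def survives(derivations, disabled_rule):
    """A tuple is still produced if some derivation avoids the disabled rule."""
    return any(disabled_rule not in d for d in derivations)


def rank_candidates(provenance, labels):
    """Rank 'disable rule R' candidates by net benefit.

    provenance: dict mapping tuple -> list of derivations (sets of rule names)
    labels:     dict mapping tuple -> True (correct) or False (incorrect)
    Returns a list of (rule, incorrect_removed, correct_lost), best first.
    """
    rules = sorted({r for ds in provenance.values() for d in ds for r in d})
    ranked = []
    for rule in rules:
        incorrect_removed = 0
        correct_lost = 0
        for tup, derivations in provenance.items():
            if not survives(derivations, rule):
                if labels[tup]:
                    correct_lost += 1
                else:
                    incorrect_removed += 1
        ranked.append((rule, incorrect_removed, correct_lost))
    # Highest net gain first; ties broken by rule name for determinism.
    ranked.sort(key=lambda c: (-(c[1] - c[2]), c[0]))
    return ranked


# Hypothetical labeled output of a person-name extractor.
provenance = {
    "John Smith": [{"R1"}],
    "Dear Sir": [{"R2"}],
    "Best Regards": [{"R2"}],
    "Mary Jones": [{"R1"}, {"R2"}],  # derivable two independent ways
    "Dr. Lee": [{"R3"}],
}
labels = {
    "John Smith": True,
    "Dear Sir": False,
    "Best Regards": False,
    "Mary Jones": True,
    "Dr. Lee": True,
}

if __name__ == "__main__":
    for rule, removed, lost in rank_candidates(provenance, labels):
        print(f"disable {rule}: removes {removed} incorrect, loses {lost} correct")
```

In this toy data, disabling `R2` ranks first: it removes both incorrect tuples, while "Mary Jones" survives because it has a second derivation through `R1`. Tracking alternative derivations, rather than a single producing rule, is what lets the ranking tell a harmless change from one that loses correct results.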



Published in

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages

Publisher: VLDB Endowment

