Abstract
Rule-based information extraction from text is increasingly used to populate databases and to support structured queries over unstructured text. Specifying suitable extraction rules requires considerable skill, and standard practice is to refine rules iteratively, with substantial effort. In this paper, we show that techniques developed in the context of data provenance, which determine the lineage of a tuple in a database, can be leveraged to assist in rule refinement. Specifically, given a set of extraction rules together with examples of correct and incorrect extracted data, we have developed a technique that suggests a ranked list of rule modifications for an expert rule specifier to consider. We implemented our technique in the SystemT information extraction system developed at IBM Research - Almaden and experimentally demonstrate its effectiveness.
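To make the idea of lineage-driven refinement concrete, here is a minimal sketch. It is not the algorithm from the paper and does not use SystemT or its rule language. It assumes each extracted tuple records the set of rules that contributed to it (its lineage) and that every rule in the lineage is required to produce the tuple. The data, the rule names `R1` to `R3`, the function `rank_rule_modifications`, and the scoring formula are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: each extracted tuple carries its lineage (the set of
# rule names that contributed to it) and a label saying whether it is correct.
extractions = [
    {"value": "John Smith", "lineage": {"R1", "R3"}, "correct": True},
    {"value": "New York", "lineage": {"R2", "R3"}, "correct": False},
    {"value": "Jane Doe", "lineage": {"R1", "R3"}, "correct": True},
    {"value": "Dear Sir", "lineage": {"R2", "R3"}, "correct": False},
    {"value": "Bob Lee", "lineage": {"R2", "R3"}, "correct": True},
]


def rank_rule_modifications(extractions):
    """Rank rules by the net benefit of disabling them.

    Assumed scoring: (incorrect tuples removed) - (correct tuples lost),
    where disabling a rule removes every tuple whose lineage contains it.
    """
    removed_bad = defaultdict(int)
    lost_good = defaultdict(int)
    for tup in extractions:
        for rule in tup["lineage"]:
            if tup["correct"]:
                lost_good[rule] += 1
            else:
                removed_bad[rule] += 1
    rules = set(removed_bad) | set(lost_good)
    scored = [(rule, removed_bad[rule] - lost_good[rule]) for rule in rules]
    # Highest net benefit first; ties are broken by rule name.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


for rule, score in rank_rule_modifications(extractions):
    print(rule, score)
```

On this toy input, `R2` ranks first because disabling it removes both incorrect tuples at the cost of one correct tuple. A practical system would consider finer-grained modifications than disabling a whole rule, which this sketch does not attempt.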
Index Terms
- Automatic rule refinement for information extraction