research-article

Automatic rule refinement for information extraction

Authors:
Bin Liu (University of Michigan),
Laura Chiticariu (IBM Research - Almaden),
Vivian Chu (IBM Research - Almaden),
H. V. Jagadish (University of Michigan),
Frederick R. Reiss (IBM Research - Almaden)

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, pp. 588–597. https://doi.org/10.14778/1920841.1920916

Published: 01 September 2010

Abstract

Rule-based information extraction from text is increasingly being used to populate databases and to support structured queries on unstructured text. Specifying suitable information extraction rules requires considerable skill, and standard practice is to refine rules iteratively, with substantial effort. In this paper, we show that techniques developed in the context of data provenance, to determine the lineage of a tuple in a database, can be leveraged to assist in rule refinement. Specifically, given a set of extraction rules and correct and incorrect extracted data, we have developed a technique to suggest a ranked list of rule modifications that an expert rule specifier can consider. We implemented our technique in the SystemT information extraction system developed at IBM Research - Almaden and experimentally demonstrate its effectiveness.
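
The idea in the abstract can be illustrated with a toy sketch. It is not the paper's algorithm and does not use SystemT; it only assumes that each extracted output carries its lineage (the set of rules that derived it) and a correct/incorrect label. The only modification considered is disabling a single rule, and the rule names, example outputs, and scoring (false positives removed minus true positives lost) are hypothetical choices made for illustration.

```python
def rank_refinements(provenance, labels):
    """Rank the hypothetical modification "disable rule r" for every rule r.

    provenance: dict mapping each extracted output to the set of rule
                names that derived it (its lineage).
    labels:     dict mapping each output to True (correct) or False
                (incorrect).
    Returns a list of (rule, false_positives_removed, true_positives_lost),
    best candidate first.
    """
    rules = sorted(set().union(*provenance.values()))
    ranked = []
    for rule in rules:
        # An output disappears only if the disabled rule was its sole source.
        removed = [out for out, srcs in provenance.items() if srcs <= {rule}]
        fp_removed = sum(1 for out in removed if not labels[out])
        tp_lost = sum(1 for out in removed if labels[out])
        ranked.append((rule, fp_removed, tp_lost))
    # Higher net benefit first; ties broken by rule name for determinism.
    ranked.sort(key=lambda r: (-(r[1] - r[2]), r[0]))
    return ranked


# Hypothetical person-name extractor with three rules R1..R3.
provenance = {
    "John Smith": {"R1"},
    "Morgan Stanley": {"R1"},
    "Mary Jones": {"R1", "R2"},
    "Dear Laura": {"R3"},
    "Almaden": {"R3"},
}
labels = {
    "John Smith": True,
    "Morgan Stanley": False,
    "Mary Jones": True,
    "Dear Laura": False,
    "Almaden": False,
}

ranked = rank_refinements(provenance, labels)
for rule, fp_removed, tp_lost in ranked:
    print(f"disable {rule}: removes {fp_removed} incorrect, loses {tp_lost} correct")
```

In this toy data, disabling R3 ranks first because it removes two incorrect outputs without losing a correct one. A realistic system would consider finer-grained modifications than disabling a whole rule, which is where lineage information becomes more valuable.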

Index Terms

Automatic rule refinement for information extraction
1. Information systems
  1. Data management systems
    1. Database management system engines

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages
ISSN: 2150-8097
Publisher: VLDB Endowment