research-article

SystemT: a system for declarative information extraction

Authors:
Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu
IBM Almaden Research Center

ACM SIGMOD Record, Volume 37, Issue 4, December 2008, pp. 7–13. https://doi.org/10.1145/1519103.1519105

Published: 20 March 2009

Abstract

As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in information extraction (IE), the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammar-based extraction, these techniques can neither (i) scale to large data sets nor (ii) support the expressivity requirements of complex information tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and cost-based optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in reducing the running time of complex extraction tasks by orders of magnitude.
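
To make the algebraic idea in the abstract concrete, the following is a minimal sketch and not SystemT's actual algebra or rule language. It assumes that extraction can be modeled as operators producing tuples of character spans, with a join operator combining them. All names here (`Span`, `extract_regex`, `extract_dictionary`, `follows_within`) are hypothetical and chosen for illustration only. The sketch also shows that two evaluation orders of the same join return the same result, which is the property that lets an optimizer pick whichever plan it estimates to be cheaper.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class Span(NamedTuple):
    begin: int  # inclusive character offset
    end: int    # exclusive character offset


def extract_regex(pattern: str, text: str) -> List[Span]:
    """Hypothetical extraction operator: one span per regex match."""
    return [Span(m.start(), m.end()) for m in re.finditer(pattern, text)]


def extract_dictionary(terms: Iterable[str], text: str) -> List[Span]:
    """Hypothetical extraction operator: one span per whole-word term occurrence."""
    spans: List[Span] = []
    for term in terms:
        spans.extend(extract_regex(r"\b" + re.escape(term) + r"\b", text))
    return sorted(spans)


def follows_within(left: List[Span], right: List[Span],
                   max_gap: int) -> List[Tuple[Span, Span]]:
    """Hypothetical join operator: keep (l, r) pairs where r starts
    between 0 and max_gap characters after l ends."""
    return [(l, r) for l in left for r in right
            if 0 <= r.begin - l.end <= max_gap]


text = "Call Alice at 555-1234 today. Bob moved; the old number 555-9876 is dead."
names = extract_dictionary(["Alice", "Bob"], text)
phones = extract_regex(r"\d{3}-\d{4}", text)

# Plan A: iterate over names in the outer loop.
plan_a = follows_within(names, phones, max_gap=5)

# Plan B: iterate over phones in the outer loop; same predicate, same output set.
plan_b = sorted((l, r) for r in phones for l in names
                if 0 <= r.begin - l.end <= 5)

matches = [(text[l.begin:l.end], text[r.begin:r.end]) for l, r in plan_a]
print(matches)
```

Because the query is stated as operators over span tuples rather than as a fixed cascade of grammar stages, the evaluation order is left open. In this toy version both plans cost the same, but with inputs of very different sizes or operators of very different cost, the choice of plan would matter.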

Index Terms

SystemT: a system for declarative information extraction
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results

Published in

ACM SIGMOD Record, Volume 37, Issue 4
December 2008
116 pages
ISSN: 0163-5808
DOI: 10.1145/1519103

Copyright © 2009 Authors

Publisher
Association for Computing Machinery
New York, NY, United States

Publication History
- Published: 20 March 2009

Qualifiers
- research-article

