SystemT: a system for declarative information extraction

Published: 20 March 2009
Abstract

As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in information extraction (IE), the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammar-based extraction, these techniques are unable to (i) scale to large data sets and (ii) support the expressivity requirements of complex information tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and cost-based optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in providing orders-of-magnitude reductions in the running time of complex extraction tasks.

Published in

ACM SIGMOD Record, Volume 37, Issue 4
December 2008
116 pages
ISSN: 0163-5808
DOI: 10.1145/1519103

Copyright © 2009 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
