Abstract
As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in the area of information extraction (IE), the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammar-based extraction, these techniques are unable to: (i) scale to large data sets, and (ii) support the expressivity requirements of complex information tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and cost-based optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in reducing the running time of complex extraction tasks by orders of magnitude.
Index Terms
- SystemT: a system for declarative information extraction