research-article

INDREX: in-database distributional relation extraction

Authors:
Torsten Kilias, Technische Universität Berlin, Berlin, Germany
Alexander Löser, Technische Universität Berlin, Berlin, Germany
Periklis Andritsos, University of Toronto, Toronto, Canada
DOLAP '13: Proceedings of the sixteenth international workshop on Data warehousing and OLAP, October 2013, Pages 93–100. https://doi.org/10.1145/2513190.2513196

Published: 28 October 2013

ABSTRACT

Relation extraction transforms the textual representation of a relationship into the relational model of a data warehouse. Early systems, such as SystemT by IBM or the open-source system GATE, solve this task with handcrafted rule sets that the system executes document by document. The user must therefore follow a highly interactive and iterative process: reading a document, expressing rules, testing these rules on the next document, and refining them. Until now, these systems have leveraged neither the full potential of built-in declarative query languages nor the indexing and query optimization techniques of a modern RDBMS, which would enable interactive rule refinement across documents and on the entire corpus. We propose the INDREX system, which for the first time enables a user to describe corpus-wide extraction tasks in a declarative language and to run interactive rule refinement queries. To enable this powerful functionality, we extend a standard PostgreSQL installation with a set of white-box user-defined functions that perform corpus-wide transformations from sentences into relationships. We store the text corpus and rules in the same RDBMS that already holds domain-specific structured data. As a result, (1) the user can leverage this data to further adapt rules to the target domain, (2) the user does not need an additional system for rule extraction, and (3) the INDREX system can leverage the full power of the built-in indexing and query optimization techniques of the underlying RDBMS. In a preliminary study, we report on the feasibility of this disruptive approach and show multiple queries in INDREX on the Reuters Corpus, Volume 1.
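The idea of expressing an extraction rule as a declarative, corpus-wide query can be illustrated with a minimal sketch. It is not the INDREX implementation: SQLite (from the Python standard library) stands in for PostgreSQL, and the `span` table, its columns, the `RULE` query, and the toy sentences with made-up company names are all assumptions introduced here for illustration. The rule joins annotated spans within a sentence and keeps organization pairs that surround a given verb, so a single query runs over every document at once instead of document by document.

```python
import sqlite3


def build_corpus(conn):
    """Create a hypothetical span table: one row per annotated token span."""
    conn.execute(
        "CREATE TABLE span ("
        " doc_id INTEGER, sent_id INTEGER,"
        " start_pos INTEGER, end_pos INTEGER,"  # token offsets in the sentence
        " kind TEXT, text TEXT)"
    )
    rows = [
        # doc 1, sentence 1: "Acme acquired Globex" (invented example data)
        (1, 1, 0, 1, "ORG", "Acme"),
        (1, 1, 1, 2, "VERB", "acquired"),
        (1, 1, 2, 3, "ORG", "Globex"),
        # doc 2, sentence 1: "Initech acquired Hooli"
        (2, 1, 0, 1, "ORG", "Initech"),
        (2, 1, 1, 2, "VERB", "acquired"),
        (2, 1, 2, 3, "ORG", "Hooli"),
        # doc 2, sentence 2: "Hooli praised Initech"
        (2, 2, 0, 1, "ORG", "Hooli"),
        (2, 2, 1, 2, "VERB", "praised"),
        (2, 2, 2, 3, "ORG", "Initech"),
    ]
    conn.executemany("INSERT INTO span VALUES (?, ?, ?, ?, ?, ?)", rows)
    # An ordinary index the query optimizer may use for the self-joins.
    conn.execute("CREATE INDEX span_sent ON span (doc_id, sent_id, kind)")


# Hypothetical rule: ORG span, then a verb span, then another ORG span,
# all within the same sentence. The verb is a query parameter.
RULE = """
SELECT a.text AS subject, b.text AS object, a.doc_id
FROM span AS a
JOIN span AS v ON v.doc_id = a.doc_id AND v.sent_id = a.sent_id
JOIN span AS b ON b.doc_id = a.doc_id AND b.sent_id = a.sent_id
WHERE a.kind = 'ORG' AND b.kind = 'ORG' AND v.kind = 'VERB'
  AND v.text = ?
  AND a.end_pos <= v.start_pos
  AND v.end_pos <= b.start_pos
ORDER BY a.doc_id, a.sent_id
"""


def extract(conn, verb):
    """Run the rule over the whole corpus and return (subject, object, doc_id)."""
    return conn.execute(RULE, (verb,)).fetchall()


conn = sqlite3.connect(":memory:")
build_corpus(conn)
acquisitions = extract(conn, "acquired")
print(acquisitions)
```

Refining the rule in this sketch means editing the query text or its parameter and rerunning it, which is the kind of interactive, corpus-wide loop the abstract describes. A real deployment would replace the hand-entered spans with the output of linguistic preprocessing.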

Index Terms

INDREX: in-database distributional relation extraction
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Data management systems
    1. Database management system engines

Published in
DOLAP '13: Proceedings of the sixteenth international workshop on Data warehousing and OLAP
October 2013
110 pages
ISBN: 9781450324120
DOI: 10.1145/2513190
General Chair: Il-Yeol Song, Drexel University, USA
Program Chairs: Ladjel Bellatreche, ISAE-ENSMA, France; Alfredo Cuzzocrea, ICAR-CNR and University of Calabria, Italy
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
ad-hoc reports from text data
information extraction
iterative etl
managing text data in a rdbms
Qualifiers
- research-article
Acceptance Rates
DOLAP '13 Paper Acceptance Rate: 13 of 26 submissions, 50%
Overall Acceptance Rate: 29 of 79 submissions, 37%