ABSTRACT
Relation extraction transforms the textual representation of a relationship into the relational model of a data warehouse. Early systems, such as IBM's SystemT or the open-source system GATE, solve this task with handcrafted rule sets that the system executes document by document. The user must therefore follow a highly interactive and iterative process: reading a document, expressing rules, testing these rules on the next document, and refining them. Until now, these systems have leveraged neither the full potential of built-in declarative query languages nor the indexing and query optimization techniques of a modern RDBMS, which would enable interactive rule refinement across documents and over the entire corpus. We propose the INDREX system, which for the first time enables a user to describe corpus-wide extraction tasks in a declarative language and to run interactive rule refinement queries. To provide this powerful functionality, we extend a standard PostgreSQL installation with a set of white-box user-defined functions that enable corpus-wide transformations from sentences into relationships. We store the text corpus and rules in the same RDBMS that already holds domain-specific structured data. As a result, (1) the user can leverage this data to further adapt rules to the target domain, (2) the user does not need an additional system for rule extraction, and (3) the INDREX system can leverage the full power of the built-in indexing and query optimization techniques of the underlying RDBMS. In a preliminary study, we report on the feasibility of this disruptive approach and show multiple queries in INDREX on the Reuters Corpus, Volume 1.
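To make the idea of a corpus-wide declarative extraction rule concrete, the following is a minimal sketch and not the INDREX implementation. It assumes a hypothetical `span` table that holds pre-annotated token spans per sentence, and it uses Python's built-in `sqlite3` module in place of PostgreSQL so that it runs without a server. The table layout, the rule, and the sample sentences are all illustrative assumptions; the white-box user-defined functions mentioned in the abstract are not modeled here.

```python
import sqlite3

def build_corpus(conn):
    """Create a hypothetical span table and load a tiny toy corpus."""
    conn.executescript("""
        CREATE TABLE span (
            doc_id    INTEGER,
            sent_id   INTEGER,
            start_tok INTEGER,
            end_tok   INTEGER,
            kind      TEXT,   -- annotation type, e.g. ORG or VERB
            value     TEXT    -- entity name or verb lemma
        );
        -- An ordinary index lets the RDBMS find matching annotations quickly.
        CREATE INDEX span_kind_value ON span (kind, value);
    """)
    rows = [
        # doc 1, sentence 1: "IBM acquired Lotus"
        (1, 1, 0, 0, "ORG", "IBM"),
        (1, 1, 1, 1, "VERB", "acquire"),
        (1, 1, 2, 2, "ORG", "Lotus"),
        # doc 2, sentence 1: "Oracle acquired Sun"
        (2, 1, 0, 0, "ORG", "Oracle"),
        (2, 1, 1, 1, "VERB", "acquire"),
        (2, 1, 2, 2, "ORG", "Sun"),
        # doc 2, sentence 2: "Sun praised Oracle"
        (2, 2, 0, 0, "ORG", "Sun"),
        (2, 2, 1, 1, "VERB", "praise"),
        (2, 2, 2, 2, "ORG", "Oracle"),
    ]
    conn.executemany("INSERT INTO span VALUES (?, ?, ?, ?, ?, ?)", rows)

# Illustrative rule: ORG, then a given verb, then ORG, all in one sentence.
# The query is evaluated over every sentence of every document at once.
RULE = """
SELECT a.value AS arg1, b.value AS arg2
FROM span AS a
JOIN span AS v ON v.doc_id = a.doc_id AND v.sent_id = a.sent_id
JOIN span AS b ON b.doc_id = a.doc_id AND b.sent_id = a.sent_id
WHERE a.kind = 'ORG' AND b.kind = 'ORG'
  AND v.kind = 'VERB' AND v.value = ?
  AND a.end_tok < v.start_tok
  AND v.end_tok < b.start_tok
ORDER BY a.doc_id, a.sent_id
"""

def extract(conn, verb):
    """Run the rule for one verb lemma and return (arg1, arg2) pairs."""
    return conn.execute(RULE, (verb,)).fetchall()

conn = sqlite3.connect(":memory:")
build_corpus(conn)
acquisitions = extract(conn, "acquire")
print(acquisitions)
```

In this sketch, refining the rule amounts to editing the query or its parameter and re-running it, for example `extract(conn, "praise")`, which re-evaluates the rule over the whole toy corpus instead of one document at a time.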
Index Terms
- INDREX: in-database distributional relation extraction