ABSTRACT
Modern text analytics applications operate on large volumes of temporal text data such as Web archives, newspaper archives, blogs, wikis, and micro-blogs. In these settings, searching and mining need to apply constraints on the time dimension in addition to keyword constraints. A natural approach to such queries is an inverted index whose entries are enriched with valid-time intervals. It has been shown that these indexes have to be partitioned along time in order to achieve efficiency. However, when the temporal predicate corresponds to a long time range that requires processing multiple partitions, naive query processing incurs a high cost from reading redundant entries across partitions.
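The redundancy described above can be illustrated with a toy example. The following is a minimal sketch, not the index layout used in the paper: the `Posting` type, the half-open intervals, the partition boundaries, and the function names are all assumptions made for illustration. A posting whose valid-time interval spans several partitions is stored in each of them, so a long-range query reads it more than once.

```python
from collections import namedtuple

# Hypothetical posting: a document version valid over the half-open interval [begin, end).
Posting = namedtuple("Posting", ["doc_id", "begin", "end"])

def build_partitions(postings, boundaries):
    """Store each posting in every time partition that its interval overlaps."""
    partitions = {}
    for lo, hi in zip(boundaries, boundaries[1:]):
        partitions[(lo, hi)] = [p for p in postings if p.begin < hi and p.end > lo]
    return partitions

def naive_query(partitions, q_begin, q_end):
    """Read every partition overlapping [q_begin, q_end) in full.

    Returns the number of index entries read and the set of distinct results.
    """
    entries_read = 0
    results = set()
    for (lo, hi), plist in partitions.items():
        if lo < q_end and hi > q_begin:
            entries_read += len(plist)
            results.update(p for p in plist if p.begin < q_end and p.end > q_begin)
    return entries_read, results

# Toy postings list for one keyword (illustrative data only).
postings = [Posting("d1", 0, 30), Posting("d2", 5, 12), Posting("d3", 18, 30)]
partitions = build_partitions(postings, [0, 10, 20, 30])
entries_read, results = naive_query(partitions, 0, 30)
print(entries_read, len(results))
```

In this toy setting the query reads more entries than it returns distinct results; the difference is the redundant reading that the abstract refers to.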
We present a framework for efficient approximate processing of keyword queries over a temporally partitioned inverted index which minimizes this overhead, thus speeding up query processing. By using a small synopsis for each partition, we identify partitions that maximize the number of final non-redundant results and schedule them for processing early on. Our approach aims to balance the estimated gain in final result recall against the cost of the index reads required. We present practical algorithms for the resulting optimization problem of index partition selection. Our experiments with three diverse, large-scale text archives reveal that our proposed approach can provide close to 80% result recall even when only about half the index is allowed to be read.
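The partition selection problem can be pictured as choosing partitions under a read budget so that as many distinct results as possible are covered. The sketch below is a simple greedy heuristic of that kind, offered as an illustration and not as the algorithms proposed in the paper. It makes several assumptions: exact sets of document identifiers stand in for the per-partition synopses, the cost of a partition is an abstract positive read cost, and the names `select_partitions`, `partitions`, and `budget` are hypothetical.

```python
def select_partitions(partitions, budget):
    """Greedily pick partitions by new results per unit of read cost.

    partitions: dict mapping a partition name to (cost, set_of_result_ids),
                where cost is a positive number (assumed read cost).
    budget:     maximum total cost that may be spent.
    Returns the chosen partition names in processing order, the covered
    result ids, and the total cost spent.
    """
    covered = set()
    chosen = []
    spent = 0
    remaining = dict(partitions)
    while remaining:
        best, best_ratio = None, 0.0
        for name, (cost, ids) in remaining.items():
            if spent + cost > budget:
                continue  # reading this partition would exceed the budget
            gain = len(ids - covered)  # results not yet covered by earlier picks
            ratio = gain / cost
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break  # nothing affordable adds any new result
        cost, ids = remaining.pop(best)
        covered |= ids
        spent += cost
        chosen.append(best)
    return chosen, covered, spent

# Illustrative data: four partitions with overlapping result sets.
partitions = {
    "p1": (2, {"d1", "d2", "d3"}),
    "p2": (4, {"d1", "d2", "d3", "d4"}),
    "p3": (2, {"d4", "d5"}),
    "p4": (3, {"d5", "d6"}),
}
chosen, covered, spent = select_partitions(partitions, budget=6)
all_results = set().union(*(ids for _, ids in partitions.values()))
recall = len(covered) / len(all_results)
print(chosen, spent, round(recall, 2))
```

A ratio-based greedy choice like this favors cheap partitions that contribute many new results, which matches the trade-off between recall gain and read cost described above. In a real system the gain would be estimated from synopses, so the selection would work with approximate rather than exact overlap counts.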
Index Terms
- Efficient temporal keyword search over versioned text