research-article

Efficient temporal keyword search over versioned text

Authors:
Avishek Anand, Max-Planck Institute for Informatics, Saarbrücken, Germany
Srikanta Bedathur, Max-Planck Institute for Informatics, Saarbrücken, Germany
Klaus Berberich, Max-Planck Institute for Informatics, Saarbrücken, Germany
Ralf Schenkel, Saarland University, Saarbrücken, Germany

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management, October 2010, Pages 699–708, https://doi.org/10.1145/1871437.1871528

Published: 26 October 2010

ABSTRACT

Modern text analytics applications operate on large volumes of temporal text data such as Web archives, newspaper archives, blogs, wikis, and micro-blogs. In these settings, searching and mining need to apply constraints on the time dimension in addition to keyword constraints. A natural approach to such queries is an inverted index whose entries are enriched with valid-time intervals. It has been shown that these indexes have to be partitioned along time in order to achieve efficiency. However, when the temporal predicate covers a long time range and therefore requires processing multiple partitions, naive query processing incurs a high cost from reading redundant entries across partitions.
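
To make the redundancy concrete, here is a minimal sketch of a time-partitioned inverted index. It is an illustration only, not the paper's data structure: the posting layout `(doc_version_id, valid_from, valid_to)`, the half-open intervals, the partition boundaries, and the function names are all assumptions. A posting is copied into every partition its valid-time interval overlaps, so a long-range query that scans several partitions reads the same entry more than once.

```python
from collections import defaultdict

# Hypothetical postings: (doc_version_id, valid_from, valid_to),
# with half-open valid-time intervals [valid_from, valid_to).
postings = {
    "search": [("d1v1", 0, 10), ("d2v1", 4, 6), ("d3v1", 8, 20)],
}

# Hypothetical partition boundaries along the time axis.
partitions = [(0, 5), (5, 10), (10, 20)]


def overlaps(a_begin, a_end, b_begin, b_end):
    """True if the half-open intervals [a_begin, a_end) and [b_begin, b_end) intersect."""
    return a_begin < b_end and b_begin < a_end


def build_partitioned_index(postings, partitions):
    """Replicate each posting into every partition its interval overlaps."""
    index = defaultdict(lambda: defaultdict(list))
    for term, plist in postings.items():
        for posting in plist:
            _, begin, end = posting
            for pid, (p_begin, p_end) in enumerate(partitions):
                if overlaps(begin, end, p_begin, p_end):
                    index[term][pid].append(posting)
    return index


def naive_query(index, partitions, term, q_begin, q_end):
    """Scan every partition overlapping the query range; count entries read."""
    entries_read = 0
    results = set()
    for pid, (p_begin, p_end) in enumerate(partitions):
        if not overlaps(q_begin, q_end, p_begin, p_end):
            continue
        for doc, begin, end in index[term][pid]:
            entries_read += 1
            if overlaps(begin, end, q_begin, q_end):
                results.add(doc)
    return results, entries_read


index = build_partitioned_index(postings, partitions)
results, entries_read = naive_query(index, partitions, "search", 0, 20)
print(sorted(results), entries_read)
```

In this toy data every posting spans two partitions, so the query over the full range reads six entries to produce three distinct results.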

We present a framework for efficient approximate processing of keyword queries over a temporally partitioned inverted index that minimizes this overhead, thus speeding up query processing. By using a small synopsis for each partition we identify partitions that maximize the number of final non-redundant results, and schedule them for processing early on. Our approach aims to balance the estimated gains in final result recall against the cost of the index reads required. We present practical algorithms for the resulting optimization problem of index partition selection. Our experiments with three diverse, large-scale text archives reveal that our proposed approach can provide close to 80% result recall even when only about half the index is allowed to be read.
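
The trade-off between recall gain and read cost can be sketched as a budgeted selection problem. The code below is not the paper's algorithm. It is a simple greedy heuristic that repeatedly picks the partition with the best ratio of new results to read cost, under assumptions made for illustration: exact result sets stand in for the per-partition synopses (which would only provide estimates), and the names `select_partitions`, `partition_docs`, `partition_cost`, and `budget` are hypothetical.

```python
def select_partitions(partition_docs, partition_cost, budget):
    """Greedily pick partitions by new results per unit of read cost.

    partition_docs: partition id -> set of result ids it would contribute
                    (assumed known here; a synopsis would estimate this).
    partition_cost: partition id -> cost of reading that partition.
    budget:         maximum total read cost allowed.
    """
    covered = set()
    chosen = []
    spent = 0
    remaining = set(partition_docs)
    while remaining:
        best, best_ratio = None, 0.0
        # Sorted iteration keeps the choice deterministic.
        for pid in sorted(remaining):
            cost = partition_cost[pid]
            if spent + cost > budget:
                continue  # reading this partition would exceed the budget
            gain = len(partition_docs[pid] - covered)  # non-redundant results
            ratio = gain / cost
            if ratio > best_ratio:
                best, best_ratio = pid, ratio
        if best is None:
            break  # nothing affordable adds new results
        chosen.append(best)
        covered |= partition_docs[best]
        spent += partition_cost[best]
        remaining.remove(best)
    return chosen, covered, spent


# Toy input: four partitions with overlapping result sets.
partition_docs = {
    "p0": {"a", "b", "c", "d"},
    "p1": {"c", "d", "e"},
    "p2": {"e", "f"},
    "p3": {"a", "b"},
}
partition_cost = {"p0": 5, "p1": 4, "p2": 2, "p3": 3}

# Allow reading half of the total cost (14 / 2 = 7).
chosen, covered, spent = select_partitions(partition_docs, partition_cost, budget=7)
print(chosen, sorted(covered), spent)
```

On this toy input the heuristic reads `p2` and then `p0`, spending the whole budget of 7 and covering all six results. Real collections will not behave this neatly, and the quality of the selection depends on how accurate the synopsis-based estimates are.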

Index Terms

Efficient temporal keyword search over versioned text
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
  2. Information retrieval
    1. Information retrieval query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published in
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN: 9781450300995
DOI: 10.1145/1871437
General Chair: Jimmy Huang (York University, Canada)
Program Chairs: Nick Koudas (University of Toronto, Canada), Gareth Jones (Dublin City University, Ireland), Xindong Wu (University of Vermont, USA), Kevyn Collins-Thompson (Microsoft Research, USA), Aijun An (York University, Canada)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
partition selection
partitioned inverted index
synopses
time-travel search
Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%