DOI: 10.1145/1871437.1871528
research-article

Efficient temporal keyword search over versioned text

Published: 26 October 2010

ABSTRACT

Modern text analytics applications operate on large volumes of temporal text data such as Web archives, newspaper archives, blogs, wikis, and micro-blogs. In these settings, searching and mining need to apply constraints on the time dimension in addition to keyword constraints. A natural approach to such queries is an inverted index whose entries are enriched with valid-time intervals. It has been shown that these indexes have to be partitioned along time in order to achieve efficiency. However, when the temporal predicate covers a long time range and therefore requires processing multiple partitions, naive query processing incurs a high cost from reading redundant entries across partitions.
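
The redundancy described above can be illustrated with a minimal sketch. It is not the paper's implementation: the posting layout `(doc_id, valid_from, valid_to)`, the fixed partition width, and the names `build_index` and `naive_query` are assumptions made only for this example. Each posting is replicated into every time partition its valid-time interval overlaps, so a long-range query reads the same document several times.

```python
from collections import defaultdict

# Hypothetical postings: (doc_id, valid_from, valid_to), valid over [valid_from, valid_to).
POSTINGS = {
    "archive": [("d1", 0, 25), ("d2", 5, 12), ("d3", 18, 30)],
}
PARTITION_WIDTH = 10  # assumed fixed-width time partitions: [0,10), [10,20), ...

def build_index(postings, width):
    """Replicate each posting into every time partition its interval overlaps."""
    index = defaultdict(lambda: defaultdict(list))  # term -> partition -> postings
    for term, plist in postings.items():
        for doc, start, end in plist:
            for p in range(start // width, (end - 1) // width + 1):
                index[term][p].append((doc, start, end))
    return index

def naive_query(index, term, q_start, q_end, width):
    """Read every partition overlapping [q_start, q_end) and count entries read."""
    results, entries_read = set(), 0
    for p in range(q_start // width, (q_end - 1) // width + 1):
        for doc, start, end in index[term].get(p, []):
            entries_read += 1
            if start < q_end and end > q_start:  # interval overlaps the query range
                results.add(doc)
    return results, entries_read

index = build_index(POSTINGS, PARTITION_WIDTH)
results, entries_read = naive_query(index, "archive", 0, 30, PARTITION_WIDTH)
print(sorted(results), entries_read)
```

In this toy data the query spans three partitions and reads seven entries to return three distinct documents; the four extra reads are the cross-partition redundancy.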

We present a framework for efficient approximate processing of keyword queries over a temporally partitioned inverted index which minimizes this overhead, thus speeding up query processing. By using a small synopsis for each partition, we identify partitions that maximize the number of final non-redundant results and schedule them for processing early on. Our approach aims to balance the estimated gain in final result recall against the cost of the index reading required. We present practical algorithms for the resulting optimization problem of index partition selection. Our experiments with three diverse, large-scale text archives show that the proposed approach can provide close to 80% result recall even when only about half of the index is allowed to be read.
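
To make the partition-selection trade-off concrete, here is a minimal sketch of one possible greedy heuristic: repeatedly pick the partition with the most new results per entry read, subject to a read budget. This is an illustration under stated assumptions, not the algorithm from the paper. Exact document-id sets stand in for the per-partition synopses, the cost of a partition is taken to be its number of entries, and `select_partitions` is a hypothetical name.

```python
def select_partitions(partitions, budget):
    """Greedy pick by marginal new results per entry read, within a read budget.

    partitions: dict name -> set of doc ids (exact sets stand in for synopses)
    budget: maximum total number of entries that may be read
    """
    chosen, covered, spent = [], set(), 0
    remaining = dict(partitions)
    while remaining:
        best, best_ratio = None, 0.0
        for name, docs in remaining.items():
            cost = len(docs)  # assumed cost model: one unit per entry read
            if cost == 0 or spent + cost > budget:
                continue
            ratio = len(docs - covered) / cost  # new results per entry read
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:  # nothing affordable adds a new result
            break
        chosen.append(best)
        covered |= remaining.pop(best)
        spent += len(partitions[best])
    return chosen, covered, spent

partitions = {
    "p0": {"d1", "d2"},
    "p1": {"d1", "d2", "d3"},
    "p2": {"d1", "d3"},
}
chosen, covered, spent = select_partitions(partitions, budget=4)
print(chosen, sorted(covered), spent)
```

With a budget of four entries out of seven, this toy run selects `p0` and `p2` and still covers all three documents. A plain ratio-based greedy like this carries no optimality guarantee by itself; it only shows how estimated gain can be weighed against read cost.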

Published in

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN: 9781450300995
DOI: 10.1145/1871437

Copyright © 2010 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
