DOI: 10.1145/1772690.1772723
research-article

Scalable techniques for document identifier assignment in inverted indexes

Published: 26 April 2010

ABSTRACT

Web search engines depend on the full-text inverted index data structure. Because query processing performance depends so heavily on the size of the inverted index, a large body of research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to a significant reduction in overall index size.
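To make the link between identifier assignment and index size concrete, the following minimal sketch compares the encoded size of one posting list under two assignments. It assumes gaps between consecutive document identifiers are stored with variable-byte coding; that codec, the function names, and the sample identifiers are illustrative assumptions and are not taken from the paper.

```python
def varbyte_len(n):
    """Bytes needed to variable-byte encode a non-negative integer (7 payload bits per byte)."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def posting_list_bytes(doc_ids):
    """Size of a posting list stored as variable-byte coded gaps between sorted doc IDs."""
    ids = sorted(doc_ids)
    # The first entry is stored as-is; every later entry is stored as a gap.
    gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
    return sum(varbyte_len(g) for g in gaps)


# Hypothetical posting lists for the same term under two identifier assignments.
scattered = [1000, 50000, 900000, 2000000]  # similar documents far apart
clustered = [1000, 1001, 1002, 1003]        # similar documents given adjacent IDs

print(posting_list_bytes(scattered))  # 11
print(posting_list_bytes(clustered))  # 5
```

When documents that share terms receive nearby identifiers, the gaps shrink and each one fits in fewer bytes, which is the effect that identifier reassignment tries to exploit.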

In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
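The framework described above can be illustrated with a small sketch. It is not the authors' algorithm: it assumes min-hash signatures with banded Locality Sensitive Hashing to build the sparse candidate graph, Jaccard similarity as the edge weight, and a simple nearest-neighbor greedy walk standing in for the Traveling Salesman computation. All function names, parameters, and default values are hypothetical.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # 2**31 - 1; term IDs are assumed to be ints in [0, PRIME)


def minhash_signature(terms, hash_params):
    """Min-hash signature: the minimum of each hash function over a non-empty term set."""
    return tuple(min((a * t + b) % PRIME for t in terms) for a, b in hash_params)


def lsh_candidate_pairs(signatures, bands, rows):
    """Documents that agree on every row of at least one band become candidate pairs."""
    pairs = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(doc)
        for bucket in buckets.values():
            for i, u in enumerate(bucket):
                for v in bucket[i + 1:]:
                    pairs.add((min(u, v), max(u, v)))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two term sets."""
    return len(a & b) / len(a | b)


def order_documents(docs, num_hashes=8, bands=4, seed=0):
    """Return doc IDs in an order that tries to place similar documents next to each other."""
    rng = random.Random(seed)
    hash_params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                   for _ in range(num_hashes)]
    signatures = {d: minhash_signature(t, hash_params) for d, t in docs.items()}

    # Sparse graph: only LSH candidate pairs get an edge, weighted by similarity.
    neighbors = defaultdict(list)
    for u, v in lsh_candidate_pairs(signatures, bands, num_hashes // bands):
        w = jaccard(docs[u], docs[v])
        neighbors[u].append((w, v))
        neighbors[v].append((w, u))

    # Greedy walk: always move to the most similar unvisited neighbor;
    # when stuck, restart from the smallest unvisited doc ID.
    order, visited = [], set()
    for start in sorted(docs):
        current = start
        while current is not None and current not in visited:
            visited.add(current)
            order.append(current)
            unvisited = [(w, v) for w, v in neighbors[current] if v not in visited]
            current = max(unvisited)[1] if unvisited else None
    return order


# Hypothetical collection: doc ID -> set of integer term IDs.
docs = {0: {1, 2, 3}, 1: {10, 11, 12}, 2: {1, 2, 3}, 3: {10, 11, 12}}
order = order_documents(docs)
new_ids = {old: new for new, old in enumerate(order)}  # old doc ID -> new doc ID
print(order)
```

The point of the LSH step is that similarities are computed only for pairs that collide in some bucket, so the graph handed to the tour heuristic stays sparse instead of growing quadratically with the collection.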


Published in

WWW '10: Proceedings of the 19th international conference on World wide web
April 2010
1407 pages
ISBN: 9781605587998
DOI: 10.1145/1772690

Copyright © 2010 International World Wide Web Conference Committee (IW3C2)

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
