ABSTRACT
Web search engines depend on the full-text inverted index data structure. Because query processing performance depends heavily on the size of the inverted index, much research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant reductions in overall index size.
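To illustrate why the assignment of document identifiers affects index size, the following minimal sketch stores a posting list as gaps between consecutive identifiers and counts the bytes needed under variable-byte coding. The choice of variable-byte coding, the function names, and the example posting lists are assumptions made for illustration; they are not taken from the paper.

```python
def varbyte_len(n):
    """Bytes needed to variable-byte encode a positive integer (7 payload bits per byte)."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def posting_list_bytes(doc_ids):
    """Bytes needed to store a posting list as variable-byte coded gaps."""
    total = 0
    prev = -1  # so that the first gap is doc_id + 1, which is always positive
    for doc_id in sorted(doc_ids):
        total += varbyte_len(doc_id - prev)
        prev = doc_id
    return total


# Hypothetical posting lists for one term under two identifier assignments.
scattered = [5, 1000, 50000, 900000]  # similar documents spread over the ID space
clustered = [5, 6, 7, 8]              # similar documents given consecutive IDs

print(posting_list_bytes(scattered), posting_list_bytes(clustered))
```

In this toy case the same four postings take fewer bytes once the documents containing the term receive nearby identifiers, because every gap then fits in a single byte.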
In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. The latter approaches achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms and evaluate them in detail on three large data sets, showing improvements in index size.
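The general idea of the framework can be sketched as follows: min-hash signatures with banding (one common form of Locality Sensitive Hashing) propose a sparse set of candidate edges between similar documents, and a tour over that sparse graph yields the new document order. This is only a rough illustration under stated assumptions, not the algorithms of the paper: the greedy nearest-neighbor walk stands in for a real Traveling Salesman heuristic, Jaccard similarity is assumed as the edge weight, and all function names, parameters, and the toy documents (sets of integer term IDs) are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # 2**31 - 1; term IDs are assumed to be smaller than this


def minhash_signature(terms, hash_params):
    """One minimum per hash function h(x) = (a*x + b) mod PRIME over the term IDs."""
    return tuple(min((a * x + b) % PRIME for x in terms) for a, b in hash_params)


def lsh_candidate_edges(docs, num_hashes=24, band_size=2, seed=0):
    """Document pairs that agree on at least one band of their min-hash signatures."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_hashes)]
    buckets = defaultdict(set)
    for doc, terms in docs.items():
        sig = minhash_signature(terms, params)
        for start in range(0, num_hashes, band_size):
            buckets[(start, sig[start:start + band_size])].add(doc)
    edges = set()
    for members in buckets.values():
        edges.update(combinations(sorted(members), 2))
    return edges


def jaccard(a, b):
    return len(a & b) / len(a | b)


def greedy_order(docs, edges):
    """Greedy walk on the sparse graph: always move to the most similar unvisited neighbor."""
    neighbors = defaultdict(dict)
    for u, v in edges:
        w = jaccard(docs[u], docs[v])
        neighbors[u][v] = w
        neighbors[v][u] = w
    unvisited = set(docs)
    order = []
    for start in sorted(docs):
        current = start
        while current in unvisited:
            unvisited.remove(current)
            order.append(current)
            candidates = [(w, v) for v, w in neighbors[current].items()
                          if v in unvisited]
            if not candidates:
                break  # dead end: restart from the next unvisited document
            current = max(candidates)[1]
    return order


# Toy collection: document ID -> set of term IDs (hypothetical data).
docs = {
    0: {1, 2, 3, 4, 5},
    1: {100, 101, 102, 103},
    2: {1, 2, 3, 4, 5},      # same terms as document 0
    3: {100, 101, 102, 104},
    4: {200, 201, 202},
}
order = greedy_order(docs, lsh_candidate_edges(docs))
new_ids = {old: new for new, old in enumerate(order)}
print(order, new_ids)
```

Only pairs that share a bucket are ever compared, so the number of similarity computations depends on the bucket sizes rather than on all pairs of documents, which is the reason a sparse graph of this kind can be built for large collections.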