Scalable techniques for document identifier assignment in inverted indexes

Authors:
Shuai Ding, Polytechnic Institute of NYU, NY, USA
Josh Attenberg, Polytechnic Institute of NYU, NY, USA
Torsten Suel, Polytechnic Institute of NYU, NY, USA

WWW '10: Proceedings of the 19th international conference on World wide web, April 2010, Pages 311–320, https://doi.org/10.1145/1772690.1772723

Published: 26 April 2010

ABSTRACT

Web search engines depend on the full-text inverted index data structure. Because query processing performance depends heavily on the size of the inverted index, a large body of research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to a significant reduction in overall index size.
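
To make the link between identifier assignment and index size concrete, the following is a minimal sketch and not the paper's method. It assumes that posting lists are stored as gaps between consecutive document identifiers and that each gap is coded with an Elias gamma code; both are common choices in the compression literature, but neither is stated in this abstract. The names `gamma_bits`, `index_bits`, and the toy collection are hypothetical.

```python
def gamma_bits(gap: int) -> int:
    """Bits used by an Elias gamma code for a positive integer."""
    return 2 * gap.bit_length() - 1


def index_bits(docs: dict, order: list) -> int:
    """Total size in bits of a gap-coded index.

    docs  maps a document name to its set of terms.
    order lists the document names; position in the list is the docID.
    """
    doc_id = {name: i for i, name in enumerate(order)}
    postings = {}
    for name, terms in docs.items():
        for term in terms:
            postings.setdefault(term, []).append(doc_id[name])
    total = 0
    for ids in postings.values():
        prev = -1  # assumption: the first gap is docID + 1, so every gap is >= 1
        for d in sorted(ids):
            total += gamma_bits(d - prev)
            prev = d
    return total


# Toy collection: four documents about topic "a", four about topic "b".
docs = {f"a{i}": {"a"} for i in range(4)}
docs.update({f"b{i}": {"b"} for i in range(4)})

interleaved = ["a0", "b0", "a1", "b1", "a2", "b2", "a3", "b3"]
clustered = ["a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3"]

print(index_bits(docs, interleaved))  # 22
print(index_bits(docs, clustered))    # 12
```

In this toy setting, placing similar documents next to each other shrinks the gaps and therefore the coded index, which is the effect that identifier assignment techniques aim for.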

In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
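
The general shape of such a framework can be sketched as follows. This is an illustrative simplification under stated assumptions, not the algorithms from the paper: it assumes min-hash signatures with banding as the Locality Sensitive Hashing scheme, Jaccard similarity as the edge weight, and a greedy nearest-neighbour walk as a stand-in for the Traveling Salesman computation. All function names and parameters (`minhash_signature`, `sparse_graph`, `greedy_order`, `num_hashes`, `band_size`) are hypothetical, and every document is assumed to have at least one term.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed: int, term: str) -> int:
    """Deterministic 64-bit hash of a term under a given seed."""
    digest = hashlib.md5(f"{seed}:{term}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(terms: set, num_hashes: int) -> tuple:
    """One minimum hash value per seed; similar sets tend to agree on many of them."""
    return tuple(min(_hash(seed, t) for t in terms) for seed in range(num_hashes))


def sparse_graph(docs: dict, num_hashes: int = 16, band_size: int = 2) -> dict:
    """Keep an edge only between documents that collide in at least one band."""
    buckets = defaultdict(set)
    for name, terms in docs.items():
        sig = minhash_signature(terms, num_hashes)
        for start in range(0, num_hashes, band_size):
            buckets[(start, sig[start:start + band_size])].add(name)
    edges = defaultdict(dict)
    for members in buckets.values():
        for u, v in combinations(sorted(members), 2):
            weight = len(docs[u] & docs[v]) / len(docs[u] | docs[v])  # Jaccard
            edges[u][v] = edges[v][u] = weight
    return edges


def greedy_order(docs: dict, edges: dict) -> list:
    """Walk to the most similar unvisited neighbour; restart when stuck."""
    unvisited = set(docs)
    order = []
    for start in sorted(docs):
        current = start if start in unvisited else None
        while current is not None:
            order.append(current)
            unvisited.discard(current)
            candidates = [(w, v) for v, w in edges.get(current, {}).items()
                          if v in unvisited]
            # Highest weight first; ties broken by name for determinism.
            current = (min(candidates, key=lambda c: (-c[0], c[1]))[1]
                       if candidates else None)
    return order


docs = {
    "d0": {"x", "y", "z"}, "d1": {"p", "q"},
    "d2": {"x", "y", "z"}, "d3": {"p", "q"},
}
order = greedy_order(docs, sparse_graph(docs))
print(order)
```

The point of the hashing step is that similarities are computed only for pairs that share a bucket, so the graph handed to the tour computation stays sparse instead of containing all pairs of documents. Position in the returned list would then serve as the new document identifier.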

Index Terms

Scalable techniques for document identifier assignment in inverted indexes
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing

Published in
WWW '10: Proceedings of the 19th international conference on World wide web
April 2010
1407 pages
ISBN: 9781605587998
DOI: 10.1145/1772690
General Chairs: Michael Rappa (North Carolina State University, USA), Paul Jones (University of North Carolina at Chapel Hill, USA)
Program Chairs: Juliana Freire (University of Utah, USA), Soumen Chakrabarti (Indian Institute of Technology, India)
Copyright © 2010 International World Wide Web Conference Committee (IW3C2)
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
documentID reassignment
index compression
Qualifiers
- research-article

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%