Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce

Author:
Jimmy Lin

University of Maryland, College Park, MD, USA

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, July 2009, Pages 155–162, https://doi.org/10.1145/1571941.1571970

Published: 19 July 2009

ABSTRACT

This paper explores the problem of computing pairwise similarity on document collections, focusing on the application of "more like this" queries in the life sciences domain. Three MapReduce algorithms are introduced: one based on brute force, a second where the problem is treated as large-scale ad hoc retrieval, and a third based on the Cartesian product of postings lists. Each algorithm supports one or more approximations that trade effectiveness for efficiency, the characteristics of which are studied experimentally. Results show that the brute force algorithm is the most efficient of the three when exact similarity is desired. However, the other two algorithms support approximations that yield large efficiency gains without significant loss of effectiveness.
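
As an illustration only, the third approach named in the abstract (the Cartesian product of postings lists) can be sketched in a single process. This is a minimal Python sketch under stated assumptions, not the paper's implementation: it assumes raw term-frequency weights and an inner-product similarity, the function names (`build_postings`, `map_postings`, `reduce_pairs`, `pairwise_similarity`) are hypothetical, and the map and reduce phases are simulated with ordinary functions rather than run on Hadoop.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Build an inverted index: term -> list of (doc_id, term_frequency)."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
    return postings


def map_postings(postings):
    """Simulated map phase: for each term, emit a partial score for every
    pair of documents that share it (pairs drawn from its postings list)."""
    for plist in postings.values():
        # Sorting keeps each pair key in a consistent (smaller, larger) order.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2


def reduce_pairs(emitted):
    """Simulated reduce phase: sum the partial scores for each document pair."""
    scores = defaultdict(int)
    for pair, partial in emitted:
        scores[pair] += partial
    return dict(scores)


def pairwise_similarity(docs):
    """Inner-product similarity for all document pairs sharing a term."""
    return reduce_pairs(map_postings(build_postings(docs)))


# Toy collection (hypothetical data, assumed for illustration).
docs = {
    "a": "gene protein gene",
    "b": "gene cell",
    "c": "protein cell cell",
}
scores = pairwise_similarity(docs)
```

Only document pairs that share at least one term ever receive a score. A natural place to experiment with approximations of the kind the abstract mentions would be the postings lists themselves, for example by skipping very long lists; that is an assumption of this sketch, not a description of the paper's method.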

Index Terms

1. Information systems
  1. Information retrieval

Published in
SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
July 2009
896 pages
ISBN: 9781605584836
DOI: 10.1145/1571941
General Chairs: James Allan (University of Massachusetts Amherst, USA), Javed Aslam (Northeastern University, USA)
Program Chairs: Mark Sanderson (University of Sheffield, UK), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA), Justin Zobel (University of Melbourne, Australia)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
distributed algorithms
hadoop
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%