ABSTRACT
This paper explores the problem of computing pairwise similarity on document collections, focusing on the application of "more like this" queries in the life sciences domain. Three MapReduce algorithms are introduced: one based on brute force, a second where the problem is treated as large-scale ad hoc retrieval, and a third based on the Cartesian product of postings lists. Each algorithm supports one or more approximations that trade effectiveness for efficiency, the characteristics of which are studied experimentally. Results show that the brute force algorithm is the most efficient of the three when exact similarity is desired. However, the other two algorithms support approximations that yield large efficiency gains without significant loss of effectiveness.
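As an illustration of the third approach named above, the following is a minimal single-process sketch of pairwise similarity computed from the Cartesian product of postings lists. It is not the paper's implementation: the function names, the raw term-frequency weights, the toy documents, and the `max_df` cutoff (shown here as one possible efficiency-for-effectiveness approximation) are all assumptions made for the example, and the map and reduce phases are simulated with ordinary Python functions rather than a real MapReduce runtime.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Build term -> [(doc_id, term_frequency)], with doc ids in sorted order."""
    postings = defaultdict(list)
    for doc_id, tokens in sorted(docs.items()):
        counts = defaultdict(int)
        for tok in tokens:
            counts[tok] += 1
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
    return postings


def map_postings(plist):
    """Mapper: emit a partial score for every document pair in one postings list."""
    for (d1, w1), (d2, w2) in combinations(plist, 2):
        yield (d1, d2), w1 * w2


def reduce_pairs(emitted):
    """Reducer: sum the partial scores emitted for each document pair."""
    scores = defaultdict(int)
    for pair, partial in emitted:
        scores[pair] += partial
    return dict(scores)


def pairwise_similarity(docs, max_df=None):
    """Inner-product similarity for all document pairs that share a term.

    max_df is a hypothetical approximation knob: terms that occur in more
    than max_df documents are skipped, which shrinks the quadratic blow-up
    from long postings lists at the cost of exactness.
    """
    emitted = []
    for term, plist in build_postings(docs).items():
        if max_df is not None and len(plist) > max_df:
            continue
        emitted.extend(map_postings(plist))
    return reduce_pairs(emitted)


# Toy collection (made up for illustration).
docs = {
    "d1": ["gene", "protein", "cell"],
    "d2": ["gene", "cell", "cell"],
    "d3": ["protein", "binding"],
}
scores = pairwise_similarity(docs)
```

On this toy collection, `scores` holds an entry only for document pairs that share at least one term, which is the property that lets the indexed approach avoid comparing every pair explicitly. The work per term grows quadratically with the length of its postings list, which is why a cutoff on very common terms is a natural place to trade exactness for speed.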
Index Terms
- Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce