ABSTRACT
Text similarity spans a spectrum, with broad topical similarity near one extreme and document identity at the other. Intermediate levels of similarity -- resulting from summarization, paraphrasing, copying, and stronger forms of topical relevance -- are useful for applications such as information flow analysis and question-answering tasks. In this paper, we explore mechanisms for measuring such intermediate kinds of similarity, focusing on the task of identifying where a particular piece of information originated. We consider both sentence-to-sentence and document-to-document comparison, and have incorporated these algorithms into RECAP, a prototype information flow analysis tool. Our experimental results with RECAP indicate that new mechanisms such as those we propose are likely to be more appropriate than existing methods for identifying the intermediate forms of similarity.
Index Terms
- Similarity measures for tracking information flow