Article

Similarity measures for tracking information flow

Authors:
Donald Metzler, University of Massachusetts, Amherst, MA
Yaniv Bernstein, RMIT University, Melbourne, Australia
W. Bruce Croft, University of Massachusetts, Amherst, MA
Alistair Moffat, University of Melbourne, Melbourne, Australia
Justin Zobel, RMIT University, Melbourne, Australia

CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management, October 2005, Pages 517–524. https://doi.org/10.1145/1099554.1099695

Published: 31 October 2005

ABSTRACT

Text similarity spans a spectrum, with broad topical similarity near one extreme and document identity at the other. Intermediate levels of similarity, resulting from summarization, paraphrasing, copying, and stronger forms of topical relevance, are useful for applications such as information flow analysis and question-answering tasks. In this paper, we explore mechanisms for measuring such intermediate kinds of similarity, focusing on the task of identifying where a particular piece of information originated. We consider both sentence-to-sentence and document-to-document comparison, and have incorporated these algorithms into RECAP, a prototype information flow analysis tool. Our experimental results with RECAP indicate that new mechanisms such as those we propose are likely to be more appropriate than existing methods for identifying the intermediate forms of similarity.

Index Terms

1. Information systems
  1. Information retrieval

Published in
CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management
October 2005
854 pages
ISBN: 1595931406
DOI: 10.1145/1099554
General Chair:
Otthein Herzog, University of Bremen, Germany
Program Chairs:
Hans-Jörg Schek, University for Health Sciences, Medical Informatics and Technology, Austria
Norbert Fuhr, University of Duisburg-Essen, Germany
Abdur Chowdhury, America Online, USA
Wilfried Teiken, IBM T.J. Watson Research Center, USA
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information flow
statistical translation
text reuse
Qualifiers
- Article
Conference

Acceptance Rates
CIKM '05 Paper Acceptance Rate: 77 of 425 submissions, 18%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%