DOI: 10.1145/1099554.1099695
Article

Similarity measures for tracking information flow

Published: 31 October 2005

ABSTRACT

Text similarity spans a spectrum, with broad topical similarity near one extreme and document identity at the other. Intermediate levels of similarity, resulting from summarization, paraphrasing, copying, and stronger forms of topical relevance, are useful for applications such as information flow analysis and question-answering tasks. In this paper, we explore mechanisms for measuring such intermediate kinds of similarity, focusing on the task of identifying where a particular piece of information originated. We consider both sentence-to-sentence and document-to-document comparison, and have incorporated these algorithms into RECAP, a prototype information flow analysis tool. Our experimental results with RECAP indicate that new mechanisms such as those we propose are likely to be more appropriate than existing methods for identifying the intermediate forms of similarity.
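
As a rough illustration of sentence-to-sentence comparison, the sketch below scores candidate sentences by the fraction of distinct query words they contain and ranks them as possible sources. This is a generic word-overlap baseline chosen for illustration only; it is not the set of measures proposed in the paper, and the names `tokenize`, `word_overlap`, and `rank_sources` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_overlap(query, candidate):
    """Fraction of distinct query words that also occur in the candidate.

    Illustrative baseline only (an assumption, not the paper's method).
    Returns 0.0 for an empty query.
    """
    query_words = set(tokenize(query))
    candidate_words = set(tokenize(candidate))
    if not query_words:
        return 0.0
    return len(query_words & candidate_words) / len(query_words)


def rank_sources(query, candidates):
    """Sort candidate sentences by descending overlap with the query."""
    return sorted(candidates, key=lambda s: word_overlap(query, s), reverse=True)


if __name__ == "__main__":
    query = "the cat sat on the mat"
    for sentence in rank_sources(query, ["A dog ran.", "The cat slept.", "The cat sat on a mat."]):
        print(round(word_overlap(query, sentence), 2), sentence)
```

A score of 1.0 means every distinct query word appears in the candidate, and 0.0 means the two share no words. A measure this coarse cannot tell a paraphrase from a copy that uses the same vocabulary, which is the kind of intermediate distinction the abstract is concerned with.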

    Published in

      CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management
      October 2005
      854 pages
      ISBN: 1595931406
      DOI: 10.1145/1099554

      Copyright © 2005 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Article

      Acceptance Rates

      CIKM '05 Paper Acceptance Rate: 77 of 425 submissions, 18%
      Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
