Beyond topical similarity: a structural similarity measure for retrieving highly similar documents

  • Regular Paper
Knowledge and Information Systems

Abstract

Accurately measuring document similarity is important for many text applications, such as document similarity search and document recommendation. Most traditional similarity measures are based only on the “bag of words” representation of documents and evaluate topical similarity well. In this paper, we propose the notion of document structural similarity, which further evaluates document similarity by comparing the subtopic structures of documents. Three related factors (the optimal matching factor, the text order factor and the disturbing factor) are proposed and combined to evaluate document structural similarity; the optimal matching factor plays the key role, and the other two factors rely on its results. Experimental results show that the optimal matching factor alone evaluates document topical similarity as well as or better than most popular measures. A user study shows that the overall measure, combining all three factors, is good at finding highly similar documents among topically similar ones, and does so much better than the popular measures and other baseline structural similarity measures.
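To make the idea of an optimal matching between subtopic blocks concrete, the following is a minimal sketch and not the paper's formulation, which this page does not give. It assumes each document has already been segmented into subtopic blocks, represents each block as a bag of words, and searches for the one-to-one pairing of blocks that maximizes total cosine similarity. The function names, the whitespace tokenization and the normalization by the larger block count are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import permutations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def optimal_matching_score(doc_a: list[str], doc_b: list[str]) -> float:
    """Score of the best one-to-one pairing of blocks between two documents.

    Each list element is one pre-segmented subtopic block (assumed input).
    Brute force over permutations, so only suitable for a handful of blocks.
    Dividing by the larger block count is an illustrative assumption.
    """
    bags_a = [Counter(s.lower().split()) for s in doc_a]
    bags_b = [Counter(s.lower().split()) for s in doc_b]
    if not bags_a or not bags_b:
        return 0.0
    # Keep the shorter document on the left so each of its blocks
    # is paired with a distinct block of the longer document.
    if len(bags_a) > len(bags_b):
        bags_a, bags_b = bags_b, bags_a
    best = max(
        sum(cosine(a, bags_b[j]) for a, j in zip(bags_a, perm))
        for perm in permutations(range(len(bags_b)), len(bags_a))
    )
    return best / len(bags_b)


# Two toy documents with the same blocks in a different order.
doc_a = ["cats chase mice", "stocks fell sharply"]
doc_b = ["stocks fell sharply", "cats chase mice"]
print(optimal_matching_score(doc_a, doc_b))
```

Because the pairing ignores block order, reordered but otherwise identical documents score close to the maximum here; a separate order-sensitive term, such as the text order factor named in the abstract, would be needed to tell them apart. For more than a few blocks, a polynomial-time assignment algorithm would replace the permutation search.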

Author information

Corresponding author

Correspondence to Xiaojun Wan.

Additional information

Xiaojun Wan received a B.Sc. degree in information science, an M.Sc. degree in computer science and a Ph.D. degree in computer science from Peking University, Beijing, China, in 2000, 2003 and 2006, respectively. He is currently a lecturer at the Institute of Computer Science and Technology of Peking University. His research interests include information retrieval and natural language processing.

About this article

Cite this article

Wan, X. Beyond topical similarity: a structural similarity measure for retrieving highly similar documents. Knowl Inf Syst 15, 55–73 (2008). https://doi.org/10.1007/s10115-006-0047-1

  • DOI: https://doi.org/10.1007/s10115-006-0047-1
