Beyond topical similarity: a structural similarity measure for retrieving highly similar documents

Wan, Xiaojun

doi:10.1007/s10115-006-0047-1

Regular Paper
Published: 11 October 2006

Volume 15, pages 55–73, (2008)

Knowledge and Information Systems

Xiaojun Wan¹

Abstract

Accurately measuring document similarity is important for many text applications, such as document similarity search and document recommendation. Most traditional similarity measures rely only on the “bag of words” representation of documents and evaluate document topical similarity well. In this paper, we propose the notion of document structural similarity, which further evaluates document similarity by comparing document subtopic structures. Three related factors (the optimal matching factor, the text order factor and the disturbing factor) are proposed and combined to evaluate document structural similarity; the optimal matching factor plays the key role, and the other two factors rely on its results. The experimental results demonstrate that the optimal matching factor performs as well as or better than most popular measures for evaluating document topical similarity. A user study shows that the proposed overall measure, which combines all three factors, is effective at finding highly similar documents among topically similar ones, and does so much better than the popular measures and other baseline structural similarity measures.
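
To make the idea of an optimal matching between subtopic structures more concrete, the following is a minimal sketch and not the paper's implementation. It assumes each document has already been split into subtopic blocks (plain strings here), weights each pair of blocks by bag-of-words cosine similarity, and searches for the one-to-one matching with the largest total weight by brute force. The function names `cosine` and `optimal_matching`, the cosine weighting, and the brute-force search are all illustrative assumptions; the text order factor and the disturbing factor are not sketched because the abstract does not define them.

```python
from collections import Counter
from itertools import permutations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def optimal_matching(blocks_x, blocks_y):
    """Return (total_weight, pairs) for the best one-to-one block matching.

    pairs holds (index_in_blocks_x, index_in_blocks_y) tuples.
    Hypothetical helper: brute force over permutations, so it is only
    practical for a handful of blocks per document.
    """
    # Match the shorter block list into the longer one.
    swapped = len(blocks_x) > len(blocks_y)
    short, long_ = (blocks_y, blocks_x) if swapped else (blocks_x, blocks_y)

    # Assumed edge weight: cosine similarity of lower-cased word counts.
    bags_short = [Counter(b.lower().split()) for b in short]
    bags_long = [Counter(b.lower().split()) for b in long_]
    weights = [[cosine(s, l) for l in bags_long] for s in bags_short]

    best_total, best_perm = -1.0, ()
    for perm in permutations(range(len(long_)), len(short)):
        total = sum(weights[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm

    # Report pairs in (x index, y index) order regardless of swapping.
    pairs = [(j, i) if swapped else (i, j) for i, j in enumerate(best_perm)]
    return best_total, pairs


if __name__ == "__main__":
    doc_x = ["cats chase mice", "stocks fell sharply"]
    doc_y = ["stocks fell today", "cats chase birds"]
    total, pairs = optimal_matching(doc_x, doc_y)
    print(total, pairs)
```

In this toy input the two documents cover the same subtopics in opposite order, so the matching pairs each block with its counterpart rather than with the block in the same position. For realistic block counts, a polynomial-time assignment algorithm such as Kuhn-Munkres would replace the permutation search.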

Author information

Authors and Affiliations

Institute of Computer Science and Technology, Peking University, Beijing, 100871, China
Xiaojun Wan

Corresponding author

Correspondence to Xiaojun Wan.

Additional information

Xiaojun Wan received a B.Sc. degree in information science, an M.Sc. degree in computer science and a Ph.D. degree in computer science from Peking University, Beijing, China, in 2000, 2003 and 2006, respectively. He is currently a lecturer at the Institute of Computer Science and Technology of Peking University. His research interests include information retrieval and natural language processing.

About this article

Cite this article

Wan, X. Beyond topical similarity: a structural similarity measure for retrieving highly similar documents. Knowl Inf Syst 15, 55–73 (2008). https://doi.org/10.1007/s10115-006-0047-1

Received: 12 October 2005
Revised: 06 May 2006
Accepted: 10 August 2006
Published: 11 October 2006
Issue Date: April 2008
DOI: https://doi.org/10.1007/s10115-006-0047-1
