Abstract
Accurately measuring document similarity is important for many text applications, such as document similarity search and document recommendation. Most traditional similarity measures rely only on the “bag of words” representation of documents and evaluate document topical similarity well. In this paper, we propose the notion of document structural similarity, which refines the evaluation of document similarity by comparing the subtopic structures of documents. Three related factors (the optimal matching factor, the text order factor and the disturbing factor) are proposed and combined to evaluate document structural similarity; the optimal matching factor plays the key role, and the other two factors rely on its results. Experimental results show that the optimal matching factor alone evaluates document topical similarity as well as or better than most popular measures. A user study shows that the overall measure combining all three factors is effective at finding highly similar documents among topically similar ones, and that it performs much better than the popular measures and other baseline structural similarity measures.
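To make the idea of an optimal matching between subtopic structures concrete, the following is a minimal sketch, not the measure defined in the paper. It assumes each document has already been split into subtopic segments, scores segment pairs with cosine similarity over raw term counts, and finds the maximum-weight one-to-one matching by brute force. The function names `cosine` and `optimal_matching`, the weighting scheme and the brute-force search are all illustrative assumptions; the text order factor and the disturbing factor are not modeled.

```python
import math
from collections import Counter
from itertools import permutations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def optimal_matching(segs_x, segs_y):
    """Return (total_weight, pairs) for a maximum-weight one-to-one matching.

    pairs holds (index_in_segs_x, index_in_segs_y) tuples.
    Brute force over permutations: suitable only for a handful of segments.
    """
    # Treat the shorter document as the side whose segments are all assigned.
    swapped = len(segs_x) > len(segs_y)
    short, long_ = (segs_y, segs_x) if swapped else (segs_x, segs_y)

    # Assumed representation: lowercase whitespace tokens, raw counts.
    vec_short = [Counter(s.lower().split()) for s in short]
    vec_long = [Counter(s.lower().split()) for s in long_]
    weights = [[cosine(a, b) for b in vec_long] for a in vec_short]

    best_total, best_perm = -1.0, ()
    for perm in permutations(range(len(vec_long)), len(vec_short)):
        total = sum(weights[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm

    pairs = [(j, i) if swapped else (i, j) for i, j in enumerate(best_perm)]
    return best_total, sorted(pairs)


doc_a = ["cats chase mice", "stocks fell sharply"]
doc_b = ["stocks fell today", "cats chase birds"]
total, pairs = optimal_matching(doc_a, doc_b)
print(pairs, round(total, 3))
```

Here the two documents cover the same subtopics in a different order, so the matching pairs each segment with its counterpart rather than with the segment at the same position. For documents with many segments, a polynomial-time assignment algorithm would replace the permutation search.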
Author information
Additional information
Xiaojun Wan received a B.Sc. degree in information science, an M.Sc. degree in computer science and a Ph.D. degree in computer science from Peking University, Beijing, China, in 2000, 2003 and 2006, respectively. He is currently a lecturer at the Institute of Computer Science and Technology of Peking University. His research interests include information retrieval and natural language processing.
About this article
Cite this article
Wan, X. Beyond topical similarity: a structural similarity measure for retrieving highly similar documents. Knowl Inf Syst 15, 55–73 (2008). https://doi.org/10.1007/s10115-006-0047-1
DOI: https://doi.org/10.1007/s10115-006-0047-1