The Similarity Computing of Documents Based on VSM

Guo, Qinglin

doi:10.1007/978-3-540-85693-1_16

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5186)

Included in the following conference series:

International Conference on Network-Based Information Systems

Abstract

The precision and efficiency of document similarity computation underpin other document-processing tasks. This paper improves the DF (document frequency) and TF-IDF algorithms. First, DF feature selection runs in linear time, which suits large document collections, but it has the drawback that some useful features may be deleted; we compensate by adding the counts of words that occur at important places in a document. Second, we adjust the weight of each feature using the result of the feature-selection phase. In this way, we improve the precision of document similarity without adding much time or space complexity.
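
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm, whose formulas are not given on this page. It assumes that the "important place" is the document title, that the selection score is DF plus a title-occurrence bonus, and that the weight adjustment multiplies a smoothed TF-IDF weight by the normalized selection score. The names select_features, weight_vector, cosine, min_score, and title_boost are all hypothetical.

```python
import math
from collections import Counter

def select_features(docs, min_score=2.0, title_boost=1.0):
    """Score terms by document frequency plus a bonus for title occurrences.

    docs is a list of (title_tokens, body_tokens) pairs. The title bonus is an
    assumed stand-in for "words at the important places".
    """
    df = Counter()
    title_hits = Counter()
    for title, body in docs:
        df.update(set(title) | set(body))  # one count per document
        title_hits.update(title)           # occurrences in the title
    scores = {t: df[t] + title_boost * title_hits[t] for t in df}
    return {t: s for t, s in scores.items() if s >= min_score}

def weight_vector(doc, docs, features):
    """TF-IDF over the selected features, scaled by the selection score (assumed)."""
    if not features:
        return {}
    title, body = doc
    tf = Counter(title + body)
    n = len(docs)
    max_score = max(features.values())
    vec = {}
    for term, score in features.items():
        if tf[term] == 0:
            continue
        df = sum(1 for t, b in docs if term in t or term in b)
        idf = math.log((n + 1) / (df + 1)) + 1.0  # smoothed, always positive
        vec[term] = tf[term] * idf * (score / max_score)
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus: two documents about power-grid load, one about text retrieval.
docs = [
    (["power", "grid"], ["grid", "load", "forecast", "model"]),
    (["grid", "load"], ["load", "forecast", "power", "model"]),
    (["text", "retrieval"], ["document", "index", "model", "query"]),
]
features = select_features(docs)
vectors = [weight_vector(d, docs, features) for d in docs]
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy corpus the term "text" appears in only one document, so a plain DF threshold of 2 would discard it; the title bonus keeps it, while "index", which appears once and only in a body, is still dropped. The first two documents come out far more similar to each other than either is to the third.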

Author information

Authors and Affiliations

Department of Computer Science and Technology, North China Electric Power University, 102206, Beijing, China
Qinglin Guo
Department of Computer Science and Technology, Peking University, 100871, Beijing, China
Qinglin Guo

Editor information

Makoto Takizawa, Leonard Barolli, Tomoya Enokido

About this paper

Cite this paper

Guo, Q. (2008). The Similarity Computing of Documents Based on VSM. In: Takizawa, M., Barolli, L., Enokido, T. (eds) Network-Based Information Systems. NBiS 2008. Lecture Notes in Computer Science, vol 5186. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85693-1_16

DOI: https://doi.org/10.1007/978-3-540-85693-1_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85692-4
Online ISBN: 978-3-540-85693-1
eBook Packages: Computer Science, Computer Science (R0)
