The Similarity Computing of Documents Based on VSM

  • Conference paper
Network-Based Information Systems (NBiS 2008)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5186))

Abstract

The precision and efficiency of document similarity computation underpin most other document processing tasks. In this paper, we improve the DF (document frequency) and TF-IDF algorithms. First, DF feature selection runs in linear time, which suits large document collections, but it has the drawback that unusual yet useful features may be deleted; we compensate for this by adding the count of words that occur at important positions in a document. Second, we adjust the weight of each feature using the result of the feature selection phase. In this way, we improve the precision of document similarity without adding much time or space complexity.
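
The abstract does not give the paper's formulas, so the following is only a minimal Python sketch of the general idea under stated assumptions, not the author's method. It assumes that the "important place" is the document title, that the positional bonus is simply added to the DF score (the hypothetical `boost` parameter), and that the TF-IDF weight is rescaled by the normalized selection score. The function names, the `min_score` threshold, and the smoothed IDF are likewise illustrative choices.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc["title"]) | set(doc["body"]))
    return df


def select_features(docs, min_score=2, boost=1):
    """DF thresholding with an assumed bonus for title occurrences.

    score(t) = DF(t) + boost * (number of title occurrences of t).
    Terms scoring below min_score are dropped. Runs in one pass per document.
    """
    df = document_frequency(docs)
    title_count = Counter()
    for doc in docs:
        title_count.update(doc["title"])
    scores = {t: df[t] + boost * title_count[t] for t in df}
    return {t: s for t, s in scores.items() if s >= min_score}


def tfidf_vector(doc, scores, df, n_docs):
    """TF-IDF vector over selected terms, rescaled by the selection score.

    The rescaling factor score / max_score is an assumption standing in
    for the paper's weight correction, which the abstract does not specify.
    """
    if not scores:
        return {}
    max_score = max(scores.values())
    tf = Counter(doc["title"] + doc["body"])
    vec = {}
    for term, freq in tf.items():
        if term in scores:
            idf = math.log((n_docs + 1) / (df[term] + 1)) + 1.0  # smoothed IDF
            vec[term] = freq * idf * (scores[term] / max_score)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With documents given as dicts of token lists, for example `{"title": ["vector", "space"], "body": ["document", "similarity"]}`, a term that appears in only one document but in its title can survive a DF threshold of 2 when `boost=1`, whereas plain DF thresholding (`boost=0`) would discard it.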

Editor information

Makoto Takizawa Leonard Barolli Tomoya Enokido

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Guo, Q. (2008). The Similarity Computing of Documents Based on VSM. In: Takizawa, M., Barolli, L., Enokido, T. (eds) Network-Based Information Systems. NBiS 2008. Lecture Notes in Computer Science, vol 5186. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85693-1_16

  • DOI: https://doi.org/10.1007/978-3-540-85693-1_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85692-4

  • Online ISBN: 978-3-540-85693-1

  • eBook Packages: Computer Science, Computer Science (R0)
