Abstract
The precision and efficiency of document similarity computation underpin other document-processing tasks. In this paper, the DF (document frequency) and TF-IDF algorithms are improved. First, DF has linear time complexity, which suits large document collections, but it can discard rare yet useful features; we compensate by additionally counting words that occur at important positions in a document. Second, we adjust the weight of each feature using the result of the feature-selection phase. In this way, we improve the precision of document similarity without adding much time or space complexity.
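The abstract does not give the exact formulas, so the following is only a minimal sketch of the two ideas under stated assumptions, not the paper's method. It assumes that the "important place" is the document title, that a title occurrence adds a fixed bonus (`title_bonus`) to a term's DF score, and that the TF-IDF weight is rescaled by the normalized selection score. The function names, the `min_score` threshold, and the smoothed IDF are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def select_features(docs, min_score=2, title_bonus=1):
    """DF-based selection in one pass over the corpus.

    Assumption: terms in the title get an extra `title_bonus`, so a rare
    but prominent term can survive the DF threshold.
    """
    score = Counter()
    for doc in docs:
        for term in set(doc["title"]) | set(doc["body"]):
            score[term] += 1              # plain document frequency
        for term in set(doc["title"]):
            score[term] += title_bonus    # assumed boost for an important position
    return {t: s for t, s in score.items() if s >= min_score}


def tfidf_vector(doc, docs, features):
    """TF-IDF over the selected features, rescaled by the selection score."""
    tokens = doc["title"] + doc["body"]
    tf = Counter(t for t in tokens if t in features)
    n = len(docs)
    max_score = max(features.values(), default=1)
    vec = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d["title"] or term in d["body"])
        idf = math.log((n + 1) / (df + 1)) + 1      # smoothed IDF (an assumption)
        vec[term] = count * idf * (features[term] / max_score)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (invented for illustration only).
docs = [
    {"title": ["vector", "space"], "body": ["document", "similarity", "vector"]},
    {"title": ["vector", "model"], "body": ["document", "similarity", "weight"]},
    {"title": ["football"], "body": ["match", "score"]},
]
features = select_features(docs)
vectors = [tfidf_vector(d, docs, features) for d in docs]
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy corpus, "space" appears in only one document but is kept because it occurs in a title, while "weight" (one document, body only) is dropped. Selection stays a single counting pass over the corpus, which is consistent with the linear-time claim in the abstract.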
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Guo, Q. (2008). The Similarity Computing of Documents Based on VSM. In: Takizawa, M., Barolli, L., Enokido, T. (eds) Network-Based Information Systems. NBiS 2008. Lecture Notes in Computer Science, vol 5186. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85693-1_16
DOI: https://doi.org/10.1007/978-3-540-85693-1_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85692-4
Online ISBN: 978-3-540-85693-1
eBook Packages: Computer Science (R0)