Text document clustering using global term context vectors

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Despite its advantages, the traditional vector space model (VSM) representation has known deficiencies that stem from the term independence assumption. The high dimensionality and sparsity of the text feature space, and phenomena such as polysemy and synonymy, can only be handled if there is a way to measure term similarity. Many approaches have been proposed that map document vectors onto a new feature space where learning algorithms can achieve better solutions. This paper presents the global term context vector-VSM (GTCV-VSM) method for text document representation. It is an extension to VSM that: (i) captures local contextual information for each term occurrence in the term sequences of documents; (ii) combines the local contexts of the occurrences of a term to define the global context of that term; (iii) constructs a proper semantic matrix from the global contexts of all terms; and (iv) uses this matrix to linearly map traditional VSM (bag of words, BOW) document vectors onto a ‘semantically smoothed’ feature space where problems such as text document clustering can be solved more efficiently. We present an experimental study demonstrating that clustering results improve when the proposed GTCV-VSM representation is used instead of traditional VSM-based approaches.
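
The four steps listed in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how local contexts are weighted or how the semantic matrix is built, so the fixed symmetric window, the uniform context weights, the cosine-similarity semantic matrix, and all function names below are assumptions made purely for illustration.

```python
import numpy as np


def global_term_context_vectors(docs, vocab, window=2):
    """Steps (i)-(ii): accumulate a local context histogram around every
    term occurrence and sum them per term (assumed: fixed window, uniform
    weights, and the occurrence itself counted in its own context)."""
    index = {t: i for i, t in enumerate(vocab)}
    G = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        ids = [index[t] for t in doc if t in index]
        for pos, term in enumerate(ids):
            lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
            for ctx in ids[lo:hi]:
                G[term, ctx] += 1.0
    return G


def semantic_matrix(G):
    """Step (iii): term-by-term similarity matrix (assumed: cosine
    similarity between global context vectors)."""
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    U = G / norms
    return U @ U.T


def bow_vectors(docs, vocab):
    """Plain term-frequency (BOW) document vectors."""
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for t in doc:
            if t in index:
                X[row, index[t]] += 1.0
    return X


def smooth(X, S):
    """Step (iv): linear map of BOW vectors through the semantic matrix."""
    return X @ S
```

In this sketch, two documents that share no terms have zero BOW similarity, but their smoothed vectors can still overlap when their terms occur in similar contexts elsewhere in the corpus. That is the effect the abstract describes as ‘semantic smoothing’.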

Author information

Corresponding author

Correspondence to Aristidis Likas.

About this article

Cite this article

Kalogeratos, A., Likas, A. Text document clustering using global term context vectors. Knowl Inf Syst 31, 455–474 (2012). https://doi.org/10.1007/s10115-011-0412-6

  • DOI: https://doi.org/10.1007/s10115-011-0412-6
