Abstract
Despite its advantages, the traditional vector space model (VSM) representation has known deficiencies that stem from the term independence assumption. The high dimensionality and sparsity of the text feature space, and phenomena such as polysemy and synonymy, can only be handled if a way is provided to measure term similarity. Many approaches have been proposed that map document vectors onto a new feature space where learning algorithms can achieve better solutions. This paper presents the global term context vector-VSM (GTCV-VSM) method for text document representation. It extends VSM as follows: (i) it captures local contextual information for each term occurrence in the term sequences of documents; (ii) it combines the local contexts of all occurrences of a term to define the global context of that term; (iii) it constructs a proper semantic matrix from the global contexts of all terms; (iv) it uses this matrix to linearly map traditional VSM (Bag of Words, BOW) document vectors onto a ‘semantically smoothed’ feature space where problems such as text document clustering can be solved more efficiently. We present an experimental study demonstrating that clustering results improve when the proposed GTCV-VSM representation is used instead of traditional VSM-based approaches.
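The four steps of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how local contexts are weighted, so the fixed symmetric window, the raw co-occurrence counts, the L2 normalization, and the mapping d' = S d (with row i of S holding the global context vector of term i) are all assumptions, as are the function names and the toy corpus.

```python
from math import sqrt

def build_vocab(docs):
    """Assign each distinct term an index, in sorted order."""
    return {t: i for i, t in enumerate(sorted({t for d in docs for t in d}))}

def global_context_vectors(docs, vocab, window=1):
    """Steps (i)-(ii): accumulate the local context of every term occurrence
    into one global context vector per term.
    ASSUMPTION: a local context is the set of terms within `window`
    positions of the occurrence (the occurrence itself included)."""
    n = len(vocab)
    gtcv = [[0.0] * n for _ in range(n)]
    for doc in docs:
        for pos, term in enumerate(doc):
            lo = max(0, pos - window)
            hi = min(len(doc), pos + window + 1)
            for neighbor in doc[lo:hi]:
                gtcv[vocab[term]][vocab[neighbor]] += 1.0
    # ASSUMPTION: L2-normalize each global context vector.
    for row in gtcv:
        norm = sqrt(sum(x * x for x in row))
        if norm > 0:
            row[:] = [x / norm for x in row]
    return gtcv

def bow_vector(doc, vocab):
    """Traditional VSM (bag of words) term-count vector."""
    vec = [0.0] * len(vocab)
    for term in doc:
        if term in vocab:
            vec[vocab[term]] += 1.0
    return vec

def smooth(bow, semantic_matrix):
    """Step (iv): linear map d' = S d.
    ASSUMPTION: row i of S is the global context vector of term i (step iii)."""
    return [sum(s * x for s, x in zip(row, bow)) for row in semantic_matrix]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

# Hypothetical toy corpus: "car" and "automobile" never co-occur,
# but both appear next to "engine".
corpus = [["car", "engine"], ["automobile", "engine"]]
vocab = build_vocab(corpus)
S = global_context_vectors(corpus, vocab, window=1)

d1 = bow_vector(["car"], vocab)
d2 = bow_vector(["automobile"], vocab)

bow_sim = cosine(d1, d2)                            # no shared term
smoothed_sim = cosine(smooth(d1, S), smooth(d2, S))  # shared context
print(bow_sim, smoothed_sim)
```

In this toy setting the two single-term documents are orthogonal under plain BOW, while the smoothed vectors overlap because both terms share the context term "engine". This is the kind of synonymy effect the semantic matrix is meant to capture. A clustering algorithm would then be run on the smoothed vectors in place of the BOW vectors.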
Cite this article
Kalogeratos, A., Likas, A. Text document clustering using global term context vectors. Knowl Inf Syst 31, 455–474 (2012). https://doi.org/10.1007/s10115-011-0412-6
DOI: https://doi.org/10.1007/s10115-011-0412-6