Text document clustering using global term context vectors

Kalogeratos, Argyris; Likas, Aristidis

doi:10.1007/s10115-011-0412-6

Regular Paper
Published: 28 May 2011

Volume 31, pages 455–474, (2012)
Knowledge and Information Systems

Argyris Kalogeratos¹ &
Aristidis Likas¹

Abstract

Despite the advantages of the traditional vector space model (VSM) representation, it has known deficiencies stemming from the term independence assumption. The high dimensionality and sparsity of the text feature space, and phenomena such as polysemy and synonymy, can only be handled if a way is provided to measure term similarity. Many approaches have been proposed that map document vectors onto a new feature space where learning algorithms can achieve better solutions. This paper presents the global term context vector-VSM (GTCV-VSM) method for text document representation. It is an extension to VSM that: (i) captures local contextual information for each term occurrence in the term sequences of documents; (ii) combines the local contexts of the occurrences of a term to define the global context of that term; (iii) uses the global contexts of all terms to construct a proper semantic matrix; and (iv) uses this matrix to linearly map traditional VSM (Bag of Words, BOW) document vectors onto a ‘semantically smoothed’ feature space where problems such as text document clustering can be solved more efficiently. We present an experimental study demonstrating the improvement of clustering results when the proposed GTCV-VSM representation is used compared with traditional VSM-based approaches.
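
The four steps listed in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact definitions, so every concrete choice here is an assumption made for illustration: the fixed symmetric window (`window`), the unweighted co-occurrence counts used as local contexts, summation as the way local contexts are combined, row-wise L2 normalisation of the semantic matrix, and all function names. It should be read as one plausible instance of the general idea, not as the authors' method.

```python
import numpy as np

def local_contexts(doc, index, window):
    """Step (i), assumed form: for each term occurrence, count the terms
    inside a symmetric window around it (the occurrence itself included)."""
    for pos, term in enumerate(doc):
        ctx = np.zeros(len(index))
        lo, hi = max(0, pos - window), min(len(doc), pos + window + 1)
        for neighbour in doc[lo:hi]:
            ctx[index[neighbour]] += 1.0
        yield index[term], ctx

def semantic_matrix(docs, vocab, window=2):
    """Steps (ii)-(iii), assumed form: sum the local contexts of every
    occurrence of a term into its global context vector, then stack the
    L2-normalised global vectors as the rows of a V x V matrix."""
    index = {t: i for i, t in enumerate(vocab)}
    S = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        for i, ctx in local_contexts(doc, index, window):
            S[i] += ctx
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # terms that never occur keep a zero row
    return S / norms

def bow_vector(doc, vocab):
    """Plain term-frequency (Bag of Words) vector of a tokenised document."""
    index = {t: i for i, t in enumerate(vocab)}
    v = np.zeros(len(vocab))
    for term in doc:
        v[index[term]] += 1.0
    return v

def smooth_documents(docs, vocab, window=2):
    """Step (iv), assumed form: linearly map each BOW vector through the
    semantic matrix, so a document becomes the frequency-weighted sum of
    the global context vectors of its terms."""
    S = semantic_matrix(docs, vocab, window)
    X = np.array([bow_vector(d, vocab) for d in docs])
    return X @ S
```

As a usage example, calling `smooth_documents(docs, vocab)` on a list of tokenised documents returns one row per document, which could then be passed to any standard clustering routine in place of the raw BOW matrix. Under these assumptions, two documents that share no terms can still obtain a non-zero similarity when their terms occur in similar contexts, which is the kind of effect the term independence assumption rules out.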

Author information

Authors and Affiliations

Department of Computer Science, University of Ioannina, 45110, Ioannina, Greece
Argyris Kalogeratos & Aristidis Likas

Authors

Argyris Kalogeratos
Aristidis Likas

Corresponding author

Correspondence to Aristidis Likas.

About this article

Cite this article

Kalogeratos, A., Likas, A. Text document clustering using global term context vectors. Knowl Inf Syst 31, 455–474 (2012). https://doi.org/10.1007/s10115-011-0412-6

Received: 13 May 2010
Revised: 20 December 2010
Accepted: 06 May 2011
Published: 28 May 2011
Issue Date: June 2012
DOI: https://doi.org/10.1007/s10115-011-0412-6
