Abstract
High dimensionality of the feature space is a major obstacle for automated text categorization. Drawing on the characteristics of Chinese character N-grams, this paper shows that feature overlapping gives rise to a kind of redundancy. Focusing on Chinese character bigrams, the paper introduces the concept of δ-overlapping between two bigrams and proposes a new dimensionality reduction method, δ-Overlapped Raising (δ-OR), which raises δ-overlapped bigrams into their corresponding trigrams. The paper further designs a two-stage dimensionality reduction strategy for Chinese bigrams that combines a filtering method based on the Chi-CIG score function with the δ-OR method. Experimental results on a large-scale Chinese document collection indicate that, applied after the first stage of reduction, δ-OR at the second stage can significantly reduce the dimension of the feature space without sacrificing categorization effectiveness. We believe that this methodology would be language-independent.
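The idea of raising overlapped bigrams into trigrams can be illustrated with a minimal Python sketch. The abstract does not give the formal definition of δ-overlapping, so the sketch assumes one: two bigrams c1c2 and c2c3 count as δ-overlapped when the frequency of the trigram c1c2c3, divided by the larger of the two bigram frequencies, is at least δ. The function names `char_ngrams` and `delta_overlapped_raise` and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all character n-grams of `text`, in order of occurrence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def delta_overlapped_raise(docs, delta):
    """Replace strongly overlapped bigram pairs by their joining trigram.

    ASSUMPTION (not from the paper): the overlap degree of the bigrams
    c1c2 and c2c3 is tf(c1c2c3) / max(tf(c1c2), tf(c2c3)), and the pair
    is raised when this ratio is >= delta.
    """
    bigram_tf = Counter()
    trigram_tf = Counter()
    for doc in docs:
        bigram_tf.update(char_ngrams(doc, 2))
        trigram_tf.update(char_ngrams(doc, 3))

    raised = {}       # trigram -> (left bigram, right bigram)
    absorbed = set()  # bigrams dropped because a trigram replaces them
    for tri, tf in trigram_tf.items():
        left, right = tri[:2], tri[1:]
        ratio = tf / max(bigram_tf[left], bigram_tf[right])
        if ratio >= delta:
            raised[tri] = (left, right)
            absorbed.update((left, right))

    # Reduced feature set: surviving bigrams plus the raised trigrams.
    features = (set(bigram_tf) - absorbed) | set(raised)
    return features, raised


docs = ["计算机科学", "计算机技术"]
features, raised = delta_overlapped_raise(docs, delta=0.8)
```

On this toy corpus the six distinct bigrams collapse into three trigram features, because each raised trigram accounts for all occurrences of both of its component bigrams. A threshold that no pair can reach (any δ above 1 under the assumed ratio) leaves the bigram feature set unchanged.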
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xue, D., Sun, M. (2004). Raising High-Degree Overlapped Character Bigrams into Trigrams for Dimensionality Reduction in Chinese Text Categorization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_72
DOI: https://doi.org/10.1007/978-3-540-24630-5_72
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive