Abstract
Text Categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. Many learning approaches have been applied to TC and have proved effective. Nevertheless, TC still poses many challenges to machine learning. In this paper, we propose integrating external lexical information from WordNet to supplement the training data of a semi-supervised clustering algorithm, which can learn from both training and test documents in order to classify new, unseen documents. This algorithm is the “Semi-Supervised Fuzzy c-Means” (ssFCM). Our experiments use the Reuters-21578 collection and consist of binary classifications for categories selected from its 115 TOPICS classes. Using the Vector Space Model, each document is represented by its original feature vector augmented with an external feature vector generated using WordNet. We verify experimentally that integrating WordNet helps ssFCM improve its performance, effectively addresses the classification of documents into categories with few training documents, and does not interfere with the use of training data.
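The general idea can be illustrated with a minimal sketch. It is not the authors' implementation: it shows a simplified partially supervised fuzzy c-means in which labeled documents keep crisp memberships and only the memberships of unlabeled documents are re-estimated. The function names, the `weight` parameter for labeled documents, the initialisation from class means, and the toy vectors are all assumptions made for illustration. How the external WordNet vector is generated is not shown; it is simply taken as given and concatenated to the original vector.

```python
def augment(original, external):
    """Concatenate a document's original term vector with its
    external (e.g. WordNet-derived) feature vector."""
    return list(original) + list(external)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fcm_memberships(x, centers, m=2.0):
    """Standard fuzzy c-means membership of point x in each cluster."""
    d = [sq_dist(x, c) for c in centers]
    if any(di == 0.0 for di in d):
        # A point lying exactly on a prototype belongs fully to it.
        hits = [1.0 if di == 0.0 else 0.0 for di in d]
        return [h / sum(hits) for h in hits]
    inv = [(1.0 / di) ** (1.0 / (m - 1.0)) for di in d]
    total = sum(inv)
    return [v / total for v in inv]


def semi_supervised_fcm(labeled, labels, unlabeled,
                        n_clusters=2, m=2.0, weight=1.0, iters=50):
    """Simplified partially supervised fuzzy c-means (illustrative only).

    Assumes every class has at least one labeled document.
    `weight` (hypothetical) scales the influence of labeled documents.
    """
    dim = len(labeled[0])
    # Labeled documents keep crisp (0/1) memberships throughout.
    u_lab = [[1.0 if c == y else 0.0 for c in range(n_clusters)]
             for y in labels]
    # Initialise each prototype as the mean of its labeled documents.
    centers = []
    for c in range(n_clusters):
        members = [x for x, y in zip(labeled, labels) if y == c]
        centers.append([sum(col) / len(members) for col in zip(*members)])

    for _ in range(iters):
        # Re-estimate memberships of unlabeled documents only.
        u_unl = [fcm_memberships(x, centers, m) for x in unlabeled]
        new_centers = []
        for c in range(n_clusters):
            num, den = [0.0] * dim, 0.0
            for x, u in zip(labeled, u_lab):
                w = weight * u[c] ** m
                num = [n + w * xi for n, xi in zip(num, x)]
                den += w
            for x, u in zip(unlabeled, u_unl):
                w = u[c] ** m
                num = [n + w * xi for n, xi in zip(num, x)]
                den += w
            new_centers.append([n / den for n in num])
        centers = new_centers

    return centers, [fcm_memberships(x, centers, m) for x in unlabeled]


# Toy binary task: two original term weights plus one external feature.
labeled = [augment([1.0, 0.0], [1.0]), augment([0.0, 1.0], [0.0])]
labels = [0, 1]
unlabeled = [augment([0.9, 0.1], [1.0]), augment([0.1, 0.9], [0.0])]
centers, memberships = semi_supervised_fcm(labeled, labels, unlabeled)
predicted = [max(range(2), key=lambda c: u[c]) for u in memberships]
```

In this sketch an unlabeled document is assigned to the category with the highest membership. Raising `weight` above 1 would give labeled documents more influence on the prototypes, which is one plausible way to compensate for categories with few training documents.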
Cite this article
Benkhalifa, M., Mouradi, A. & Bouyakhf, H. Integrating External Knowledge to Supplement Training Data in Semi-Supervised Learning for Text Categorization. Information Retrieval 4, 91–113 (2001). https://doi.org/10.1023/A:1011458711300
DOI: https://doi.org/10.1023/A:1011458711300