Integrating External Knowledge to Supplement Training Data in Semi-Supervised Learning for Text Categorization

Benkhalifa, Mohammed; Mouradi, Abdelhak; Bouyakhf, Houssaine

doi:10.1023/A:1011458711300

Published: July 2001

Information Retrieval, Volume 4, pages 91–113 (2001)

Mohammed Benkhalifa¹,
Abdelhak Mouradi² &
Houssaine Bouyakhf³

Abstract

Text Categorization (TC) is the automated assignment of text documents to predefined categories based on document content. Many machine learning approaches have been applied to TC and have proven effective. Nevertheless, TC still poses many challenges for machine learning. In this paper, we propose integrating external lexical information from WordNet to supplement the training data of a semi-supervised clustering algorithm, which learns from both training and test documents in order to classify new, unseen documents. This algorithm is the “Semi-Supervised Fuzzy c-Means” (ssFCM). Our experiments use the Reuters-21578 collection and consist of binary classifications for categories selected from its 115 TOPICS classes. Using the Vector Space Model, each document is represented by its original feature vector augmented with an external feature vector generated using WordNet. We verify experimentally that integrating WordNet improves the performance of ssFCM, effectively addresses the classification of documents into categories with few training documents, and does not interfere with the use of training data.
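
The representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the external vector is built, so the per-category word sets below are hand-written stand-ins for WordNet-derived terms, and all function names are hypothetical. The membership function is the standard (unsupervised) fuzzy c-means formula; the ssFCM algorithm additionally exploits labeled documents, which this sketch leaves out.

```python
from collections import Counter
import math


def term_vector(tokens, vocabulary):
    """Term-frequency vector of a tokenized document over a fixed vocabulary."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]


def external_vector(tokens, category_lexicons):
    """One feature per category: the number of tokens found in that category's
    related-word set (an assumed stand-in for WordNet-derived terms)."""
    return [float(sum(1 for t in tokens if t in lexicon))
            for lexicon in category_lexicons]


def augment(tokens, vocabulary, category_lexicons):
    """Concatenate the original feature vector with the external one."""
    return term_vector(tokens, vocabulary) + external_vector(tokens, category_lexicons)


def fcm_memberships(x, prototypes, m=2.0):
    """Standard fuzzy c-means membership of point x in each cluster,
    with fuzzifier m > 1 and Euclidean distance."""
    dists = [math.dist(x, p) for p in prototypes]
    # A point that coincides with a prototype belongs fully to it.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]


# Hypothetical toy data: three vocabulary terms and two categories.
vocabulary = ["wheat", "price", "bank"]
lexicons = [{"wheat", "grain", "cereal"}, {"bank", "money", "loan"}]
document = ["wheat", "price", "grain", "cereal"]

features = augment(document, vocabulary, lexicons)
memberships = fcm_memberships([0.0, 0.0], [[1.0, 0.0], [3.0, 0.0]])
```

In this toy example the document has no vocabulary match for "grain" or "cereal", yet the external features still register three hits for the first category, which is the kind of supplementary signal that may help when a category has few training documents.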

Author information

Authors and Affiliations

School of Science and Engineering, Al Akhawayn University in Ifrane (AUI), Av. Hassan II, Ifrane, 53000, Morocco
Mohammed Benkhalifa
Ecole Nationale Supérieure d'Informatique et d'Analyse des Systèmes (ENSIAS), Mohammed V University, Agdal Rabat
Abdelhak Mouradi
Computer Science Department, Mohammed V University, Faculty of Sciences in Rabat, Morocco
Houssaine Bouyakhf

About this article

Cite this article

Benkhalifa, M., Mouradi, A. & Bouyakhf, H. Integrating External Knowledge to Supplement Training Data in Semi-Supervised Learning for Text Categorization. Information Retrieval 4, 91–113 (2001). https://doi.org/10.1023/A:1011458711300

Issue Date: July 2001
DOI: https://doi.org/10.1023/A:1011458711300
