ABSTRACT
We present a simple method for language-independent and task-independent text categorization learning, based on character-level n-gram language models. Our approach uses simple information-theoretic principles and achieves effective performance across a variety of languages and tasks without requiring feature selection or extensive pre-processing. To demonstrate the language and task independence of the proposed technique, we present experimental results on several languages (Greek, English, Chinese, and Japanese) in several text categorization problems: language identification, authorship attribution, text genre classification, and topic detection. Our experimental results show that this simple approach achieves state-of-the-art performance in each case.
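As a rough illustration of the general idea, the sketch below trains one character-level n-gram model per category and assigns a text to the category whose model gives it the highest log-likelihood. It is a minimal sketch, not the authors' implementation: the class name `CharNgramModel`, the function `classify`, the choice of n = 3, the add-one smoothing, and the two toy training strings are all assumptions made for illustration.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram model with add-one smoothing (an illustrative assumption)."""

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of (context + next character)
        self.contexts = Counter()  # counts of each (n-1)-character context
        self.vocab = set()         # characters seen in training

    def _pad(self, text):
        # Prefix with start markers so the first characters have a context.
        return "\x02" * (self.n - 1) + text

    def train(self, text):
        padded = self._pad(text)
        self.vocab.update(text)
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            self.ngrams[context + padded[i]] += 1
            self.contexts[context] += 1

    def log_prob(self, text):
        # Sum of smoothed log-probabilities of each character given its context.
        padded = self._pad(text)
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            num = self.ngrams[context + padded[i]] + 1
            den = self.contexts[context] + v
            total += math.log(num / den)
        return total


def classify(models, text):
    # Pick the category whose model assigns the text the highest log-likelihood.
    return max(models, key=lambda label: models[label].log_prob(text))


# Toy training data (hypothetical; real use would train on a labeled corpus).
training = {
    "en": "the quick brown fox jumps over the lazy dog",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
}
models = {}
for label, sample in training.items():
    models[label] = CharNgramModel(n=3)
    models[label].train(sample)

print(classify(models, "the lazy dog"))
print(classify(models, "den faulen hund"))
```

Because the model works directly on characters, nothing in the sketch depends on word segmentation or on a hand-picked feature set, so the same code can be applied to other categories such as authors, genres, or topics.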