DOI: 10.3115/1073445.1073470

Language and task independent text categorization with simple language models

Published: 27 May 2003

ABSTRACT

We present a simple method for language-independent and task-independent text categorization learning, based on character-level n-gram language models. Our approach uses simple information-theoretic principles and achieves effective performance across a variety of languages and tasks without requiring feature selection or extensive pre-processing. To demonstrate the language and task independence of the proposed technique, we present experimental results on several languages (Greek, English, Chinese, and Japanese) in several text categorization problems (language identification, authorship attribution, text genre classification, and topic detection). Our experimental results show that the simple approach achieves state-of-the-art performance in each case.
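As a rough illustration of the idea in the abstract, the following minimal sketch trains one character-level n-gram model per category and assigns a document to the category whose model gives it the lowest cross-entropy. It is not the authors' implementation: the names `CharNGramModel`, `train_models`, and `classify` are hypothetical, add-one smoothing is an assumed stand-in for whatever smoothing the paper uses, and the two training sentences are toy data chosen only to make the example run.

```python
import math
from collections import Counter


class CharNGramModel:
    """Character n-gram model with add-one smoothing (an assumed, simple choice)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()    # counts of context + next character
        self.context_counts = Counter()  # counts of the context alone
        self.vocab = set()               # characters seen in training

    def _padded(self, text):
        # Pad with a start symbol so the first characters also have a context.
        return "\x02" * (self.n - 1) + text

    def train(self, text):
        padded = self._padded(text)
        self.vocab.update(text)
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            self.ngram_counts[context + padded[i]] += 1
            self.context_counts[context] += 1

    def cross_entropy(self, text):
        """Average negative log2-probability per character of `text`."""
        padded = self._padded(text)
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            numerator = self.ngram_counts[context + padded[i]] + 1
            denominator = self.context_counts[context] + v
            total -= math.log2(numerator / denominator)
        return total / max(len(text), 1)


def train_models(labeled_texts, n=3):
    """Build one model per label from (label, text) pairs."""
    models = {}
    for label, text in labeled_texts:
        models.setdefault(label, CharNGramModel(n)).train(text)
    return models


def classify(models, text):
    """Return the label whose model assigns `text` the lowest cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


# Toy training data (illustrative only, not from the paper).
TRAINING = [
    ("english", "the quick brown fox jumps over the lazy dog and the cat sat on the mat"),
    ("german", "der schnelle braune fuchs springt und der faule hund und die katze sitzen auf der matte"),
]

models = train_models(TRAINING, n=3)
print(classify(models, "the cat and the dog"))
print(classify(models, "der hund und die katze"))
```

Because the models work on raw characters, the same code applies unchanged to any language or labeling scheme; only the training pairs differ. No feature selection or tokenization step is involved, which mirrors the claim in the abstract.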


Published in

    NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
    May 2003
    293 pages

    Publisher: Association for Computational Linguistics, United States


    Acceptance Rates

    Overall Acceptance Rate: 21 of 29 submissions, 72%
