Language and task independent text categorization with simple language models

Authors:
Fuchun Peng, Dale Schuurmans, Shaojun Wang

University of Waterloo, Waterloo, Ontario, Canada
NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
May 2003, Pages 110–117
https://doi.org/10.3115/1073445.1073470

Published: 27 May 2003

ABSTRACT

We present a simple method for language-independent and task-independent text categorization, based on character-level n-gram language models. Our approach uses simple information-theoretic principles and achieves effective performance across a variety of languages and tasks without requiring feature selection or extensive pre-processing. To demonstrate the language and task independence of the proposed technique, we present experimental results on several languages (Greek, English, Chinese, and Japanese) in several text categorization problems (language identification, authorship attribution, text genre classification, and topic detection). Our experimental results show that the simple approach achieves state-of-the-art performance in each case.
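The general idea of categorizing text with character-level n-gram language models can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `CharNgramClassifier`, the default order `n=3`, add-one smoothing, and the uniform prior over categories are all assumptions made here for illustration, since the abstract does not specify these details. One model is trained per category, and a document is assigned to the category whose model gives it the highest log-likelihood.

```python
import math
from collections import Counter, defaultdict


class CharNgramClassifier:
    """One character n-gram model per category (illustrative sketch only)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # category -> (context, char) counts
        self.context_counts = defaultdict(Counter)  # category -> context counts
        self.vocab = set()                          # characters seen in training

    def _ngrams(self, text):
        # Pad with spaces so the first characters also get a full-length context.
        padded = " " * (self.n - 1) + text
        for i in range(len(text)):
            yield padded[i:i + self.n - 1], padded[i + self.n - 1]

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            for context, char in self._ngrams(text):
                self.ngram_counts[label][(context, char)] += 1
                self.context_counts[label][context] += 1
                self.vocab.add(char)
        return self

    def log_likelihood(self, text, label):
        # Add-one smoothing is an assumption of this sketch;
        # the extra +1 reserves probability mass for unseen characters.
        v = len(self.vocab) + 1
        total = 0.0
        for context, char in self._ngrams(text):
            num = self.ngram_counts[label][(context, char)] + 1
            den = self.context_counts[label][context] + v
            total += math.log(num / den)
        return total

    def predict(self, text):
        # Uniform prior over categories (assumed): pick the highest likelihood.
        return max(self.ngram_counts,
                   key=lambda label: self.log_likelihood(text, label))
```

Because the model works on raw characters, the same code applies unchanged to languages without whitespace word boundaries, such as Chinese or Japanese, and no tokenization or feature selection step is needed. Calling `fit` with a list of documents and a parallel list of category labels, then `predict` on a new string, is all the interface this sketch offers.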


Published in
NAACL '03, May 2003, 293 pages
Program Chairs: Marti Hearst, Mari Ostendorf
Publisher: Association for Computational Linguistics, United States
Overall Acceptance Rate: 21 of 29 submissions, 72%
