research-article

Tokenizing micro-blogging messages using a text classification approach

Authors:
Gustavo Laboreiro
LIACC - Faculdade de Engenharia da Faculdade do Porto, Porto, Portugal

Luís Sarmento
Labs SAPO and LIACC - Faculdade de Engenharia da Faculdade do Porto, Porto, Portugal

Jorge Teixeira
Labs SAPO and LIACC - Faculdade de Engenharia da Faculdade do Porto, Porto, Portugal

Eugénio Oliveira
LIACC - Faculdade de Engenharia da Faculdade do Porto, Porto, Portugal

AND '10: Proceedings of the fourth workshop on Analytics for noisy unstructured text data, October 2010, Pages 81–88, https://doi.org/10.1145/1871840.1871853

Published: 26 October 2010

ABSTRACT

The automatic processing of microblogging messages can be problematic, even for very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(ò_ó)", "(=^-^=)"), non-standard letter casing (e.g. "dr. Fred") and unusual punctuation (e.g. ".... ..", "!??!!!?", ",,,"). Additionally, spelling errors are abundant (e.g. "I;m"), and a single short message frequently mixes more than one language, each with different tokenization requirements. To be effective in such an environment, manually developed rule-based tokenizers have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages, which addresses complex cases successfully and is relatively simple to set up and maintain. To that end, we created a corpus of 2500 manually tokenized Twitter messages (a task that is simple for human annotators) and trained an SVM classifier to separate tokens at certain discontinuity characters. For comparison, we created a baseline rule-based system designed specifically to deal with typical problematic situations. Results show that the classification-based approach achieves an F-measure of 96%, well above the performance of the baseline rule-based tokenizer (85%). Subsequent analysis also allowed us to identify typical tokenization errors, which we show can be partially solved by adding descriptive examples to the training corpus and re-training the classifier.
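
The idea of treating tokenization as a classification problem can be illustrated with a minimal sketch. The abstract states only that a classifier decides whether to separate tokens at certain discontinuity characters; everything else here is an assumption made for illustration: the definition of a discontinuity character (anything that is neither alphanumeric nor whitespace), the two-character feature window, and the names `is_discontinuity`, `gap_features`, `tokenize` and `toy_should_split`. In particular, `toy_should_split` is a hand-written placeholder standing in for the trained SVM, not the authors' model.

```python
PAD = "<pad>"  # assumed marker for positions outside the message


def is_discontinuity(ch: str) -> bool:
    # Assumption: a discontinuity character is neither alphanumeric nor whitespace.
    return not (ch.isalnum() or ch.isspace())


def gap_features(text: str, i: int, window: int = 2) -> dict:
    # Describe the gap just before text[i] by the characters around it.
    # Keys look like "c-2", "c-1", "c+0", "c+1" for window=2.
    feats = {}
    for offset in range(-window, window):
        j = i + offset
        feats[f"c{offset:+d}"] = text[j] if 0 <= j < len(text) else PAD
    return feats


def is_candidate_gap(text: str, i: int) -> bool:
    # A split is only considered next to a discontinuity character,
    # and never at a position adjacent to whitespace.
    if i == 0 or text[i - 1].isspace() or text[i].isspace():
        return False
    return is_discontinuity(text[i - 1]) or is_discontinuity(text[i])


def tokenize(text: str, should_split) -> list:
    # Whitespace always separates tokens; at every candidate gap the
    # classifier (any callable taking a feature dict) makes the decision.
    tokens, current = [], ""
    for i, ch in enumerate(text):
        if ch.isspace():
            if current:
                tokens.append(current)
                current = ""
            continue
        if current and is_candidate_gap(text, i) and should_split(gap_features(text, i)):
            tokens.append(current)
            current = ""
        current += ch
    if current:
        tokens.append(current)
    return tokens


def toy_should_split(feats: dict) -> bool:
    # Placeholder for a trained classifier (hypothetical rules, not learned).
    left, right = feats["c-1"], feats["c+0"]
    if left == right:               # keep runs such as "!!!" together
        return False
    if "'" in (left, right):        # keep contractions such as "don't" together
        return False
    return True


print(tokenize("gr8 day!!! don't stop", toy_should_split))
```

In a real setup the placeholder would be replaced by a classifier trained on feature dictionaries extracted from manually tokenized messages, with one positive or negative example per candidate gap. Separating feature extraction from the split decision keeps the tokenizer unchanged when the classifier is retrained on additional examples.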


Index Terms

Tokenizing micro-blogging messages using a text classification approach
1. Information systems
  1. Information retrieval
    1. Document representation


Published in
AND '10: Proceedings of the fourth workshop on Analytics for noisy unstructured text data
October 2010
96 pages
ISBN:9781450303767
DOI:10.1145/1871840
Program Chairs:
Roberto Basili, University of Rome, Italy
Daniel Lopresti, Lehigh University, USA
Christoph Ringlstetter, University of Munich, Germany
Shourya Roy, Xerox India Innovation Hub, India
Klaus U. Schulz, University of Munich, Germany
L. Venkata Subramaniam, IBM Research, India
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
corpus
micro-blogging
text pre-processing
tokenization
twitter
user-generated content
Acceptance Rates
Overall Acceptance Rate: 15 of 22 submissions, 68%