ABSTRACT
The automatic processing of microblogging messages can be problematic, even for very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(ò_ó)", "(=^-^=)"), non-standard letter casing (e.g. "dr. Fred") and unusual punctuation (e.g. ".... ..", "!??!!!?", ",,,"). Additionally, spelling errors are abundant (e.g. "I;m"), and a single short message frequently mixes more than one language, each with different tokenization requirements. To be effective in such an environment, manually developed rule-based tokenizers have to handle many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages that addresses complex cases successfully and is relatively simple to set up and maintain. To this end, we created a corpus of 2500 manually tokenized Twitter messages (a task that is simple for human annotators) and trained an SVM classifier to decide whether to separate tokens at certain discontinuity characters. For comparison, we created a baseline rule-based system designed specifically to deal with typical problematic situations. Results show that the classification-based approach achieves an F-measure of 96%, well above the performance of the baseline rule-based tokenizer (85%). Subsequent analysis also allowed us to identify typical tokenization errors, which we show can be partially solved by adding a few descriptive examples to the training corpus and re-training the classifier.
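The framing described in the abstract, tokenization as a binary split/no-split decision at each discontinuity character, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the definition of a discontinuity character, the two-character feature window, and all function names are hypothetical, and a majority-vote lookup table stands in for the SVM so that the sketch needs only the standard library.

```python
from collections import Counter, defaultdict

def is_discontinuity(ch):
    # Assumption: any character that is neither alphanumeric nor whitespace.
    return not ch.isalnum() and not ch.isspace()

def char_class(ch):
    # Coarse classes for letters and digits; other characters are kept as-is.
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return ch

def features(s, i):
    # Describe the decision point between s[i-1] and s[i] (hypothetical features).
    return (char_class(s[i - 1]), char_class(s[i]))

def candidate_points(s):
    # Offsets between two non-space characters where at least one of them
    # is a discontinuity character. Whitespace always separates tokens.
    return [i for i in range(1, len(s))
            if not s[i - 1].isspace() and not s[i].isspace()
            and (is_discontinuity(s[i - 1]) or is_discontinuity(s[i]))]

def training_examples(text, gold_tokens):
    # Align the manual tokenization with the raw text to label each candidate.
    starts, pos = set(), 0
    for tok in gold_tokens:
        pos = text.index(tok, pos)
        starts.add(pos)
        pos += len(tok)
    return [(features(text, i), i in starts) for i in candidate_points(text)]

def train(examples):
    # Stand-in for the SVM: remember the majority label per feature tuple.
    votes = defaultdict(Counter)
    for feats, label in examples:
        votes[feats][label] += 1
    return {feats: c.most_common(1)[0][0] for feats, c in votes.items()}

def tokenize(text, model):
    # Cut where the model says "split" (unseen contexts default to no split),
    # then split each resulting piece on whitespace.
    cuts = [i for i in candidate_points(text)
            if model.get(features(text, i), False)]
    tokens, start = [], 0
    for i in cuts:
        tokens.extend(text[start:i].split())
        start = i
    tokens.extend(text[start:].split())
    return tokens

# Toy annotated corpus (invented for illustration, not from the paper's data).
corpus = [
    ("I;m gr8 :-)!!!", ["I;m", "gr8", ":-)", "!!!"]),
    ("ok!!! tl;dr", ["ok", "!!!", "tl;dr"]),
]
examples = [ex for text, gold in corpus for ex in training_examples(text, gold)]
model = train(examples)
print(tokenize("u;r l8 :-)!! wow!!", model))
```

In this sketch, correcting a recurring error amounts to appending a few annotated messages to `corpus` and calling `train` again, which mirrors the maintenance workflow the abstract describes. A real system would use richer context features and an actual SVM in place of the lookup table.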
Index Terms
- Tokenizing micro-blogging messages using a text classification approach