DOI: 10.1145/1871840.1871853
research-article

Tokenizing micro-blogging messages using a text classification approach

Published: 26 October 2010

ABSTRACT

The automatic processing of microblogging messages can be problematic, even for very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(ò_ó)", "(=^-^=)"), non-standard letter casing (e.g. "dr. Fred") and unusual punctuation (e.g. ".... ..", "!??!!!?", ",,,"). Additionally, spelling errors are abundant (e.g. "I;m"), and a single short message frequently contains more than one language, each with different tokenization requirements. To be effective in such an environment, manually developed rule-based tokenizers have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages, which addresses complex cases successfully and is relatively simple to set up and maintain. To that end, we created a corpus of 2500 manually tokenized Twitter messages (a task that is simple for human annotators) and trained an SVM classifier to separate tokens at certain discontinuity characters. For comparison, we created a baseline rule-based system designed specifically to deal with typical problematic situations. Results show that the classification-based approach achieves an F-measure of 96%, well above the performance of the baseline rule-based tokenizer (85%). Subsequent analysis also allowed us to identify typical tokenization errors, which we show can be partially resolved by adding descriptive examples to the training corpus and re-training the classifier.
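
To make the idea of classifying split points more concrete, the following Python sketch shows one way a manually tokenized message could be turned into labeled decision points. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define "discontinuity characters" or the feature set, so the candidate rule (a boundary next to a non-alphanumeric character), the character-window features, and all function names here are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Features = Dict[str, str]


def candidate_points(text: str) -> List[int]:
    """Return indices i where a split between text[i-1] and text[i] is undecided.

    Assumption (not from the paper): a point is a candidate when neither
    neighbour is whitespace and at least one neighbour is not alphanumeric.
    """
    points = []
    for i in range(1, len(text)):
        left, right = text[i - 1], text[i]
        if left.isspace() or right.isspace():
            continue  # whitespace always separates tokens; nothing to decide
        if not (left.isalnum() and right.isalnum()):
            points.append(i)
    return points


def features(text: str, i: int, window: int = 3) -> Features:
    """Describe a candidate point by the characters around it (hypothetical features)."""
    feats = {}
    for offset in range(-window, window):
        j = i + offset
        feats[f"char[{offset}]"] = text[j] if 0 <= j < len(text) else "<pad>"
    return feats


def gold_starts(text: str, tokens: List[str]) -> Set[int]:
    """Character offsets at which each manually annotated token begins."""
    starts, pos = set(), 0
    for tok in tokens:
        start = text.index(tok, pos)
        starts.add(start)
        pos = start + len(tok)
    return starts


def training_examples(text: str, tokens: List[str]) -> List[Tuple[Features, bool]]:
    """Pair each candidate point with a label: True means 'split here'."""
    starts = gold_starts(text, tokens)
    return [(features(text, i), i in starts) for i in candidate_points(text)]


def tokenize(text: str, should_split: Callable[[Features], bool]) -> List[str]:
    """Split on whitespace and on every candidate point the classifier accepts."""
    cuts = {i for i in candidate_points(text) if should_split(features(text, i))}
    tokens, current = [], ""
    for i, ch in enumerate(text):
        if ch.isspace() or i in cuts:
            if current:
                tokens.append(current)
            current = ""
        if not ch.isspace():
            current += ch
    if current:
        tokens.append(current)
    return tokens
```

In a setup like the one the abstract describes, the `(features, label)` pairs from many annotated messages would be vectorized and used to train an SVM, and the trained model's prediction would play the role of `should_split`. Any binary classifier fits the same interface, which is what makes this kind of approach easy to retrain when new examples are added to the corpus.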


Published in

AND '10: Proceedings of the fourth workshop on Analytics for noisy unstructured text data
October 2010
96 pages
ISBN: 9781450303767
DOI: 10.1145/1871840

Copyright © 2010 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 15 of 22 submissions, 68%
