Abstract
Language usage over computer-mediated discourse, such as chats, emails and SMS texts, differs significantly from the standard form of the language and is referred to as texting language (TL). The presence of intentional misspellings significantly decreases the accuracy of existing spell-checking techniques for TL words. In this work, we formally investigate the nature and types of compression used in SMS texts, and develop a Hidden Markov Model based word model for TL. The model parameters are estimated through standard machine learning techniques from a word-aligned parallel corpus of SMS and standard English. The accuracy of the model in correcting TL words is 57.7%, which is almost a threefold improvement over the performance of Aspell. The use of a simple bigram language model results in a 35% reduction of the relative word-level error rate.
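As a rough illustration of how a word model and a bigram language model can be combined, the sketch below runs a Viterbi search over candidate standard words for each TL token. It is a minimal sketch under stated assumptions, not the authors' implementation: the `CHANNEL` lookup table stands in for the HMM-based word model, and every probability, the `FLOOR` back-off value, and the names `CHANNEL`, `BIGRAM` and `decode` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical scores P(texting form | standard word). Toy values only,
# standing in for the HMM word model; not estimated from any corpus.
CHANNEL = {
    "c": {"see": 0.6, "sea": 0.4},
    "u": {"you": 0.9, "u": 0.1},
    "2moro": {"tomorrow": 1.0},
}

# Hypothetical bigram probabilities P(word | previous word); "<s>" marks the start.
BIGRAM = {
    ("<s>", "see"): 0.3, ("<s>", "sea"): 0.1,
    ("see", "you"): 0.5, ("sea", "you"): 0.01,
    ("see", "u"): 0.001, ("sea", "u"): 0.001,
    ("you", "tomorrow"): 0.2, ("u", "tomorrow"): 0.05,
}
FLOOR = 1e-6  # probability assigned to unseen bigrams (an assumption)


def decode(tokens):
    """Viterbi search for the standard-word sequence that maximises
    the sum of log P(token | word) + log P(word | previous word)."""
    # best[word] = (log score of the best path ending in word, that path)
    best = {"<s>": (0.0, [])}
    for tok in tokens:
        # Tokens without an entry are passed through unchanged.
        candidates = CHANNEL.get(tok, {tok: 1.0})
        nxt = {}
        for word, p_channel in candidates.items():
            # Pick the best predecessor under the bigram model.
            score, path = max(
                (prev_score + math.log(BIGRAM.get((prev, word), FLOOR)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            nxt[word] = (score + math.log(p_channel), path + [word])
        best = nxt
    return max(best.values())[1]


print(decode(["c", "u", "2moro"]))
```

With these toy numbers, the bigram context is what separates "see" from "sea" for the token "c"; the channel scores alone only weakly prefer one over the other.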
Cite this article
Choudhury, M., Saraf, R., Jain, V. et al. Investigation and modeling of the structure of texting language. IJDAR 10, 157–174 (2007). https://doi.org/10.1007/s10032-007-0054-0
DOI: https://doi.org/10.1007/s10032-007-0054-0