A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer

  • Conference paper
Systems and Frameworks for Computational Morphology (SFCM 2011)

Abstract

Current Arabic lexicons, whether computational or otherwise, make no distinction between entries from Modern Standard Arabic (MSA) and Classical Arabic (CA), and tend to include obsolete words that are not attested in current usage. We address this problem by building a large-scale, corpus-based lexical database that is representative of MSA. We use an MSA corpus of 1,089,111,204 words, a pre-annotation tool, machine learning techniques, and knowledge-based templatic matching to automatically acquire and filter lexical knowledge about morpho-syntactic attributes and inflection paradigms. Our lexical database is scalable and interoperable, and it is suitable for constructing a morphological analyser regardless of the design approach and programming language used. The database is formatted according to the Lexical Markup Framework (LMF), the ISO standard for lexical resource representation. This lexical database is used in developing an open-source finite-state morphological processing toolkit. We also build a web application, AraComLex (Arabic Computer Lexicon), for managing and curating the lexical database.
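
To make the LMF formatting concrete, the following is a minimal sketch of how lexical entries might be serialised into LMF-style XML. It is not the authors' schema or tooling: the element names (`LexicalResource`, `Lexicon`, `LexicalEntry`, `Lemma`, `feat`) follow common LMF conventions, while the helper functions, the sample entries, and the feature names `gender`, `number` and `transitivity` are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def feat(parent, att, val):
    """Attach an LMF-style <feat att="..." val="..."/> child to parent."""
    return ET.SubElement(parent, "feat", {"att": att, "val": val})


def build_entry(lemma, pos, features):
    """Build one <LexicalEntry> with a lemma, a part of speech and extra features."""
    entry = ET.Element("LexicalEntry")
    feat(entry, "partOfSpeech", pos)
    lemma_el = ET.SubElement(entry, "Lemma")
    feat(lemma_el, "writtenForm", lemma)
    # Morpho-syntactic attributes; the names used here are hypothetical.
    for att, val in features.items():
        feat(entry, att, val)
    return entry


def build_lexicon(entries, language="ara"):
    """Wrap the entries in <LexicalResource><Lexicon>...</Lexicon></LexicalResource>."""
    resource = ET.Element("LexicalResource")
    lexicon = ET.SubElement(resource, "Lexicon")
    feat(lexicon, "language", language)
    for e in entries:
        lexicon.append(build_entry(e["lemma"], e["pos"], e["features"]))
    return resource


# Illustrative entries only; not taken from the actual database.
ENTRIES = [
    {"lemma": "كتاب", "pos": "noun",
     "features": {"gender": "masculine", "number": "singular"}},
    {"lemma": "كتب", "pos": "verb",
     "features": {"transitivity": "transitive"}},
]

resource = build_lexicon(ENTRIES)
xml_text = ET.tostring(resource, encoding="unicode")
print(xml_text)
```

Because each entry is a flat set of attribute-value pairs under a standard element hierarchy, a file of this kind can be read by any XML parser, which is one way a lexicon can stay independent of the morphological analyser's design and programming language.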

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Attia, M., Pecina, P., Toral, A., Tounsi, L., van Genabith, J. (2011). A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer. In: Mahlow, C., Piotrowski, M. (eds) Systems and Frameworks for Computational Morphology. SFCM 2011. Communications in Computer and Information Science, vol 100. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23138-4_7

  • DOI: https://doi.org/10.1007/978-3-642-23138-4_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23137-7

  • Online ISBN: 978-3-642-23138-4

  • eBook Packages: Computer Science (R0)
