Abstract
Current Arabic lexicons, whether computational or otherwise, make no distinction between entries from Modern Standard Arabic (MSA) and Classical Arabic (CA), and tend to include obsolete words that are not attested in current usage. We address this problem by building a large-scale, corpus-based lexical database that is representative of MSA. We use an MSA corpus of 1,089,111,204 words, a pre-annotation tool, machine learning techniques, and knowledge-based templatic matching to automatically acquire and filter lexical knowledge about morpho-syntactic attributes and inflection paradigms. Our lexical database is scalable, interoperable, and suitable for constructing a morphological analyser, regardless of the design approach and programming language used. The database is formatted according to the Lexical Markup Framework (LMF), the ISO standard for lexical resource representation. The database is used in developing an open-source finite-state morphological processing toolkit. We also build a web application, AraComLex (Arabic Computer Lexicon), for managing and curating the lexical database.
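The two ideas in the abstract, keeping only entries attested in a corpus and storing them in an LMF-style format, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `attested_entries` and `to_lmf`, the `min_count` threshold, the toy lemmas, and the exact element layout are assumptions made for illustration, loosely following the `feat att/val` convention commonly used in LMF serializations.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def attested_entries(candidates, corpus_lemmas, min_count=1):
    """Keep candidate entries whose lemma occurs at least min_count times in the corpus."""
    counts = Counter(corpus_lemmas)
    return [e for e in candidates if counts[e["lemma"]] >= min_count]


def to_lmf(entries, language="ara"):
    """Build an LMF-style XML tree (hypothetical layout) from a list of entries."""
    resource = ET.Element("LexicalResource")
    lexicon = ET.SubElement(resource, "Lexicon")
    ET.SubElement(lexicon, "feat", att="language", val=language)
    for e in entries:
        entry = ET.SubElement(lexicon, "LexicalEntry")
        ET.SubElement(entry, "feat", att="partOfSpeech", val=e["pos"])
        lemma = ET.SubElement(entry, "Lemma")
        ET.SubElement(lemma, "feat", att="writtenForm", val=e["lemma"])
    return resource


# Toy data: transliterated placeholders, not taken from the actual database.
candidates = [
    {"lemma": "kitAb", "pos": "noun"},
    {"lemma": "obsolete_word", "pos": "noun"},  # stands in for an unattested entry
]
corpus_lemmas = ["kitAb", "kitAb", "qalam"]

kept = attested_entries(candidates, corpus_lemmas)
xml_text = ET.tostring(to_lmf(kept), encoding="unicode")
print(xml_text)
```

In a real setting the lemma stream would come from a morphologically pre-annotated corpus rather than a hand-written list, and the threshold would likely be tuned rather than fixed at one occurrence.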
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Attia, M., Pecina, P., Toral, A., Tounsi, L., van Genabith, J. (2011). A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer. In: Mahlow, C., Piotrowski, M. (eds) Systems and Frameworks for Computational Morphology. SFCM 2011. Communications in Computer and Information Science, vol 100. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23138-4_7
DOI: https://doi.org/10.1007/978-3-642-23138-4_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23137-7
Online ISBN: 978-3-642-23138-4
eBook Packages: Computer Science (R0)