IADD: An integrated Arabic dialect identification dataset

The Arabic language has several variants that can be roughly grouped into three main categories: Classical Arabic (CA), Modern Standard Arabic (MSA) and Dialectal Arabic (DA). There are subtle differences between MSA and CA in terms of syntax, terminology and pronunciation. However, DA differs significantly from CA and MSA in that it reflects the geographic location of the speaker, or at least the country of origin if mobility factors are taken into account. This paper presents IADD, an Integrated dataset for Arabic dialect identification, that contains 135,804 texts representing Arabic dialects from 5 regions and 9 countries. IADD is created by combining subsets of five corpora to support the task of automatic Arabic dialect detection.


Subject: Data Science
Specific subject area: The dataset relates to automatic dialect identification, which is a natural language processing task that focuses on automatically detecting the dialect in which a text is written.

Value of the Data
• The proposed dataset covers not only different Arabic dialects but also different types of texts that are typically found on the web. IADD contains examples of tweets, Facebook posts and online comments. This aspect is important for building classifiers that handle different types of textual content.
• The dataset can benefit the natural language processing community, as it can be used to build and compare classifiers that automatically predict the dialect of a text written in Arabic. Digital data analysts can also use the dataset to infer the geographic origin of Arabic-speaking web users by identifying the dialect they use in their online interactions.
• IADD might also be used to support corpus-based dialectometry and to study geo-linguistic variation between Arabic dialects.
• The proposed dataset can be valuable in at least two domains:
  • Marketing analytics: Accurate identification of demographic characteristics is crucial in audience analysis. This dataset is a resource to support automatic identification of the geographic origin of the authors of reviews and comments.
  • Public opinion disaggregation: Opinion mining (i.e. sentiment analysis) has been extensively used as a tool to gauge public opinion toward a given subject. Classical approaches are limited to polarity and objectivity analysis. With the proposed resource, opinions can be disaggregated by geographic location, providing in-depth insight into public opinion and uncovering potential disparities within the community of Arabic-speaking web users.

Data Description
The objective was to build a large and diverse dataset with wide coverage of dialects and types of textual content, which supports better generalization of classification models. The Integrated Arabic Dialect Dataset (IADD) is created in two steps: (1) data source identification and (2) data preparation and insertion. At the end of the process, IADD is stored in a JSON-like format with the following keys:
• Sentence: contains the sentence/text;
Table 1 and Fig. 2 provide an overview of IADD, describing the number and percentage of sentences by region and country, and the vocabulary size. The average word count, character count and number of stop words per sentence, for each regional dialect, are presented in Table 2.
To give an overview of the most frequent words for each regional dialect supported by IADD, word clouds featuring the top 200 words are presented in Figs. 3, 4, 5, 6, 7 and 8. Before plotting the word clouds, the following preprocessing steps were applied: (1) letter normalization, (2) digit and punctuation removal, (3) Latin character removal, (4) elongation removal, (5) diacritics removal and (6) removal of Modern Standard Arabic stop words, while dialect-specific stop words were left unchanged.
Venn diagrams, presented in Figs. 9 and 10, show the numbers of common words between data sources' vocabularies (Fig. 9) and between dialects' vocabularies (Fig. 10). The figures show that there are 553 common words between dialects' vocabularies, while data sources' vocabularies share 1573 words.
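As an illustration, the six preprocessing steps listed above could be sketched as follows. This is a minimal sketch and not the code used to build the word clouds: the letter normalization rules, the rule that collapses three or more repeated characters, and the tiny stop word list are all assumptions made for the example, and the names `preprocess` and `shared_vocabulary` are hypothetical. Arabic characters are written as Unicode escapes to keep the listing unambiguous.

```python
import re
import string

# Step 1 (assumed rules): unify common letter variants.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda -> bare alif
    "\u0649": "\u064A",  # alif maqsura -> ya
})
# Step 2: ASCII and Arabic-Indic digits, ASCII and Arabic punctuation.
DIGITS_PUNCT = re.compile(
    "[0-9\u0660-\u0669" + re.escape(string.punctuation + "\u060C\u061B\u061F") + "]"
)
# Step 3: runs of Latin letters.
LATIN = re.compile(r"[A-Za-z]+")
# Step 4 (assumed rule): tatweel plus any character repeated 3+ times.
TATWEEL = "\u0640"
REPEATS = re.compile(r"(\S)\1{2,}")
# Step 5: tanween, short vowels, shadda, sukun and dagger alif.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Step 6: illustrative MSA stop words only ("fi", "min", "ala").
MSA_STOP_WORDS = {"\u0641\u064A", "\u0645\u0646", "\u0639\u0644\u0649"}


def preprocess(text, stop_words=MSA_STOP_WORDS):
    """Apply the six steps in order and return the remaining tokens."""
    text = text.translate(LETTER_MAP)
    text = DIGITS_PUNCT.sub(" ", text)
    text = LATIN.sub(" ", text)
    text = REPEATS.sub(r"\1", text.replace(TATWEEL, ""))
    text = DIACRITICS.sub("", text)
    # Stop words go through the same letter normalization as the text.
    normalized_stops = {w.translate(LETTER_MAP) for w in stop_words}
    return [tok for tok in text.split() if tok not in normalized_stops]


def shared_vocabulary(token_lists):
    """Words present in every vocabulary, as counted in a Venn diagram core."""
    vocabularies = [set(tokens) for tokens in token_lists]
    return set.intersection(*vocabularies) if vocabularies else set()
```

Because only MSA stop words are passed in, dialect-specific function words survive and remain visible in the resulting frequency counts, which matches the behaviour described in the text.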

Data sources identification
IADD is created from the combination of subsets of five corpora: DART, SHAMI, TSAC, PADIC and AOC. Each corpus supports a different set of dialects, as shown in Table 3. The Dialectal ARabic Tweets dataset (DART) [2] has about 25,000 tweets that are annotated via crowdsourcing. The SHAMI dataset [4] consists of 117,805 sentences and covers the Levantine dialects spoken in Palestine, Jordan, Lebanon and Syria. TSAC [5] is a Tunisian dialect corpus of 17,000 comments collected mainly from Tunisian Facebook pages. The Parallel Arabic Dialect Corpus (PADIC) [3] is made of sentences transcribed from recordings or translated from MSA. Finally, the Arabic Online Commentary (AOC) dataset [1] is based on reader commentary from the online versions of three Arabic newspapers, and it consists of 1.4M comments.

Data preparation and insertion
The procedures used to prepare the data from each source and insert it into IADD are detailed below.

SHAMI and TSAC
Sentences from SHAMI and TSAC are directly inserted in IADD. Region is set to "LEV" for SHAMI data and to "MGH" for TSAC data.

DART
Besides the five groups of regional dialects (EGY, IRQ, GLF, LEV, MGH), DART also contains an additional group named "Other". The items in the "Other" category are discarded and are therefore not added to IADD.

PADIC
Sentences in PADIC carry the following tags:
• ALGIERS and ANNABA are two cities in Algeria. These tags are used to distinguish sentences written in the Annaba dialect from those written in the Algiers dialect;
• The MODERN-STANDARD-ARABIC tag is associated with sentences written in MSA;
• SYRIAN, PALESTINIAN and MOROCCAN are dialects corresponding to Syria, Palestine and Morocco, respectively.
ALGIERS, ANNABA and MOROCCAN represent dialects from the Maghrebi region. Therefore, all sentences annotated as such are mapped to the region value "MGH". Similarly, "LEV" is assigned to SYRIAN and PALESTINIAN sentences. Finally, sentences holding the MODERN-STANDARD-ARABIC tag are discarded. In addition, as PADIC is publicly available as an XML file that contains Buckwalter-encoded sentences, every sentence is mapped to its Arabic-script version before it is included in IADD. Fig. 11 shows an example of a sentence before and after transformation.
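The PADIC tag mapping and the Buckwalter decoding described above could be sketched as follows. This is a sketch under stated assumptions: the table is the standard Buckwalter transliteration scheme, but the PADIC XML file may use an XML-safe variant for some symbols, which is not confirmed here. The function names and the two-key record shape ("Sentence", "Region") are hypothetical, and XML parsing is left out.

```python
# Standard Buckwalter table: ASCII symbol -> Arabic character.
# Assumption: PADIC uses this scheme unmodified.
BUCKWALTER = str.maketrans({
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062A", "v": "\u062B", "j": "\u062C",
    "H": "\u062D", "x": "\u062E", "d": "\u062F", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063A", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064A", "F": "\u064B", "N": "\u064C", "K": "\u064D",
    "a": "\u064E", "u": "\u064F", "i": "\u0650", "~": "\u0651",
    "o": "\u0652", "`": "\u0670", "{": "\u0671",
})

# Tag-to-region mapping as described in the text.
# MODERN-STANDARD-ARABIC is deliberately absent, so those sentences are dropped.
PADIC_TAG_TO_REGION = {
    "ALGIERS": "MGH",
    "ANNABA": "MGH",
    "MOROCCAN": "MGH",
    "SYRIAN": "LEV",
    "PALESTINIAN": "LEV",
}


def buckwalter_to_arabic(text):
    """Decode a Buckwalter-encoded string; unmapped characters pass through."""
    return text.translate(BUCKWALTER)


def convert_padic(tag, buckwalter_sentence):
    """Return an IADD-style record, or None when the sentence is discarded."""
    region = PADIC_TAG_TO_REGION.get(tag)
    if region is None:
        return None
    return {"Sentence": buckwalter_to_arabic(buckwalter_sentence), "Region": region}
```

Keeping the discarded tag out of the dictionary, rather than mapping it to a sentinel value, means any unexpected tag is also dropped instead of being silently assigned a region.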

AOC
Texts in the AOC dataset have three annotations given by three different reviewers. Annotators judged each text and assigned, according to their judgment, one of the following labels: "notsure", "junk", "levantine", "egyptian", "gulf", "iraqi", "maghrebi", "general" and "msa". Only texts with at least two identical annotations are considered. From these, texts annotated as "msa", "junk" or "notsure" are discarded, as sentences with the "msa" tag are in Modern Standard Arabic and the two other tags are associated with noisy and ambiguous sentences, respectively. Fig. 12 presents an example of discarded texts, and Fig. 13 presents an example of texts that are kept and included in IADD.
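The agreement rule for AOC can be expressed compactly. The following is a minimal sketch of that filter, assuming each text arrives with its list of annotator labels; the function name `resolve_aoc_label` is hypothetical, and the step that maps a retained label to an IADD region value is not shown because the text does not spell it out.

```python
from collections import Counter

# Labels dropped even when annotators agree, as stated in the text.
DISCARDED_LABELS = {"msa", "junk", "notsure"}


def resolve_aoc_label(annotations):
    """Return the label shared by at least two annotators, or None.

    None is returned when no two annotations match or when the agreed
    label is one of the discarded ones. The "general" label is not in the
    discard list given in the text, so it is returned like any dialect label.
    """
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    if count < 2 or label in DISCARDED_LABELS:
        return None
    return label
```

For example, `resolve_aoc_label(["egyptian", "egyptian", "msa"])` keeps the text with the label "egyptian", whereas `resolve_aoc_label(["msa", "msa", "gulf"])` discards it.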

Declaration of Competing Interest
The author declares that there are no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.