Syntactic- and morphology-based text augmentation framework for Arabic sentiment analysis

Arabic is a challenging language for automatic processing, due to several intrinsic properties such as its many dialects, ambiguous syntax, syntactic flexibility and diacritics. Machine learning and deep learning frameworks require large datasets for training to ensure accurate predictions. This leads to another challenge faced by researchers working with Arabic text: high-quality Arabic textual datasets are still scarce. In this paper, an intelligent framework for expanding or augmenting Arabic sentences is presented. The seed sentences were initially labelled by human annotators for sentiment analysis. The novel approach presented in this work relies on the rich morphology of Arabic, synonymy lists, syntactic or grammatical rules, and negation rules to generate new sentences from the seed sentences with their proper labels. Most augmentation techniques target image or video data; this study is the first to target text augmentation for the Arabic language. Using this framework, we were able to increase the size of the initial seed datasets tenfold. Experiments that assess the impact of this augmentation on sentiment analysis showed a 42% average increase in accuracy, due to the reliability and high quality of the rules used to build the framework.


INTRODUCTION
Arabic is the most widely spoken of the Semitic languages (Weninger et al., 2011; Al-Huri, 2015) and one of the most popular languages in the world. According to statistical studies from 2019 (Summary by Language Size, 2020), Arabic is spoken by nearly 319 million people and is ranked fifth among the world's languages after Chinese, Spanish, English and Hindi/Urdu. Arabic native speakers are distributed throughout the Arab world as well as many other nearby areas. Arabic has around 30 modern varieties or dialects, one of them being the standard form, Modern Standard Arabic (MSA) (ISO 639, 2020). In 2012, the United Nations Economic and Social Commission for West Asia reported that the Arabic language had achieved the highest growth rate on the Internet compared to other languages; consequently, digital Arabic content on the Internet has recently become fairly large. However, this growth has not been matched by the availability of large, high-quality annotated datasets. The framework presented in this paper exploits Arabic syntax, grammar and morphology to create new sentences with the same labels or opposite labels, as explained in "Description of Framework".
The syntax of the Arabic language is complex (Kevin, 2001): several agreement cases are possible between words in the same sentence, and each word has several synonyms. Therefore, it is possible to generate tens of variants of an Arabic sentence while preserving its meaning. This task can be automated if the system is able to parse the sentence and link it to lexical resources. Parsing is the process whereby each word in the text is labeled with its part-of-speech tag (Verb, Object, Subject, etc.). However, parsing is not a simple process, especially for Arabic, where the structure and order of the words are not fixed. The Natural Language Processing Group at Stanford University has built an open-source parser (The Stanford Natural Language Processing Group, 2018). The Stanford Parser provides a set of natural language processing functions. It was originally built for English; later on, many developers carried out extensive work to improve the code and the grammatical rules to make it more comprehensive. As a result, the parser has been extended to languages other than English, such as Chinese, German, Italian and Arabic. The parsing tool takes a text file as input, generates the base forms of words, and normalizes and interprets dates, times and numeric quantities. Finally, it analyzes the grammatical structure of the sentences. The output of the parsing process can be presented in several forms, such as phrase structures, trees, or dependencies.
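To make the parser's output concrete, the sketch below represents a phrase-structure tree as nested Python tuples and walks it to recover the part-of-speech tags. The tree and tag names are hand-built illustrations, not actual Stanford parser output.

```python
# A hypothetical sketch of the kind of phrase-structure tree a parser emits.
# The tree is (label, children...) tuples; a leaf is (tag, word).

def leaves(tree):
    """Collect (tag, word) pairs from a nested (label, children...) tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    out = []
    for child in children:
        out.extend(leaves(child))
    return out

# "read the-student the-book" as a VSO-style verbal-sentence skeleton.
sentence = ("S",
            ("VP", ("VBD", "read")),
            ("NP", ("DT", "the"), ("NN", "student")),
            ("NP", ("DT", "the"), ("NN", "book")))

tags = leaves(sentence)   # [("VBD", "read"), ("DT", "the"), ...]
```

Once a sentence is in this form, transformation rules can be written as operations over subtrees rather than over raw strings.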
To build the framework, the Stanford Arabic Parser was first used to generate the parse trees of Arabic sentences. Afterwards, the generated augmentation rules were applied to these trees to produce several equivalent parse trees for the original sentences, utilizing Arabic morphology, syntax, synonyms and negation particles. These augmentation rules can be broadly divided into: (1) rules which alter or swap branches of the parse trees as per Arabic syntax and thus generate new sentences with the same labels; (2) rules which generate new parse trees by utilizing the synonyms of words in the sentences, also generating new sentences with the same original labels; (3) rules which insert negation particles into the sentences and thus generate new sentences with opposite labels. It is worth mentioning here that the work in this paper addresses text augmentation for sentiment analysis, which means that the labels of the investigated sentences are neutral, positive or negative. Applying the sets of rules described in (1) or (2) above generates new sentences with the same labels as the input sentences. By comparison, applying the set of rules described in (3) generates new sentences with labels opposite to those of the input sentences. Three experiments on three datasets demonstrated the viability and effectiveness of the augmentation framework: the size of the original datasets increased substantially and the generated sentences were of high quality.
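The branch-swapping family of rules (category 1 above) can be sketched as a small tree operation. The VSO-to-VOS rule shown here is one plausible example; the actual framework defines 23 such rules over Stanford parse trees, and the constituent names are simplified for illustration.

```python
# Toy illustration of rule family (1): swapping sibling branches of a parse
# tree yields a grammatically valid reordering that keeps the same label.

def swap_children(tree, i, j):
    """Return a copy of (label, children...) with children i and j swapped."""
    label, *children = tree
    children[i], children[j] = children[j], children[i]
    return (label, *children)

def augment(example, rules):
    """Apply each rule to a (tree, label) pair; labels are preserved."""
    tree, label = example
    return [(rule(tree), label) for rule in rules]

vso = ("S", ("V", "read"), ("Subj", "the-student"), ("Obj", "the-book"))
rules = [lambda t: swap_children(t, 1, 2)]   # VSO -> VOS

generated = augment((vso, "positive"), rules)
```

Rule families (2) and (3) follow the same pattern but rewrite leaves (synonyms) or insert new nodes (negation particles) instead of reordering siblings.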
The rest of this article is organized as follows: "Related Work" briefly describes the related literature. "Arabic Language Properties" explains the properties of the Arabic language. "Description of Framework" explains the design of the transformation rules which are the core of the augmentation framework. "Negation" describes the implementation of the framework. "Evaluation" presents the experiments which were carried out to assess the effectiveness of the proposed work. Finally, "Conclusion" summarizes the conclusions of this work.

RELATED WORK
This section describes related studies which have utilized Arabic WordNet as a component of frameworks. It also describes related work which addresses data augmentation.

Arabic WordNet
WordNet (Miller et al., 1990) is a large linguistic database, or hierarchical dictionary, which was initially developed for the English language. It has been very useful in computational linguistics and Natural Language Processing (Miller & Fellbaum, 2007). Because of its structure, WordNet differs from standard dictionaries in that it groups words based on their meanings. The English WordNet lexicon (Miller, 1995) is divided into syntactic categories (nouns, verbs, adjectives and adverbs); function words are excluded. WordNet groups synonyms by meaning (as in a thesaurus) rather than by form (as in dictionaries). It also represents words redundantly: a given word may appear in the noun, verb and adverb syntactic categories. WordNet consists of four parts (Miller et al., 1990): (1) the lexicographers' source files; (2) the tool that converts these files into the lexical database; (3) the lexical database; (4) software tools used to access the database.
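WordNet's organization by meaning can be sketched as a tiny data structure: words are indexed by synset rather than by surface form, so one form may appear under several syntactic categories. The entries below are invented for illustration.

```python
# A minimal sketch of WordNet-style synset storage. Each synset groups
# lemmas that share one meaning; "book" appears both as a verb (reserve)
# and as a noun (volume), illustrating the redundant representation.

synsets = [
    {"id": "n.01", "pos": "noun", "lemmas": ["car", "auto", "automobile"]},
    {"id": "v.01", "pos": "verb", "lemmas": ["book", "reserve"]},
    {"id": "n.02", "pos": "noun", "lemmas": ["book", "volume"]},
]

def senses(word):
    """All synsets in which a surface form appears (possibly several POS)."""
    return [s["id"] for s in synsets if word in s["lemmas"]]

book_senses = senses("book")   # one verb sense and one noun sense
```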
WordNet has been very useful, having been used to build many Natural Language Processing applications, information retrieval systems, term expansion and document representations (Fellbaum & Vossen, 2007). For example, Varelas et al. (2005) compared the performance of semantic similarity methods using a single ontology versus different ontologies. The single-ontology experiments were performed using WordNet and showed better performance.
Many efforts have been reported to adapt WordNet to other languages, such as WordNets for European languages (Vossen, 2004) and the French and Slovene WordNets (Sagot & Fišer, 2021). By comparison, Arabic WordNet (Elkateb et al., 2006) used the same word-representation approach as Princeton WordNet to keep it compatible with the structures of other WordNets. Arabic WordNet is a lexical database for MSA with two main linguistic categories (verbs and nouns). First, the important concepts that constitute the core WordNet were extracted; then concepts specific to the Arabic language were developed, along with other concepts that were manually translated to the most suitable synsets from other languages. It was developed using MySQL and XML (Elkateb et al., 2006). The final Arabic WordNet contains 11,270 synsets (2,538 verbal, 7,961 nominal, 110 adverbial and 661 adjectival) covering 23,496 Arabic expressions. Table 1 presents detailed statistics of Arabic WordNet.
Several researchers have targeted extending Arabic WordNet. For example, in the work reported in Alkhalifa & Rodríguez (2009, 2010), the authors automatically extracted named entities from the Arabic Wikipedia, attached these entities as instances to the synsets of Arabic WordNet, and finally linked them to their counterparts in the English WordNet. Moreover, Badaro, Hajj & Habash (2020) introduced an automatic method for expanding Arabic WordNet by formulating the task as a link prediction problem. Shoaib et al. (2009) used the relationships in Arabic WordNet to build a model for semantic search in the Holy Quran. The proposed model improved the searching and retrieval of related verses from the Holy Quran without requiring a specific keyword in the query. The model works in two stages: it identifies one sense of the query word using Word Sense Disambiguation, then extracts all the synonyms of the identified sense. AlMaayah, Sawalha & Abushariah (2016) also worked on the Holy Quran, building a model that extracts synonyms and constructs the Quranic Arabic WordNet. This net was built from the Boundary-Annotated Quran Corpus, lexicon resources, and traditional Arabic dictionaries. The final model was able to link Holy Quran words that have the same meaning and to generate synsets using the vector space model. The Quranic Arabic WordNet has 6,918 synsets derived from 8,400 unique word senses. In other studies, researchers have tried to extract semantic relationships between words and to provide models that represent ontological relations for Arabic content on the Internet; such representations facilitate the analysis and processing of Arabic text. Al Zamil & Al-Radaideh (2014) used semantic features extracted from the text, along with syntactic patterns of relationships, to automate the extraction of ontological relations. The extracted features are used to construct generalized rules, which in turn were used to build a classifier that assigns each concept its designated relationship label.

Data augmentation
Data augmentation is a technique used to increase the size of a dataset while preserving its labels. It became popular with deep learning networks, as they require training on huge datasets to attain high accuracy (Krizhevsky, Sutskever & Hinton, 2012; Szegedy et al., 2015; Jaitly & Hinton, 2013; Ko et al., 2015). Extending the size (number of samples) of a dataset, especially for under-represented classes, mainly depends on generating perturbed replicas of the class samples. This technique has proved successful in image classification, as in the work reported in Krizhevsky, Sutskever & Hinton (2012), Tran et al. (2017) and Irsheidat & Duwairi (2020); 3D pose estimation, as reported in Rogez & Schmid (2016); speaker language identification, as described in Keren et al. (2016); recognition of audio-visual affect (Tzirakis et al., 2017); and the classification of environmental sound (Salamon & Bello, 2017).
On the other hand, data augmentation is limited when dealing with textual data, because it is very difficult to define and standardize specific rules or transformations that preserve the meaning of the produced text (Kobayashi, 2018). The main approach for increasing the size of textual data while preserving its meaning is to use the synonyms of words, relying on lexical resources such as WordNet.
The works reported in Zhang, Zhao & LeCun (2015) and Wang & Yang (2015) used a synonym-based approach to augment textual data. As synonyms are very limited, the proposed sentences are neither very different from nor much more numerous than the original texts. Therefore, Kobayashi (2018) proposed contextual augmentation, a state-of-the-art method for substituting words that produces more varied sentences. The author used words predicted by a bidirectional language model (LM) instead of synonyms. The proposed approach was able to present a wide range of substitute words; it was tested with two classifiers using recurrent and convolutional neural networks, where it improved overall performance. Rizos, Hemker & Schuller (2019) targeted extending text used for hate speech detection, relying on synonym lists, wrapping the word token around the padded sequence, and applying class-conditional recurrent neural language generation. The authors state that they achieved a 5.7% increase in Macro-F1 and a 30% increase in recall when extending the datasets using their three text-extension methods.
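The basic synonym-substitution approach described above can be sketched in a few lines: every word with an entry in a synonym lexicon is replaced in turn, yielding one new sentence per substitution. The real systems draw their lists from WordNet; the tiny English lexicon below is invented for illustration.

```python
# A minimal sketch of synonym-based text augmentation. Each in-lexicon
# word is substituted one at a time, so n substitutable senses yield
# n variant sentences, all keeping the original label.

synonyms = {
    "good": ["fine", "great"],
    "movie": ["film"],
}

def augment_by_synonyms(sentence):
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        for syn in synonyms.get(word, []):
            variants.append(" ".join(words[:i] + [syn] + words[i + 1:]))
    return variants

variants = augment_by_synonyms("a good movie")
```

The limitation noted in the text is visible here: the variants differ from the original by exactly one word, so the diversity of the augmented data is bounded by the synonym lists.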
The work reported in Sharifirad, Jafarpour & Matwin (2018) described a framework for augmenting tweets based on ConceptNet and Wikidata. The authors suggested two methods: first, improving the quality of existing tweets by appending terms extracted from ConceptNet and Wikidata without increasing their number; second, generating new tweets by replacing words or terms in the original tweets with terms extracted from ConceptNet and Wikidata. This approach is close to the approaches which utilize synonyms.
In a similar study, Kolomiyets, Bethard & Moens (2011) replaced headwords with substitute words predicted by the Latent Words language model, keeping only the top-k scoring words as substitutes. Mueller & Thyagarajan (2016) substituted random words in sentences with their synonyms to generate new sentences, and subsequently trained a siamese recurrent network to compute the similarity between sentences. Wang & Yang (2015) employed word embeddings to increase the size of the training data; specifically, they replaced a given word with the word whose vector is its nearest neighbor.
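The embedding-based substitution used by Wang & Yang (2015) can be sketched with toy vectors: the replacement for a word is its nearest neighbour in embedding space by cosine similarity. The three-dimensional vectors below are invented; real work uses pretrained embeddings with hundreds of dimensions.

```python
# Nearest-neighbour word substitution over toy embeddings. Cosine
# similarity is computed in pure Python to keep the sketch self-contained.

import math

emb = {
    "good":  (0.9, 0.1, 0.0),
    "great": (0.85, 0.15, 0.05),
    "bad":   (-0.9, 0.1, 0.0),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(word):
    """The vocabulary word (other than `word`) closest in embedding space."""
    return max((w for w in emb if w != word),
               key=lambda w: cosine(emb[word], emb[w]))

replacement = nearest("good")
```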
As can be seen from the above literature, most existing augmentation techniques address image or audio data, and less work addresses text augmentation; to the best of our knowledge, no previous work addresses Arabic text augmentation. The proposed framework is substantially different from text augmentation that relies on replacing words with their synonyms: instead, it utilizes the rich syntax and grammar of the Arabic language to define transformation rules, which are subsequently used to generate new sentences from seed sentences.

ARABIC LANGUAGE PROPERTIES
Arabic is one of the Semitic languages. It consists of 28 basic letters. Several Arabic letters change shape based on their location in the word. For example, the letter (س) takes the shape (سـ) at the beginning of a word, the shape (ـسـ) in the middle of a word, (ـس) at the end of a word when connected to the previous letter, and (س) at the end of a word when disconnected from the previous letter. Arabic is an inflectional language that is written from right to left. The following three subsections provide background about the Arabic language.

Arabic morphology
Morphology is the study of the structure of words. The morphology of the Arabic language is complex but systematic, and there are two ways to build a word in Arabic: derivation and agglutination. Derivation generates stems from a list of roots, based on three basic letters for trilateral roots. For example, using the root (د ر س), which follows the pattern (فعل), one can generate several stems. The second way to build words in Arabic is agglutination, in which words are built by adding affixes: prefixes at the beginning of the word, infixes in the middle of the word (such as ا), or suffixes at the end of the word.
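Root-and-pattern derivation can be sketched by slotting a root's three consonants into templates in which the digits 1, 2, 3 stand for the root letters. The templates and glosses below are illustrative; Buckwalter-style transliteration of the root d-r-s ("to study") is used to keep the sketch ASCII-safe.

```python
# A rough sketch of trilateral root-and-pattern derivation.

def derive(root, template):
    """Fill template slots '1', '2', '3' with the root consonants."""
    out = template
    for i, consonant in enumerate(root, start=1):
        out = out.replace(str(i), consonant)
    return out

root = ("d", "r", "s")
templates = ["1a2a3a",    # faEala pattern: "darasa" (he studied)
             "1aa2i3",    # faaEil pattern: "daaris" (one who studies)
             "ma12a3a"]   # a mafEala-like pattern: "madrasa" (school)

stems = [derive(root, t) for t in templates]
```

The same templates applied to a different root produce a parallel family of stems, which is what makes the system "complex but systematic".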

Arabic syntax
In Arabic, sentences fall into two categories: nominal and verbal. Each type has its own grammar and rules. The nominal sentence consists of a subject (Almubtada) and a predicate (Alkhabar). The normal order places the subject before the predicate, but in certain cases it is allowed to swap them. The subject in a nominal sentence can be a noun, pronoun or number, while the predicate can be a singular noun, adverb, prepositional phrase, nominal sentence, or verbal sentence. The verbal sentence in Arabic, as in many other languages, consists of a Verb (V), Subject (S) and Object (O), but without a fixed order: verbal sentences may follow the orders VSO, VOS or SVO, among others. Additionally, in Arabic, diacritics, prefixes and suffixes are used to mark gender. Therefore, the absence of diacritics can create ambiguity and may change the meaning.
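The flexible ordering of verbal sentences can be sketched by permuting the (V, S, O) constituents and keeping only the orders the grammar permits. The permitted set below follows the orders named in the text and is an illustrative subset, not a complete account of Arabic word order.

```python
# Enumerate grammatically permitted reorderings of a verbal sentence.

from itertools import permutations

ALLOWED = {"VSO", "VOS", "SVO"}   # orders cited above (illustrative subset)

def valid_orders(v, s, o):
    """All allowed (order-name, sentence) pairs for the given constituents."""
    parts = {"V": v, "S": s, "O": o}
    out = []
    for perm in permutations("VSO"):
        name = "".join(perm)
        if name in ALLOWED:
            out.append((name, " ".join(parts[c] for c in perm)))
    return out

orders = valid_orders("read", "the-student", "the-book")
```

Each reordering of a labelled seed sentence is exactly the kind of label-preserving variant the framework's branch-swapping rules produce.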

Diacritics
One of the features of the Arabic language is the diacritics that are written above or below its letters. Diacritics are small vowel marks that represent the three short vowels (a, i, u). They regulate and control the letters and their pronunciation. Diacritics therefore have a huge effect on the text and its meaning, and removing them may lead to morphological-lexical and morphological-syntactical ambiguities. For example, the word (nEm, نعم) means 'yes' if written (naEom, نَعَمْ), while it means 'graces' if written (niEm, نِعَم). The basic diacritics of the Arabic language are:
Fatha: a small diagonal stroke above the letter.
Damma: a small (و)-shaped mark above the letter.
Kasra: a small diagonal stroke below the letter.
Sukun: a small circle above the letter.
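Since diacritics are combining marks in Unicode, stripping them (as many preprocessing pipelines do) takes one line, and the sketch below shows why two distinct words then collapse to the same undiacritized form, creating exactly the ambiguity described above.

```python
# Stripping Arabic diacritics with the standard library. Harakat (fatha,
# damma, kasra, sukun, ...) are Unicode combining marks, so filtering on
# unicodedata.combining removes them while keeping the base letters.

import unicodedata

def strip_diacritics(text):
    """Remove all combining marks from the text."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

yes    = "نَعَمْ"   # naEom, "yes"
graces = "نِعَم"    # niEm, "graces"

same_after_strip = strip_diacritics(yes) == strip_diacritics(graces)
```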

TRANSFORMATION RULES DEFINITION
As a first step, clear definitions of Arabic grammar rules were specified. These rules include specifications for nominal sentences, verbal sentences, questions, verbs, adjectives, pronouns, prepositions, conjunctions and numbers. These grammar-based rules were represented using the Stanford Arabic parser tagset. Table 2 lists these tags in full detail.
Table 3, on the other hand, summarizes the core concepts of this research. Its second column lists grammar rules for valid sentences in Arabic, and its third column lists equivalent grammar rules derived from the original rules in the second column. The importance of these rules is that a sentence that respects a grammar rule in the second column can be mapped to new sentences which fulfill the corresponding grammar rule in the third column while keeping the same label for the classifiers. Example Arabic sentences respecting the grammar rules in Table 3 show how these sentences are transformed into new sentences; the arrow (→) means that the RHS of a rule is equivalent to its LHS.
The performance of Arabic WordNet is not satisfactory when compared with other WordNets. For example, Arabic WordNet covers only 9.7% of the Arabic lexicon, while the English WordNet covers 67.5% of the English lexicon. Also, Arabic WordNet synsets are linked only through hyponymy, synonymy and equivalence, whereas seven semantic relations are used in the English WordNet. Nevertheless, since the main goal is generating the synonyms of words, the limitations of Arabic WordNet did not substantially affect the work. Also, to avoid the noise caused by diacritics, only the top five synsets in each synonym list were considered. Table 5 shows the first eight synsets for the Arabic word "Man - رجل". As can be seen from Table 5, the deeper we go in generating synonyms, the higher the chance of generating wrong synonyms: the last two entries in Table 5 correspond to "leg" and "foot", not "Man".

Experiment 1: classification of sentiment towards products
The aim of this experiment is to classify product reviews as positive, negative or neutral. The focus of the experiment is not the classifier itself, but the change in accuracy obtained when the proposed framework is used to enlarge the dataset. For the first experiment, we used a subset of a public dataset of product reviews (ElSahar & El-Beltagy, 2015) containing 300 reviews written in Arabic, collected from souq.com. The data was annotated with three labels (1: positive, 0: neutral, −1: negative). Before performing any changes on the original data, the data was tested using several supervised classifiers (Naive Bayes, K-nearest neighbor and support vector machine). The data was divided into 70% for training and 30% for testing. All the classifiers used word embeddings generated with AraVec with a dimension equal to 300 (Soliman, Eisa & El-Beltagy, 2017). After training, each classifier's ability to classify the testing data was assessed. Accuracy, calculated by dividing the number of correctly classified reviews by the total number of reviews, was used as the performance measure. The reported accuracy was 54.18% for the SVM classifier, 49.99% for the Naïve Bayes classifier and 52.17% for the K-nearest neighbor classifier. Next, the data was fed into the augmentation tool, which increased its size by almost 10 times, and the generated dataset was tested using the same classifiers. Compared with the previous results, accuracy increased by 42% on average. In detail, the accuracy rates obtained on the augmented dataset were 97% for the SVM, 87% for NB and 91.66% for K-nearest neighbor, as illustrated in Fig. 3. This improvement was expected, as increasing the dataset size improves the training process, which in turn improves the overall performance of the classifier.
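The accuracy measure used throughout the experiment is simply the fraction of correctly classified reviews; the sketch below computes it on invented labels using the paper's 1/0/−1 encoding.

```python
# Accuracy as used in the experiment: correctly classified reviews
# divided by all reviews. Labels: 1 positive, 0 neutral, -1 negative.
# The label lists are invented for illustration.

def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 0, -1, 1, -1, 0, 1, 1, 0, -1]
y_pred = [1, 0, -1, 0, -1, 0, 1, -1, 0, -1]

acc = accuracy(y_true, y_pred)   # 8 of 10 correct
```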

Experiment 2: testing the efficiency of each transformation rule
The aim of this experiment was to test the accuracy of each transformation rule independently. To achieve this goal, a small artificial dataset was designed, consisting of 40 statements with positive sentiment, 32 statements with negative sentiment and 27 neutral statements. These 99 sentences were carefully designed to align with the 23 transformation rules. Each sentence was processed by the augmentation tool, and several sentences were generated for each input sentence. The generated sentences were manually inspected to test their validity. Rule accuracy measures the ability of a given rule to generate correct and meaningful sentences; it is calculated by dividing the number of correct sentences generated by a given rule by the total number of sentences generated by that rule. A "correct sentence" means a grammatically correct and meaningful sentence. Table 9 shows the accuracy obtained for each rule. As can be seen from the table, all of the rules secured high accuracies, meaning the rules are capable of generating correct sentences. When examining the sources of error, we found that errors were caused by improper synonyms generated from Arabic WordNet. It is important to note here that Arabic WordNet covers only 9.7% of the Arabic lexicon.
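The per-rule accuracy computation can be sketched directly from its definition: for each rule, divide the manually judged correct generations by all generations of that rule. The rule names and judgement counts below are invented; Table 9 holds the real per-rule numbers.

```python
# Rule accuracy: fraction of a rule's generated sentences judged correct.

def rule_accuracy(judgements):
    """judgements: rule name -> list of booleans (True = correct sentence)."""
    return {rule: sum(ok) / len(ok) for rule, ok in judgements.items()}

judgements = {
    "swap_subject_predicate": [True, True, True, False],
    "vso_to_vos":             [True, True, True, True],
}

scores = rule_accuracy(judgements)
```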

Experiment 3: the efficiency of negation rules
The goal of the third experiment is to assess the capability of the Negation module to generate correct sentences. A small artificial dataset consisting of 26 positive sentences and 24 negative sentences was created for this purpose. The Negation module is responsible for inserting proper negation particles into the input sentences. Negation flips the polarity of the input sentence: positive sentences become negative and vice versa. All the sentences produced by the Negation module were correct, with their respective labels properly flipped.
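The Negation module's behaviour can be sketched as a two-step transformation: insert a negation particle before the verb and flip the sentiment label. The particle, the verb position, and the Buckwalter-style example sentence are all illustrative; the real module selects the particle according to Arabic negation rules, which also adjust the verb form.

```python
# A minimal sketch of label-flipping negation. "lA" stands in for an
# Arabic negation particle in ASCII transliteration; the sentence is a
# made-up two-word example, not from the paper's datasets.

FLIP = {"positive": "negative", "negative": "positive"}

def negate(words, verb_index, particle="lA"):
    """Insert a negation particle immediately before the verb."""
    return words[:verb_index] + [particle] + words[verb_index:]

def negate_example(sentence, label, verb_index):
    new_words = negate(sentence.split(), verb_index)
    return " ".join(new_words), FLIP[label]

new_sentence, new_label = negate_example("AHb Alflm", "positive", 0)
```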

CONCLUSION
In this study, a novel data augmentation framework for Arabic textual datasets for sentiment analysis was presented. In total, 23 transformation rules were designed to generate new sentences from the input ones. These rules were designed after carefully inspecting Arabic morphology and syntax. To increase the number of sentences generated by each rule, Arabic WordNet was used to swap words with their respective synonyms. These rules preserve the labels of the input sentences: if the input sentence has a positive label, the generated sentences also have positive labels, and likewise for negative and neutral labels. A Negation module was also designed to insert negation particles into Arabic sentences. This module inverts or flips the labels of the generated sentences, as this is the effect of negation particles on the polarity of statements. We tested the proposed framework by conducting three experiments. The first experiment demonstrated the effect on classification of increasing the dataset size using the augmentation tool; as expected, accuracy improved for all classifiers, indicating that the quality of the generated sentences was high. The second experiment tested the accuracy of each transformation rule on an artificial dataset designed for this purpose; all rules scored extremely high accuracies. The third experiment used an artificial dataset to assess the quality of the sentences generated by the Negation module; it revealed that all generated sentences were correct, with proper associated labels.

Table 1
Statistical properties of Arabic WordNet.

Table 3
Transformation rules based on Arabic grammar.

Table 9
Accuracy rate per transformation rule.