Arabic Sentiment Analysis with Optimal Combination of Features Selection and Machine Learning Approaches

The main objective of this study is to design a model that applies a novel technique to sentiment analysis in the Arabic language. Sentiment analysis is a task that draws on web mining, Natural Language Processing (NLP) and Machine Learning (ML). Most research on sentiment analysis has focused on English-language texts, so research on sentiment analysis in Arabic and other languages is still in its infancy. This study empirically evaluates three Feature Selection Methods (FSM), namely Information Gain (IG), Chi-square (CHI) and Gini Index (GI), and three classification approaches, namely Association Rule (AR) mining, the N-gram model and the meta-classifier approach, for sentiment classification in the Arabic language. The experiments were carried out on the Opinion Corpus of Arabic (OCA). The results depend on the algorithm used and on the number of selected features, and they show that feature selection can increase the performance of sentiment classification in the Arabic language. Among the FS methods, CHI produced the best performance, and the meta-classifier combination approach outperformed the other approaches. In conclusion, the combination approach (meta-classifier) with the chi-square FS method yields the most accurate classification, with a score as high as 90.80%.


INTRODUCTION
Today, the use of online social media has grown extensively in numerous sectors, ranging from social chat between family members and friends to banking transactions, purchasing fashion wear and the expression of viewpoints. These online comments or opinions cover a variety of topics, including books, movies, electronic products, cars, politics and eateries. This activity has raised the interest of different parties, such as customers, companies and governments, in analyzing and investigating these opinions. For customers, the rapid growth of e-commerce has encouraged more people to buy from online shops and stores. Consequently, people who review the comments about products can learn from other people's experiences and form a general idea of these products, which helps them make the best choice (Rushdi-Saleh et al., 2011).
The efficient and accurate mining and identification of useful reviews to meet the needs of both present and potential shoppers have become a critical challenge for market-driven product form. Recently, data mining and natural language processing have been applied to the correct extraction of people's sentiments from large quantities of reviews in unstructured text.
Sentiment classification is a particular type of text classification that assigns reviews to negative or positive categories according to their overall sentiment polarity. There are a variety of techniques and methods for opinion mining or sentiment analysis. Most of them fall into two main groups: supervised and unsupervised learning approaches. In the supervised machine learning approach, sentiment corpora are used to train the classifiers.
There has been extensive study of ways to overcome the problems faced in implementing Sentiment Analysis (SA) on the comments in reviews. Most of this work analyzes text in the English language. The main reason is the lack of resources for sentiment analysis in languages other than English (Montoyo et al., 2012).
In this study, several sentiment classification techniques are designed for sentiment analysis in the Arabic language. Three feature selection methods (FSM), namely Information Gain (IG), Chi-square (CHI) and Gini Index (GI), are compared empirically. The FSM are investigated with three machine learning classifiers: Association Rule (AR), the N-gram model and the meta-classifier approach. The evaluation of these approaches is based on the Opinion Corpus of Arabic (OCA) (Rushdi-Saleh et al., 2011).

LITERATURE WORK
The study of sentiment analysis has attracted much attention recently. Several approaches have been developed for sentiment analysis: the lexicon-based approach, the machine learning-based approach and the hybrid approach. Hai et al. (2011) proposed the use of association rule mining that records the co-occurrence counts of opinion words and explicit features in the Chinese language. Association rule mining is used to build a mapping from the review words to possible features. Likewise, Wang et al. (2013) employed an association rule mining idea similar to that of Hai et al. (2011). Both works involve the identification of a set of basic rules together with three feasible ways of extending the rule set. Man et al. (2014) used association rules to classify the sentiment of web reviews. Their study examines multi-domain datasets that contain reviews of several product types (fields) taken from Amazon.com. Alsaffar and Omar (2015) applied a model of Malay sentiment analysis that uses sent-lexicon with a machine learning approach. They built a new model that generated a set of features for training a k-Nearest Neighbor (k-NN) classifier. Omar et al. (2013, 2014) investigated several feature selection methods with different machine learning methods.
Meanwhile, Soliman et al. (2014) used a dataset of 1000 tweets in the Arabic language (500 positive and 500 negative). They targeted sentence-level sentiment analysis, taking the 140-character limit on tweets as a guide. They also investigated the effects of dialectal language. Because the corpus was rather limited, the researchers added some Egyptian Arabic to improve accuracy. The Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers were used to classify polarity. Ibrahim et al. (2015) conducted a feature-based, sentence-level approach for Arabic sentiment analysis. Their Arabic dataset was made up of two parts: the first part contains 1000 tweets in dialectal language, while the second part consists of 1000 tweets in MSA. A Support Vector Machine (SVM) classifier was used together with a lexicon built from gold-standard sentiment words. The sentiment words were collected and annotated manually, after which the system expands the lexicon and detects the sentiment orientation of new sentiment words automatically, using a synset aggregation approach and free online Arabic lexicons and thesauruses.

MATERIALS AND METHODS
This study proposes a supervised machine learning approach for sentiment analysis in the Arabic language. The technique uses a meta-classifier combination approach that combines two classifiers: association rules and N-grams. This section gives an overview of the architecture of the proposed technique and describes the functionality of each component of the Arabic sentiment analysis system. Figure 1 presents the architecture of the proposed Arabic sentiment analysis. The technique uses the Opinion Corpus of Arabic (OCA) dataset, which contains 250 positive comments and 250 negative comments. The approach involves the following phases:
• Data pre-processing phase.
• Feature selection phase.
Pre-processing: Data pre-processing comprises four steps: normalization, tokenization, stop word removal and stemming. All reviews go through the pre-processing stage. In the normalization process, diacritics, repeated characters and social media tags are removed. After that, stop words that are common to all of the reviews are removed to avoid misclassifying the reviews (Eickhoff, 2015).
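The normalization, tokenization and stop word removal steps can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the regular expressions, the rule that collapses runs of three or more identical characters, and the tiny `STOP_WORDS` set are all assumptions made for the example, and stemming is left out because the paper does not name a stemmer.

```python
import re

# Arabic diacritics (tashkeel, U+064B to U+0652) plus tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Hashtags, mentions and URLs stand in for "social media tags" (assumption).
TAGS = re.compile(r"(?:[@#]\w+|https?://\S+)")
# A character repeated three or more times is collapsed to one (assumption).
REPEATS = re.compile(r"(.)\1{2,}")

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"في", "من", "على"}


def normalize(text):
    """Strip tags, diacritics and elongated character runs."""
    text = TAGS.sub(" ", text)
    text = DIACRITICS.sub("", text)
    return REPEATS.sub(r"\1", text)


def preprocess(text, stop_words=STOP_WORDS):
    """Normalize, tokenize on word characters, then drop stop words."""
    tokens = re.findall(r"\w+", normalize(text))
    return [t for t in tokens if t not in stop_words]
```

Passing the stop-word set as a parameter keeps the sketch usable with a real Arabic stop-word list.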
Fig. 1: The architecture of meta-classification approach for Arabic sentiment analysis
Feature selection method: Feature selection, or feature engineering, is an extremely important task in sentiment analysis and, more generally, in text categorization. Converting original documents to feature vectors is the main step in any supervised learning approach to sentiment analysis. Work along this line argues that selecting the right feature set determines the overall performance of sentiment classification (Xia et al., 2011). Consequently, this study investigates the strengths and weaknesses of existing features and feature sets, as well as their effect on the performance of Arabic sentiment classification. The purpose of the Feature Selection Method (FSM) is to decrease the number of features in the data matrix obtained from the pre-processing phase and to remove irrelevant, redundant or noisy data. The selection process brings positive effects, improving the scalability, efficiency and accuracy of the text classifier. Additionally, FSM reduces the dimension of the data, improves accuracy by removing noisy features, and speeds up training and classification. In the feature selection phase, one or more types of software are used to process the training and test data for the extraction of descriptive information. This study used three feature selection methods for sentiment classification.

Information Gain (IG):
The Information Gain (IG) method is used to rank the most relevant features by measuring the relevance of an attribute with respect to a class. IG is a very popular Feature Selection (FS) algorithm and is used as a measure of term goodness in Machine Learning (ML) (Haddi et al., 2013). It measures the amount of information that the presence or absence of a feature contributes to deciding the correct class. IG is calculated as:
$$IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$
where $P(c_i)$ is the probability that class $c_i$ occurs, $P(t)$ is the probability that the word $t$ occurs and $P(\bar{t})$ is the probability that the word $t$ does not occur.
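As an illustration, IG in its usual text-categorization form (class entropy minus the entropy remaining once the presence or absence of the term is known) can be computed as below. This is a sketch under assumptions: documents are represented as sets of tokens, probabilities are estimated from document counts, and the logarithm is taken in base 2. The function name `information_gain` is hypothetical.

```python
import math


def information_gain(docs, labels, term):
    """IG of `term`; `docs` is a list of token sets, `labels` the classes."""
    n = len(docs)
    classes = set(labels)
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_t = [lab for doc, lab in zip(docs, labels) if term not in doc]

    def plogp_sum(subset):
        # Sum over classes of P(c | subset) * log2 P(c | subset).
        total = 0.0
        for c in classes:
            if subset:
                p = subset.count(c) / len(subset)
                if p > 0:
                    total += p * math.log2(p)
        return total

    entropy = -plogp_sum(list(labels))
    return (entropy
            + len(with_t) / n * plogp_sum(with_t)
            + len(without_t) / n * plogp_sum(without_t))
```

A term that occurs only in one class receives the maximum score, while a term spread evenly across classes scores zero.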
Chi-square statistic (CHI): CHI is one of the most frequently used feature selection algorithms. It measures the difference between a term and a group (Yang et al., 2009). In text categorization, it measures the independence of two random variables: the occurrence of a term, t, and the occurrence of a class, c. It is widely adopted in text categorization studies because it performs well in comparison with other feature selection algorithms (Taşcı and Güngör, 2013). The CHI value for each term, t, in category, c, can be calculated using the following equation:
$$\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$
where N is the total count of Arabic training reviews, A is the count of Arabic reviews that belong to class c and contain term t, and B is the count of Arabic reviews that do not belong to class c but contain term t. Meanwhile, C is the count of Arabic reviews that belong to class c but do not contain term t, and D is the count of Arabic reviews that do not belong to class c and do not contain term t (Thabtah et al., 2010).
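The contingency counts A, B, C and D translate directly into code. The sketch below assumes documents are sets of tokens and uses the standard chi-square statistic for a term and a class; returning 0.0 when the denominator is zero is a choice made for the example, not something stated in the paper.

```python
def chi_square(docs, labels, term, cls):
    """Chi-square score of `term` for class `cls` from a 2x2 contingency table."""
    n = len(docs)
    a = b = c = d = 0
    for doc, lab in zip(docs, labels):
        has_t, in_c = term in doc, lab == cls
        if has_t and in_c:
            a += 1  # in class, contains term
        elif has_t:
            b += 1  # not in class, contains term
        elif in_c:
            c += 1  # in class, lacks term
        else:
            d += 1  # not in class, lacks term
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0
```

Features would then be ranked by this score and the top k retained (the experiments use feature sets of 100 to 500).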

Gini-Index (GI):
The Gini-Index is one of the most frequently used techniques for quantifying the discrimination level of a feature. The original form of the Gini Index (GI) algorithm was used to measure the impurity of attributes with respect to classification (Alsaffar and Omar, 2014). A smaller impurity denotes a better attribute and vice versa. Under the Gini Index formula, the highest value, Gini(t) = 1, is obtained if feature t appears in every document of class $c_i$.
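The Gini Index equation itself is not reproduced in this copy of the text. As a hedged illustration only, the sketch below uses one commonly used text-categorization form, Gini(t) = Σ_i P(t | c_i)² · P(c_i | t)², which is an assumption and may differ from the variant used in the study. Under this form the score reaches 1 when t appears in every document of one class and in no other class.

```python
def gini_index(docs, labels, term):
    """Assumed form: sum over classes of P(t|c)^2 * P(c|t)^2."""
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    if not with_t:
        return 0.0
    score = 0.0
    for c in set(labels):
        n_c = labels.count(c)    # documents in class c
        n_tc = with_t.count(c)   # documents in class c that contain the term
        p_t_given_c = n_tc / n_c
        p_c_given_t = n_tc / len(with_t)
        score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return score
```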
Classification methods: This study employs three classification methods, namely Association Rules (AR), the N-gram model and meta-classification (stacking), for sentiment analysis in the Arabic language. The meta-classifier method is used to combine the two classifiers, AR and the N-gram model. This combination method is effective because it is simple to use and provides accurate results (Franco-Salvador et al., 2015). The classification methods are described below.
Association rules: Association rules are commonly used to discover data, text elements or patterns that co-occur many times within a dataset. The patterns captured by association rules can then be used to make predictions about the data. This study uses association rules generated with the Apriori algorithm. The steps of the algorithm are described below.
Frequent class terms generation:
The terms commonly used in each category are obtained during frequent term generation. Let T = [t1, t2, t3, ..., tm] be a set of terms. The candidate 1-term sets are obtained directly, and the 1-term sets that achieve the minimum support are retained to produce the candidate 2-term sets. In the next step, the frequent 2-term sets are identified using the same minimum support. The process of producing candidates and frequent m-term sets continues until no more frequent term sets can be produced from the training dataset.
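The level-wise generation of frequent term sets can be sketched as follows. This is a minimal Apriori-style illustration, assuming each review is a set of tokens and that the function is run on the reviews of one class at a time; the function name `frequent_term_sets` is hypothetical.

```python
from itertools import combinations


def frequent_term_sets(docs, min_support):
    """Return {frozenset(terms): support} for every frequent term set."""
    n = len(docs)

    def support(term_set):
        # Fraction of documents that contain every term in the set.
        return sum(term_set <= doc for doc in docs) / n

    # Frequent 1-term sets.
    terms = {t for doc in docs for t in doc}
    scored = {frozenset([t]): support(frozenset([t])) for t in terms}
    current = {ts: s for ts, s in scored.items() if s >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-term sets into k-term candidates.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        scored = {c: support(c) for c in candidates}
        current = {c: s for c, s in scored.items() if s >= min_support}
    return frequent
```

The loop stops as soon as a level yields no frequent term set, which matches the termination condition described above.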

Class association rule generation:
The parameters of a class association rule are limited to the rule head and the rule body, and the rules applied by the classifier have only a category label as the head. Generally, a class association rule (CAR) is expressed as $T_j \rightarrow C_i$, where $T_j$ is a set of frequent terms of the form [t1 & t2 & … & tm], referred to as the rule terms, and $C_i$, the category of the rule, is known as the rule head. Each rule is accompanied by its support and confidence. A CAR is labeled strong and frequent if it reaches the minimum confidence (min. conf.) and minimum support (min. sup.) thresholds. The support of the rule $T_j \rightarrow C_i$ is the percentage of reviews that contain the rule terms $T_j$ and belong to the observed class $C_i$:
$$\mathrm{Support}(T_j \rightarrow C_i) = \frac{C(T_j \rightarrow C_i)}{N}$$
where $C(T_j \rightarrow C_i)$ is the count of reviews in the dataset that match the terms of the rule and belong to its category, and N is the cumulative count of reviews in the class dataset. The confidence of the rule $T_j \rightarrow C_i$ is calculated as:
$$\mathrm{Confidence}(T_j \rightarrow C_i) = \frac{C(T_j \rightarrow C_i)}{C(T_j)}$$
where $C(T_j)$ is the count of reviews that match the rule terms $T_j$.
Prediction of new review class:
Many methods are used to classify reviews based on AR. This study focuses on two of them: the ordered decision list (single rule prediction) and majority voting (multiple rules prediction). Majority voting proceeds as follows. The first step is to search the list of rules. The second step is to identify the whole set of rules that match the test review. In the third step, if all the retained rules belong to the same class, the review is assigned to that class. Otherwise, the review is assigned to the class held by the majority of the retained rules.
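Rule support, rule confidence and majority voting can be illustrated together. The sketch assumes reviews are sets of tokens and takes N as the total number of reviews passed in, which is one reading of the definition above; `rule_stats` and `majority_vote` are hypothetical names.

```python
from collections import Counter


def rule_stats(docs, labels, terms, cls):
    """Support and confidence of the rule terms -> cls."""
    terms = set(terms)
    match_terms = sum(terms <= doc for doc in docs)
    match_rule = sum(terms <= doc and lab == cls
                     for doc, lab in zip(docs, labels))
    support = match_rule / len(docs)
    confidence = match_rule / match_terms if match_terms else 0.0
    return support, confidence


def majority_vote(rules, review, default=None):
    """rules: iterable of (terms, cls); review: set of tokens."""
    # Keep only the rules whose terms all occur in the review.
    votes = Counter(cls for terms, cls in rules if set(terms) <= review)
    return votes.most_common(1)[0][0] if votes else default
```

When no rule matches, the sketch falls back to `default`. Ties are resolved in favor of the class encountered first, which is an arbitrary choice for the example.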

N-gram classifier:
An N-gram is a contiguous slice of N characters taken from a longer string. Classification with the N-gram classifier is described below.

Generating N-gram frequency profiles:
The first step is the generation of profiles. It is a very straightforward process that reads the incoming text and counts the occurrences of all N-grams. First, the text is divided into tokens made up only of letters and apostrophes. Each token is then scanned to generate all possible N-grams, from N = 1 to N = 7. Each N-gram is looked up in a table and its counter is incremented. Finally, all N-grams and their counts are sorted in reverse order by number of occurrences. The resulting file is the N-gram frequency profile of a review.
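The profile generation steps can be sketched in a few lines. The token pattern (Unicode letters and apostrophes) and the absence of padding around tokens are assumptions made for the example; `ngram_profile` is a hypothetical name.

```python
import re
from collections import Counter

# Tokens made up of letters and apostrophes only (digits and "_" excluded).
TOKEN = re.compile(r"(?:[^\W\d_]|')+")


def ngram_profile(text, n_min=1, n_max=7):
    """Return (ngram, count) pairs sorted by decreasing count."""
    counts = Counter()
    for token in TOKEN.findall(text):
        for n in range(n_min, n_max + 1):
            # Slide a window of n characters across the token.
            for i in range(len(token) - n + 1):
                counts[token[i:i + n]] += 1
    return counts.most_common()
```

Tokens shorter than N simply contribute no N-grams of that length.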

Comparing and ranking N-gram frequency profiles:
A similarity measure is used to compare the profile of each category (negative and positive) with the profile of the new review. First, the profile of each class (negative and positive) is calculated. The closeness of each class profile to the profile of the new (unknown) review is then computed, and the review is assigned to the class whose profile is closest. This study uses cosine similarity to measure the similarity between each class profile, $CP_i$, and the profile of the new review, $D$:
$$\cos(CP_i, D) = \frac{\sum_{k=1}^{m} CP_{ik}\,D_k}{\sqrt{\sum_{k=1}^{m} CP_{ik}^2}\;\sqrt{\sum_{k=1}^{m} D_k^2}}$$
where k = 1, 2, 3, ..., m runs over the N-grams in the class profile, $CP_i$, and the profile of the new review, $D$.
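A sketch of the comparison step, assuming profiles are stored as dictionaries that map each N-gram to its frequency (N-grams missing from a profile count as zero). `cosine_similarity` and `classify` are hypothetical names.

```python
import math


def cosine_similarity(p, q):
    """p, q: dicts mapping N-gram -> frequency."""
    # Only N-grams present in both profiles contribute to the dot product.
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def classify(review_profile, class_profiles):
    """Pick the class whose profile is most similar to the review."""
    return max(class_profiles,
               key=lambda c: cosine_similarity(class_profiles[c], review_profile))
```

Selecting the highest cosine similarity corresponds to selecting the shortest distance between profiles.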

Meta-Classification (Stacking):
For combination using a meta-classifier, the class-label outputs of the component classifiers are treated as new features for meta-learning. The stacking combination consists of two phases. In the first phase, a set of base-level classifiers is generated. In the second phase, a meta-level classifier is learned that combines the outputs of the base-level classifiers. In this study, Naïve Bayes (NB) is used as the meta-classifier to combine the outputs of the two base classifiers, the association rule classifier and the N-gram classifier. Given the outputs of the two classifiers, $O_1$ and $O_2$, the NB meta-classifier is:
$$P(c_i \mid O_1, O_2) = \frac{P(c_i)\,P(O_1 \mid c_i)\,P(O_2 \mid c_i)}{P(O_1, O_2)}$$
where $P(c_i \mid O_1, O_2)$ is the posterior probability of class $c_i$ given the outputs of the two classifiers, $O_1$ and $O_2$, $P(c_i)$ is the prior probability of the class, and $P(O_k \mid c_i)$ is the probability of output $O_k$ given class $c_i$.
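A minimal sketch of a Naïve Bayes meta-classifier over the two base-classifier outputs is given below. It assumes each training row is a pair of predicted labels (one from the AR classifier, one from the N-gram classifier) and adds Laplace smoothing, which the paper does not mention. The class name `NaiveBayesMeta` is hypothetical.

```python
from collections import Counter, defaultdict


class NaiveBayesMeta:
    def fit(self, outputs, labels):
        """outputs: list of (o1, o2) base predictions; labels: true classes."""
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.n = len(labels)
        self.values = sorted({o for row in outputs for o in row})
        # (classifier position, true class) -> counts of predicted labels
        self.cond = defaultdict(Counter)
        for row, c in zip(outputs, labels):
            for j, o in enumerate(row):
                self.cond[(j, c)][o] += 1
        return self

    def posterior(self, row):
        scores = {}
        for c in self.classes:
            s = self.prior[c] / self.n
            for j, o in enumerate(row):
                # Laplace smoothing (assumption, not stated in the paper).
                s *= (self.cond[(j, c)][o] + 1) / (self.prior[c] + len(self.values))
            scores[c] = s
        z = sum(scores.values())  # normalizing term, plays the role of P(O1, O2)
        return {c: s / z for c, s in scores.items()}

    def predict(self, row):
        post = self.posterior(row)
        return max(post, key=post.get)
```

In a stacking setup the training rows would normally come from held-out (for example cross-validated) predictions of the base classifiers, so that the meta-classifier is not fitted on outputs the base classifiers produced for their own training data.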
Experimental setup: Several experiments were conducted to assess the proposed model, including an evaluation of the classification approach on the Opinion Corpus of Arabic (OCA).
All the algorithms were evaluated using K-fold cross-validation. This stage is necessary because it tunes the parameters used to select the best methods for sentiment analysis in the Arabic language. To measure the performance of the classification methods, the experimental results were grouped as follows: True Positive (TP) is the set of reviews correctly assigned to the category; False Positive (FP) is the set of reviews incorrectly assigned to the category; True Negative (TN) is the set of reviews correctly not assigned to the category; and False Negative (FN) is the set of reviews incorrectly not assigned to the category. This study uses the F1 and Macro-F1 measures. The following equations define these metrics:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \text{Macro-}F_1 = \frac{1}{|C|}\sum_{i=1}^{|C|} F_{1,i}$$
where $|C|$ is the number of classes and $F_{1,i}$ is the F1 score of class $c_i$.
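The measures can be computed per class from the TP, FP and FN counts, with Macro-F1 taken as the unweighted mean of the per-class F1 scores. The sketch below is illustrative; `f1_scores` is a hypothetical name and the zero-division fallbacks are a choice made for the example.

```python
def f1_scores(y_true, y_pred):
    """Return ({class: (precision, recall, f1)}, macro_f1)."""
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[c] = (precision, recall, f1)
    macro_f1 = sum(f for _, _, f in per_class.values()) / len(per_class)
    return per_class, macro_f1
```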

RESULTS AND DISCUSSION
Both classifiers, the N-gram model and Association Rule (AR), were first applied to the whole document-term feature space, without feature selection or reduction, to examine their overall performance for sentiment analysis in the Arabic language. Table 1 tabulates the results obtained with the Association Rule (AR) classifier and the N-gram model without the use of FS methods.
The subsequent experiments examine the effect of the individual FS methods on classifier performance. Table 2 highlights the best results achieved when the AR classifier was implemented with the Information Gain, Chi-square and Gini index feature selection methods using feature sets of varying sizes (100 to 500). All three FSM enhanced the performance of the Association Rule classifier, as can be inferred from a comparison of the results with and without FSM. Table 2 shows that the highest accuracy appears in line 6, with Chi-square feature selection of size 500: when the minimum support is 0.40 and the minimum confidence is 0.60, Precision, Recall and F-measure were 86.800, 86.821 and 86.811 respectively.
The N-gram classifier was then applied to the test set. Table 3 presents the performance of the N-gram classifier with the three FS methods, Gini index, Chi-square and Information Gain, in terms of Precision, Recall and F-measure for sentiment analysis in the Arabic language. The N-gram sizes (N) used in the experiments were 3, 4, 5, 6 and 7 characters. The highest accuracy appears in row 6, with Chi-square FS and a feature set of size 500, where Precision, Recall and F-measure were 85.800, 87.6 and 86.7 respectively. Most of the high-accuracy results were obtained when N is 5 characters or fewer. The reason is that most words in the Arabic language consist of 5 or fewer characters, so the accuracy may drop as N increases.
Table 4 presents the performance of the combination approach (meta-classifier) with the three feature selection methods, Chi-square, Gini index and Information Gain, in terms of Precision, Recall and F-measure for sentiment analysis in the Arabic language. The base classifiers were investigated with different settings. Table 4 shows that feature selection had a noticeable effect on Arabic sentiment analysis when the combination method (meta-classifier) was applied. The best results were achieved with a meta-classifier that combines the association rule classifier with chi-square feature selection of size 500 and the N-gram classifier with chi-square feature selection of size 500. This combination recorded the best result for Arabic sentiment analysis, an F-measure of 90.806.
Generally, the use of Feature Selection (FS) improved the performance of all classifiers (Tables 1 to 4). The results also show that the meta-classifier combination approach is suitable for sentiment analysis tasks in the Arabic language: the results obtained from the combination method (meta-classifier) are better than those obtained from the AR classifier and the N-gram model. This indicates that the proposed technique is a suitable method for sentiment classification. Fig. 2 compares the performance of this study with previous studies that use the same dataset. The results in Fig. 2 also reveal that the proposed model (meta-classifier) outperformed the baseline classifiers, the individual classifiers and the classifiers used in the previous study.

Fig. 2: Comparison of performances of meta-classifier combination with the baseline classifiers, individual classifiers and classifiers used in the previous study

CONCLUSION
This study has presented a wide comparative study of three FSM and three classification approaches for sentiment analysis in Arabic social media. The principal contribution of this study is the examination of the performance of different Feature Selection Methods (FSM) and machine learning approaches in terms of the F-measure. The results also demonstrate that the three FSM produced enhanced results in comparison with the results obtained using the original classifiers. Among these FS methods, chi-square yields the best results, while the meta-classifier combination shows the best performance, with an F-measure of 90.807 for sentiment analysis in the Arabic language.

Table 1: Performance (precision, recall and F-measure) of the N-gram model and association rule classifiers

Table 2: Performance of the AR approach for sentiment analysis in the Arabic language through the use of three FS with different

Table 4: Performance of meta-classifier combination for sentiment analysis in the Arabic language with the three feature selections of varying sizes