Information Gain Based Feature Selection for Improved Textual Sentiment Analysis

Sentiment analysis or opinion mining is the process of mining the emotion from a given text. It is a text mining technique that effectively measures the inclination of public opinion and aids in analysing the subjective information in a given context. Sentiment analysis classifies the opinion expressed in a text as positive, negative or neutral. Sentiments are highly specific to the underlying content and play a crucial role in depicting the real-world scenario. Sentiment analysis can be performed at three levels, namely the document level, sentence level and feature level. This paper proposes a novel Information Gain based Feature Selection algorithm that selects highly correlated features by removing inappropriate content. Using this algorithm, extensive sentiment analysis is performed at the document, sentence and feature levels. Datasets from Cornell and Kaggle are used for experimental purposes. Experimental results show that, compared to other baseline classifiers, the proposed Information Gain based classifier achieved an accuracy of 95%, 96.3% and 97.4% at the document, sentence and feature levels respectively. The proposed method is also tested with higher dimensional datasets, namely the Movielens 1M, 10M and 25M datasets. Experimental results show that the proposed method works well even for high dimensional datasets.


Introduction
Sentiment analysis uses Natural Language Processing (NLP) and Machine Learning (ML) algorithms that directly quantify the sentiments towards products, people or their ideas and assist in revealing the contextual polarity. Sharing and expressing one's views and opinions has never been simpler than in the current era. Social media generates a large volume of data, and this enormous volume of data is very difficult to intercept, describe and analyse. Sentiment analysis is one effective solution to identify, extract, measure and study the context of such big data. It is a computational approach in which the sentiments, opinions and subjectivity of the text are analysed [1,2]. The applications of sentiment analysis extend widely from healthcare, business, e-commerce and financial services to social networks and marketing [3,4]. It is also used for classifying product reviews, movie reviews, social media news, online reviews, English texts in documents and many more [5,6].
Document level sentiment analysis determines the overall opinion about a document. It assumes that each document expresses opinions on a single entity, and the entire document is classified as positive or negative [7]. For example, if a customer reviews a product, the system determines the overall opinion and generates a prediction model. One drawback of document level sentiment analysis is that multiple-entity evaluation and comparison are not possible [8]. Sentence level sentiment analysis determines whether each sentence expresses an opinion. It discriminates objective sentences expressing factual information from subjective sentences expressing opinions. To measure the effectiveness of sentence level sentiment analysis, the following is done.
i. Identify whether the sentence expresses an opinion or not.
ii. Then, assess the polarity of the opinion.

One major drawback of sentence level sentiment analysis is that prediction becomes difficult if an objective sentence carries opinions. Sentiment analysis at the feature level or aspect level requires a finer level of analysis with the help of NLP techniques. The opinions are characterized by a polarity and an opinion target. To measure the effectiveness of feature level sentiment analysis, the following is done.
i. Identify the entity and the aspect of the entity in question.
ii. Then, assess the opinion on each aspect.

Feature level analysis in general has three stages, namely identification and extraction of object features, determination of the polarity of opinions and estimation of sentiments. Feature level sentiment analysis provides a more fine-grained analysis of opinion targets and has a wider range of applications in e-commerce [9,10].
This paper proposes a novel Information Gain based Feature Selection algorithm that selects highly correlated features for classification. Using this algorithm, sentiment analysis is performed at the document, sentence and feature levels. The key contributions of this paper are:

i. Build an efficient Information Gain based Feature Selection algorithm.
ii. Gather features for extensive sentiment analysis.
iii. Perform sentiment analysis at the document, sentence and feature levels using the proposed algorithm.
iv. Test the performance on high dimensional datasets.

Related Work
Sentiment analysis in a textual context classifies subjective information based on the correlation between the selected features. Many machine learning algorithms are used for sentiment analysis, improving the accuracy at the document, sentence and feature levels of textual data [11]. A three-step approach was used for analysis in [12], which applies different preprocessing, feature extraction and postprocessing techniques. The authors in [13] presented an approach which recognizes the polarity of a sentence as positive, neutral or negative according to the semantic information learned from news articles and blogs. Sentence level analysis obtains the relevant categories of a single sentence relating to grammar, content and punctuation. A method was proposed in [14] to extract sentence level features and domain-sensitive features based on a word disambiguation technique. The performance of the supervised analysis increased when combining sentence level features and domain-sensitive features. The authors in [15] proposed a feature extraction technique using Information Gain for aspect level sentiment analysis. The textual data is preprocessed, and information related to the lemmas, grammatical relations and Part-Of-Speech (POS) tags is obtained. Sentiment analysis is performed for movie reviews and Twitter data based on different algorithms. In [16], the unigram and bigram frequency values of movie reviews were explored using Naïve Bayes (NB), maximum entropy and Support Vector Machines (SVM), and in [17] Twitter data was analyzed using the Naïve Bayes classification algorithm with features obtained by an Information Gain based approach. The sentiments of Twitter messages are also classified by exploring the utility of the linguistic features obtained at the feature selection level [18].
The authors in [19] proposed a text classification process using machine learning techniques. Various classification algorithms, namely NB, SVM, Logistic Regression (LR), Decision Tree (DT) and K-Nearest Neighbor (KNN), are used for scientific textual sentiment analysis [20]. The authors in [21] proposed an opinion question answering system for differentiating news stories at the document and sentence levels using Bayesian classifiers. Along with machine learning, deep learning methods are also used to solve sentiment polarity problems, as discussed in [22]. Deep convolutional neural networks based on the Stanford Sentiment Treebank (SSTb) and Stanford Twitter Sentiment corpus (STS) parse tree methods were used by the authors in [23] for sentence level analysis. The authors in [24] proposed an approach for sentiment analysis in which the utility of linguistic features is explored for analyzing Twitter messages. Various algorithms, namely ID3, NB, DT and Random Forest, are used in [25] for classifying sentiments in English texts in both balanced and unbalanced datasets. The authors in [26] proposed a classifier in which the polarity of subjective phrases is predicted using the Dictionary of Affect in Language (DAL) for lexical scoring with WordNet.
Sentiment analysis detects opinions, attitudes and feelings expressed in text using automated tools [27]. The authors in [28] proposed an experimental work for aspect-level analysis of movie reviews based on a new kind of domain-specific feature-based heuristic approach, selecting linguistic features comprising adverbs, adjectives and n-gram features. Latent Factor Models (LFM), which are based on Collaborative Filtering (CF), give good prediction accuracy [29]. The Explicit Factor Model (EFM) is used to produce recommendations with better prediction accuracy. It is essential to consider the importance of sentences when classifying documents. A sentence classification model using deep neural networks was proposed in [30], where a gate mechanism was used for automatically determining the degree of each sentence. Hence, for analyzing sentiments with good accuracy, an Information Gain based Feature Selection algorithm is proposed which assesses the sentiments at all three levels, namely the document, sentence and feature levels.

Proposed Work
This paper proposes a simple and novel feature selection algorithm based on the concept of Information Gain. The proposed Information Gain technique is able to extract highly correlated feature vectors in optimal time, which enables more accurate sentiment classification. The proposed algorithm fits and works well when tested at the document, sentence and feature levels. The algorithm selects the features that have the highest relevance and occurrence with respect to the target class, and the best application performance is obtained because most of the irrelevant and redundant features are removed.
• The algorithm estimates the mutual information between each feature and the class label.
• The features with the highest correlations are extracted.
• The relevance between the attributes is measured using Information Gain as shown in Eq. (7). The mutual information between variables S and R, denoted I(S;R), is given by Eqs. (5) and (6): I(S;R) = H(S) − H(S|R) = Σ_{s,r} p(s,r) log( p(s,r) / (p(s)p(r)) ).
With respect to a class C and an attribute A, the Information Gain I is calculated according to Eq. (7): I(C;A) = H(C) − H(C|A), where H(C|A) is the conditional entropy of the class C given the attribute A. When H(C|A) = 0, the attribute fully determines the class and I(C;A) reaches its maximum value H(C).
The algorithm finds an initial set of features and computes the IG for all of them. The first feature is taken, and the algorithm appends the features most highly correlated with it. This procedure is repeated until an optimal subset is reached. The IG is computed, and the final set of features obtained is used to predict the sentiments at the document, sentence and feature levels.
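The selection procedure above can be sketched in Python. This is a minimal illustration rather than the authors' implementation: it scores each discrete feature by I(C;A) = H(C) − H(C|A) and keeps those whose gain exceeds a threshold; the toy feature names and data are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """I(C;A) = H(C) - H(C|A) for one discrete feature."""
    n = len(labels)
    h_c_given_a = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        h_c_given_a += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_c_given_a

def select_features(features, labels, threshold=0.0):
    """Keep features whose IG exceeds the threshold,
    ordered from most to least informative."""
    scored = sorted(((information_gain(col, labels), name)
                     for name, col in features.items()), reverse=True)
    return [name for gain, name in scored if gain > threshold]

# Toy example: presence (1) / absence (0) of two terms in 4 documents.
labels = ["pos", "pos", "neg", "neg"]
features = {
    "great": [1, 1, 0, 0],   # perfectly separates the classes, IG = 1
    "movie": [1, 0, 1, 0],   # carries no class information, IG = 0
}
print(select_features(features, labels))  # -> ['great']
```

Here selection reduces to ranking by IG; the iterative subset growth described above would additionally re-score candidates against the features already chosen.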

Document Level Sentiment Analysis
For performing document level sentiment analysis, the documents are aggregated and preprocessed. Document level sentiment analysis considers the entire document as a single entity and classifies the document as either positive or negative. This paper considers unigrams, bigrams and trigrams for evaluating the underlying sentiment. Term Frequency-Inverse Document Frequency (TF-IDF) is used to calculate the n-gram feature vectors as shown in Eq. (8): w(i, d) = TF_id × log(|D| / df_i), where TF_id is the number of occurrences of the i-th feature in the document d, |D| is the total number of documents, and df_i is the number of documents containing the i-th feature.
The algorithm performs document level analysis by extracting the opinion words obtained from the POS tagging technique for a set of documents. A seed list is prepared and candidate words are searched using n-grams according to Eq. (8). The TF-IDF is calculated and the IG is computed for all features in the feature set. The resultant attributes are added to the vocabulary, from which the positive and negative feature sets are obtained.
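The TF-IDF weighting of Eq. (8) can be illustrated with a short Python sketch. For brevity it tokenizes on whitespace and scores unigrams only (the full method also uses bigrams and trigrams), and the sample reviews are invented:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights per Eq. (8): w(i, d) = TF_id * log(|D| / df_i),
    where TF_id counts occurrences of term i in document d and
    df_i counts documents containing term i."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = ["a truly great film", "a dull film", "great acting great film"]
w = tfidf(docs)
# "film" appears in every document, so its weight is 0 everywhere;
# "great" is weighted highest in the third review, where it occurs twice.
```

Terms with non-zero weights would then feed the IG computation described above.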

Sentence Level Sentiment Analysis
Performing sentiment analysis at the sentence and sub-sentence level identifies fine-grained expressions, polarities, the strength of the expression and its relationship to the subject. Each sentence is POS tagged, and a dependency tree is created for it using the Stanford lexicalized parser [31]. The proposed sentence level sentiment analysis algorithm outputs the overall sentiment of the sentence by completely analyzing the phrases within sentences. The polarity of individual phrases is determined by Default Polarity Calculation (DPC) with WordNet; this method starts by identifying positive or negative words using the General Inquirer (GI) [32]. When a word is found in the GI list, the WordNet dictionary is searched for semantically related words with respect to the GI list.
The sentence level sentiment algorithm performs POS tagging and creates a dependency tree for a given set of input sentences. The individual phrases are identified and their polarities are calculated using the DPC method. The algorithm finds the IG for each feature, and the same procedure as discussed in Sect. 3.1 is repeated for finding the positive and negative sentiments.
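The phrase-level polarity step can be sketched as follows. This is a simplified stand-in: the small POSITIVE/NEGATIVE/NEGATIONS word lists below replace the General Inquirer lists and WordNet look-ups used in the actual DPC method, and phrases are taken as pre-split strings rather than nodes of a dependency tree.

```python
# Stand-in seed lexicons; the real method uses the General Inquirer
# word lists with WordNet look-ups for words not covered directly.
POSITIVE = {"good", "great", "excellent", "enjoyable"}
NEGATIVE = {"bad", "dull", "boring", "awful"}
NEGATIONS = {"not", "never", "no"}

def phrase_polarity(phrase):
    """Default polarity sketch: sum word polarities,
    flipping the sign after a negation word."""
    score, flip = 0, 1
    for word in phrase.lower().split():
        if word in NEGATIONS:
            flip = -1
            continue
        if word in POSITIVE:
            score += flip
        elif word in NEGATIVE:
            score -= flip
        flip = 1  # negation only affects the next sentiment word
    return score

def sentence_sentiment(phrases):
    """Aggregate phrase polarities into an overall sentence label."""
    total = sum(phrase_polarity(p) for p in phrases)
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"

print(sentence_sentiment(["the plot was not dull", "great acting"]))  # -> positive
```

Note how "not dull" contributes a positive score: the negation inverts the polarity of the following sentiment word, which is the behaviour the dependency-tree analysis makes precise.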

Feature Level Sentiment Analysis
In feature level analysis, TF-IDF is used for feature selection. The Document Frequency (DF) thresholding technique has been shown in the literature to suffer from overfitting, since it selects the features that appear most frequently in the training data. Because DF thresholding suffers from overfitting, the TF-IDF method is chosen instead. The proposed feature level sentiment analysis algorithm reduces redundant and unrelated features, thus classifying the sentiments effectively.
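One way to realize TF-IDF based selection is sketched below, under the assumption that terms are ranked by their maximum TF-IDF weight across the corpus; unlike DF thresholding, this down-weights terms such as "film" that appear in every review even though their raw frequency is high.

```python
import math
from collections import Counter

def top_tfidf_features(docs, k):
    """Rank vocabulary terms by their maximum TF-IDF weight across
    the corpus and keep the top k. DF thresholding would instead keep
    the most frequent terms, which are often the least discriminative."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    best = {}
    for toks in tokenized:
        tf = Counter(toks)
        for t, c in tf.items():
            w = c * math.log(n / df[t])  # Eq. (8)
            best[t] = max(best.get(t, 0.0), w)
    return [t for t, _ in sorted(best.items(), key=lambda x: -x[1])[:k]]

docs = ["great great film", "a dull film", "a fine film"]
selected = top_tfidf_features(docs, 2)
# "film" occurs in every document (highest DF) yet gets weight 0,
# so it is never selected; "great" ranks first.
```

The surviving terms would then be scored with IG as in the earlier levels.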

Classification
The classification algorithm classifies the sentiments at each of the three levels as positive or negative.

Experimental Results
The datasets for document level, sentence level and feature level sentiment analysis are taken from the Cornell repository, consisting of 2000 documents and 10,000 sentences, and the Kaggle repository, consisting of 70,000 features, respectively. Table 1 shows the details of the datasets used. The results of document, sentence and feature level sentiment analysis are validated and verified using metrics such as Precision, Recall, F1-Score and Accuracy. The experiments proved that the proposed algorithm outperformed all the other algorithms by providing the highest accuracy at all three levels. Figures 1, 2 and 3 show the Precision, Recall and F1-Score of the proposed and other existing classifiers. Figure 4 compares the accuracy of the proposed algorithm with other baseline classifiers. This shows that the proposed Information Gain based classifier achieved the highest accuracy at the document, sentence and feature levels.
We also analyzed the performance of the proposed Information Gain based Feature Selection algorithm at the document, sentence and feature levels for high dimensional datasets. The Movielens datasets with 1 million, 10 million and 25 million ratings were chosen for experimental purposes. Table 2 presents the Precision, Recall and F1-Score of the proposed and the existing classifiers using the Movielens 25M dataset. Figure 5 shows the accuracy of the proposed algorithm on the Movielens 1M, 10M and 25M datasets for all three levels. Table 3 compares the accuracy of the proposed and existing algorithms on Movielens 1M, 10M and 25M at the document, sentence and feature levels respectively. For higher dimensional datasets, the proposed method works better and outperforms the other existing baseline classifiers in terms of Precision, Recall, F1-Score and Accuracy.

Conclusion
Sentiment analysis is a frequently used text classification tool that evaluates the underlying sentiment and classifies it as positive or negative. This paper proposes an Information Gain based Feature Selection algorithm. Using this algorithm, an extensive sentiment analysis is performed at the document, sentence and feature levels. The proposed Information Gain based Feature Selection method selects the features that have the greatest Information Gain and the highest occurrence. Experimental results proved that the proposed Information Gain model produced higher accuracies of 95%, 96.3% and 97.4% at the document, sentence and feature levels than the existing baseline classifiers. The proposed method is also tested with the Movielens datasets with 1 million, 10 million and 25 million ratings. Despite the size of these datasets, the proposed algorithm outperformed the state-of-the-art classifiers.
Funding Not applicable.
Data Availability Enquiries about data availability should be directed to the authors.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Ethical Approval This article does not contain any studies with human participants performed by any of the authors.