Sentimental text mining based on an additional features method for text classification

Owing to the emergence of the Internet and its rapid growth, people can use mobile devices on many social media platforms (blogs, Facebook forums, etc.), and the platforms provide well-known websites for people to express and share their daily activities and ideas on global issues. Many consumers utilize product review websites before making a purchase. Many well-known websites are searched for relevant product reviews and experiences of product use. We can easily collect large amounts of structured and unstructured product data and further analyze the data to determine the desired product information. For this reason, many researchers are gradually focusing on sentiment analysis or opinion exploration (opinion mining) and use this technique to extract and analyze customer opinions and emotions. This paper proposes a sentimental text mining method based on an additional features method to enhance accuracy and reduce implementation time and uses singular value decomposition and principal component analysis for data dimension reduction. This study has four contributions: (1) the proposed algorithm for preprocessing the data for sentiment classification, (2) the additional features to enhance the accuracy of the sentiment classification, (3) the application of singular value decomposition and principal component analysis for data dimension reduction, and (4) the design of five modules based on different features, with or without stemming, to compare the performance results. The experimental results show that the proposed method has better accuracy than other methods and that the proposed method can decrease the implementation time.


Introduction
Social media and online activities (e.g., chat rooms, e-commerce, and blogs) generate volumes of data large enough to be classified as big data, and large amounts of structured and unstructured data can now be collected easily. To find valuable information, the collected data must be extracted and analyzed. Many researchers have proposed automatic text categorization and data analysis methods; these techniques include data mining, web mining, and text mining. Datasets of customers' opinions or reviews are often massive and hard to analyze, so additional approaches are required to summarize them. Many forums, product marketing websites, mobile applications, e-commerce websites, and related web resources provide platforms for consumers to express their opinions. These consumer opinions can be studied to determine public opinion and consumers' behavioral trends, which inform companies' strategies and marketing campaigns, reveal consumers' product preferences, and support reputation monitoring [1]. Review platforms have become more prevalent and are important resources for extracting and analyzing opinions. In addition, a customer who wants to buy a product will often search the Internet for relevant opinions; therefore, analyzing reviews has become increasingly important in practice. Sentiment analysis (SA) can be used to analyze people's opinions, sentiments, emotions, and attitudes expressed in texts [2]. Sentiment analysis has achieved good results in many fields, especially intelligent marketing [3], customer satisfaction [4], and sales prediction [5]. However, finding efficient features for representing the text remains a challenge.
In general, users cannot read all the reviews from the various review resources, so concise summaries are needed, and many researchers have long worked on sentiment analysis for this reason. Sentiment analysis is the study of computational methods for extracting the opinions, sentiments, emotions, and attitudes expressed in texts towards an entity [2]. Also called opinion mining, sentiment mining, review mining, or attitude analysis, it is the task of detecting, extracting, and classifying opinions. In addition, sentiment analysis focuses on the different issues addressed in a review or text [6].
In text mining, there are two main parts: (i) extracting and selecting features, and (ii) utilizing an algorithm for classification. For feature extraction and selection, [7] employs unigrams, bigrams, and parts of speech (POS) to represent movie reviews, and [8] represents data using n-gram sequences with POS tagging. For classification, [9] trains a classifier on the matrix obtained by transforming the data with the TF-IDF method. Finding the relevant features for representing a text is very challenging: because reviews usually contain fewer than 300 words, it is hard to find the features that represent the entity. In addition, [10] shows that many works do not have uniform experimental settings. To address these issues, this paper proposes additional features and an "SVD then PCA" method to enhance accuracy and reduce implementation time for text mining and, with and without stemming, designs five module experiments with different features to compare performance and explore which factors affect the classification accuracy. In summary, the objectives of this study are as follows: 1. Present a sentimental text mining method based on an additional features method to enhance the classification accuracy of big data analysis of sentiment reviews; [...]
[...] electronic word-of-mouth to post their experiences and preferences for various products. Online product reviews deliver more accessible information to enterprises for understanding the perceptions and preferences of consumers. Many previous studies on sentiment mining collected product reviews to analyze product properties because consumers review the related information to decide whether to buy a product, and a decrease in the quantity of product information could help consumers make decisions. Indeed, reviews have been seen as a diagnostic tool for reducing the uncertainty of purchasing a product [11].
[12] proposed an econometric preference measurement model to extract consumers' preferences from online product reviews. Furthermore, Archak, Ghose, & Ipeirotis [13] revealed that customers' review opinions are useful for enterprise strategies.
Sentiment mining. Sentiment analysis is a popular application in text analytics that applies data analysis to text to understand the opinions expressed. Subjective text typically conveys human moods, emotions, and feelings. SA is widely used, especially in social media analysis, and draws on techniques from natural language processing (NLP), information retrieval (IR), and structured/unstructured data mining. The main challenge is that real-world data are unstructured [10]. There have been many research efforts in recent years to obtain important and useful information from these unstructured datasets. Following [10], sentiment analysis can be divided broadly into six tasks: (1) subjectivity classification [7]; (2) sentiment classification [10,14-16]; (3) review usefulness measurement [17]; (4) lexicon creation [18]; (5) opinion spam detection [10]; and [...].
From the literature, data acquisition and preprocessing is the first step in sentiment mining, and this important step affects the whole process. The second step is to extract the features from the raw data and apply a machine learning method for classification. Therefore, this study summarized the reviews of sentiment mining for different categorization schemes and techniques, as shown in Table 1.
Feature extraction and selection. Feature extraction and selection have been widely discussed and analyzed in text mining for a long time. The aim of feature extraction is to represent documents as multidimensional vectors [23]. Feature selection or feature extraction techniques are employed to reduce the dimensionality of the corpus and improve the training time of the classifier. Feature extraction is used to extract new features by some functional mapping from all feature sets [24]. The critical problem of feature extraction is that when the extracted features have no meaning it is hard to interpret their outputs [20].
Feature selection makes the classifier more efficient by reducing the dimensionality of the corpus without reducing its accuracy. Many unsupervised feature selection methods have been proposed in the literature. The most popular methods are the document frequency (DF), term frequency inverse document frequency (TFIDF), term contribution (TC), term variance (TV), information gain (IG), mutual information (MI), and so on. Information gain has been shown to be more competitive than the other methods [5,20].
Singular value decomposition. In linear algebra, singular value decomposition (SVD) is a matrix factorization that can be used for dimension reduction; unlike the eigenvalue decomposition, which applies only to square matrices, SVD is defined for all matrices, rectangular or square. The SVD technique is used when researchers want to obtain eigenvalue-like and eigenvector-like quantities (the singular values and singular vectors) for a matrix [25]. That is, a matrix A is factorized into the product of three matrices, A = UDV^T, where U and V have orthonormal columns and D is diagonal with non-negative real entries. SVD has been applied in many fields; in many cases, the data matrix A is close to a low-rank matrix, and SVD yields the matrix B of rank k that best approximates A, where k can be chosen for each application. In SVD, the singular values can be employed as decision criteria to determine the matrix size for data dimension reduction.
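As a compact restatement of the factorization and the rank-k approximation described above, the following uses standard linear algebra notation. The symbols \sigma_i, u_i, v_i, r, and B_k are introduced here for illustration and are not taken from the paper.

```latex
\begin{align}
A &= U D V^{T} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{T},
  \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0, \\
B_k &= \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T},
  \qquad \lVert A - B_k \rVert_F = \min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F .
\end{align}
```

Here r is the rank of A, and u_i and v_i are the i-th columns of U and V. Keeping only the k largest singular values gives the closest rank-k matrix in the Frobenius norm, which is why the size of the singular values is a natural criterion for choosing k.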
Principal component analysis. Principal component analysis (PCA) [26] is a dimension reduction technique that reduces a large set of variables to a small set such that the selected principal components retain most of the information in the original data. PCA is a statistical computation that transforms correlated variables into a smaller number of uncorrelated principal components (PCs). The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. PCA is similar to factor analysis in multivariate statistics. In general, the number of components is smaller than the number of original variables in the data. PCA can be explained as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid denotes a principal component. The covariance matrix of the data is computed, together with its eigenvalues and corresponding eigenvectors. Finally, the eigenvectors are orthogonalized and normalized to unit vectors. Both SVD and PCA are global algorithms that can extract the main features of a dataset. PCA focuses more on the covariance matrix, whereas SVD focuses more on the data itself [27].
Machine learning classifiers. This study chose four popular classifiers; these classifiers are mostly employed in sentiment classification. The four classifiers are naïve Bayes (NB), support vector machines (SVM), maximum entropy (ME), and random forest (RF). Next, the four classifiers are introduced as follows.
Naïve Bayes. The naïve Bayes (NB) classifier [28] is a simple probabilistic classifier based on Bayes' theorem and is particularly appropriate when the dimensionality of the inputs is high. From Bayes' theorem, consider the probability of a particular document, d, being assigned to a class, c_j, where x_i is an individual word of the document: P(c_j | d) is proportional to P(c_j) multiplied by the product of P(x_i | c_j) over the words x_i in d. Here, P(c_j) and P(x_i | c_j) are calculated from the training data, and P(x_i | c_j) is the conditional probability of x_i appearing in a document of class c_j. Although the conditional independence assumption cannot capture real-world situations, the method is simple and has surprisingly good accuracy [28].

Table 1. Reviews of the sentiment mining for different categorization schemes and techniques.
Authors | Categorization | Techniques
Turney (2002) [19] | Polarity determination | PMI-IR

Maximum entropy. Maximum entropy (ME) is a useful tool in several NLP fields [29] that can be utilized to estimate any probability distribution. ME has been verified to be a viable and competitive algorithm in text classification. The ME principle is that when nothing is known, the distribution should be assumed to be uniform. This study is interested in ME classification, which is sometimes better than naïve Bayes for text classification [30]. ME tries to find the parameters that maximize the likelihood of the training data. [29] noted that the ME estimate of P(c|d) takes an exponential form.
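The exponential form is commonly written as follows in the text classification literature. The notation here (Z(d), \lambda_{i,c}, F_{i,c}, n_i(d)) is an assumption made for illustration, since the equation itself is not reproduced in this text.

```latex
\begin{equation}
P_{\mathrm{ME}}(c \mid d) = \frac{1}{Z(d)} \exp\!\Big( \sum_{i} \lambda_{i,c} \, F_{i,c}(d, c) \Big),
\qquad
F_{i,c}(d, c') =
\begin{cases}
1, & n_i(d) > 0 \text{ and } c' = c, \\
0, & \text{otherwise},
\end{cases}
\end{equation}
```

In this notation, Z(d) is a normalization term, \lambda_{i,c} is the weight of feature i for class c, F_{i,c} is a feature/class indicator function, and n_i(d) is the number of times feature i occurs in document d.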
For example, a feature function is triggered if the term "happy" appears and the sentiment of the document is positive. The ME classifier is a probabilistic classifier that is a type of exponential model. Unlike the NB classifier, the ME classifier does not assume that the features are conditionally independent of each other. The ME classifier can solve various text classification problems, such as language detection, topic classification, and sentiment analysis.
Support vector machine. Support vector machines (SVMs) [31][32] find a hyperplane in an n-dimensional space that distinctly separates the data points. Many hyperplanes could separate the classes, and the objective is to obtain the one with the maximal margin. Support vectors are the data points closest to the hyperplane; they determine its position and orientation and are used to maximize the margin of the classifier. Removing the support vectors alters the position of the hyperplane because these points define the SVM. To apply SVMs to text classification, texts must be transformed into vectors. For a text document d_j, let c_j ∈ {1, −1} (corresponding to positive and negative) be its class; the solution is obtained from the Lagrangian formulation of the SVM by taking the derivative with respect to the primal parameters, w [31]. If the data points cannot be separated well, they are mapped into a higher-dimensional space by a kernel function to find a separating hyperplane.
Random forest. Random forest (RF) is a flexible and easy-to-use machine learning method [33]. RF is also one of the most useful algorithms because of its simplicity and because it can be used for both classification and regression tasks. The RF classifier averages multiple decision trees built from random samples of the dataset. A decision tree partitions the dataset into smaller subsets while building a tree of decision nodes and leaf nodes. The random forest averages all trees to build a model with lower variance. RF runs efficiently on large datasets and handles a large number of variables without variable deletion. The RF classifier employs bagging and bootstrapping [33]; hence, its advantages are that (1) it reduces overfitting by averaging multiple trees, and (2) it has low variance, since multiple trees reduce the instability in classifier performance that arises when classifications differ between the training and test data.

The proposed method
The goal of sentiment classification is to classify a document, text, or review into predefined labeled categories (e.g., positive, negative, happy, sad). The most challenging aspect of sentiment classification is improving the accuracy of the result, because many factors can affect the analysis, such as the preprocessing steps, the level of sentiment classification (document or sentence), the features, the lexicons, and the machine learning methods. Previous studies have shown that results differ across feature selection techniques, such as unigrams, bigrams, POS tagging [8], n-gram sequences with POS tagging [7], and TF-IDF [9]. Ravi & Ravi [10] showed that many studies do not share the same experimental settings; hence, this paper builds on Cheng [34], extending the experiments with additional features to enhance accuracy and applying the "first SVD then PCA" method to reduce dimensions and shorten the running time of text classification. Furthermore, this study designs five module experiments with different features, with and without stemming, to compare their performance and discover the factors that affect classifier accuracy.
The procedure of the proposed method is shown in Fig 1. First, the collected dataset is employed for sentiment classification. Second, the preprocessing steps of tokenization, stop word removal, and POS tagging are carried out in R. Third, features are defined and extracted, including the term frequency-inverse document frequency (TF-IDF), the sentiment score of each document, the positive and negative term frequencies, and the numbers of adjectives and adverbs. Fourth, the classifier is trained on the data and used for prediction. Finally, the results are evaluated.
Proposed algorithm. To make the proposed method easy to follow, we use the collected data to present the six main steps of the algorithm's computational procedure.
Step 1 Dataset collection. First, we collected the most commonly utilized dataset, the Movie dataset [22,35], which consists of sentimental documents; Movie review text is no easier to classify than other review text. The dataset includes 1000 positive and 1000 negative sentiment reviews. We wrote a Microsoft Excel VBA program to import the text files, and the labeled sentiment documents were then transformed into the MS Excel format.
The second dataset was collected from the OHSUMED dataset created by Hersh et al. [36][37]. The dataset contains 23 different cardiovascular disease categories. The classes C02, C10, C11, C14, and C20 are selected for the experiment because these five classes are related to peripheral nervous system blood vessels. The class names and numbers of features are shown in Table 2.
Step 2 Preprocessing. In general, the data collected from the source contain noise and must be processed in several steps before machine learning methods are applied. This step comprises five preprocesses: tokenization, stop word removal, stemming, POS tagging, and feature extraction and manifestation [10]. Tokenization splits the text into terms and removes the punctuation marks, which do not contribute to the accuracy of the classifier. Stop words are words used frequently in an article, such as a, an, and the; these words do not improve the results and sometimes degrade them. Stemming reduces a word to its root form and ignores the POS of the word. POS tagging identifies the part of speech of each word in the text. Because the data often involve noise, feature extraction is required to help researchers obtain the relevant information. This step used two R packages, RTextTools and openNLP, to process the POS. Feature extraction is discussed in the next step. Apart from feature extraction, feature selection is also an important step that affects the analysis result.
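The tokenization, stop word removal, and stemming stages can be sketched as follows. This is a minimal Python illustration, not the R pipeline (RTextTools and openNLP) used in the study: the stop word list is a tiny hypothetical sample, and `crude_stem` is a simplified stand-in for a real stemmer.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real list is much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "is", "was", "to", "in"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix; a simplified stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "ly", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, stem=True):
    """Tokenize, remove stop words, and optionally stem."""
    tokens = remove_stop_words(tokenize(text))
    return [crude_stem(t) for t in tokens] if stem else tokens


review = "The acting was surprisingly good, and the plot moved quickly!"
print(preprocess(review))
# -> ['act', 'surprising', 'good', 'plot', 'mov', 'quick']
```

The `stem` flag mirrors the with/without stemming comparison in the module experiments: passing `stem=False` returns the unstemmed tokens.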
Step 3 Feature extraction and additional features. The study defined a feature set comprising the TF-IDF, the frequency of positive terms, the frequency of negative terms, the frequency of adjectives, and the frequency of adverbs, as shown in Table 3. This step converted all the documents into a matrix of TF-IDF weights, while the positive and negative frequencies formed another feature set. Next, we utilized POS tagging to count the numbers of adjectives and adverbs, which were then added as additional features. Table 4 presents the TF-IDF parameter descriptions, and the proposed feature extraction algorithm is shown in Algorithm 1.
Step 4 Dimension reduction. Because the TF-IDF matrix is a large sparse matrix with many zero elements, it is difficult to analyze. Hence, this step employed the "SVD then PCA" method to reduce the dimensions of the matrix. After feature extraction, the preprocessed matrix was used as the SVD input. The SVD technique was used to decompose the TF-IDF matrix such that values close to zero were set to zero. Then, the PCA technique was applied to the reduced matrix to decrease its dimensions further. The output of the PCA is shown in Table 5. After reduction, the Movie dataset is reduced from a 2000 × 46467 vector space to a 2000 × 2000 vector space.
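As an illustration of the Step 3 feature set, the sketch below computes TF-IDF weights and the four additional count features for tokenized documents. Several things here are assumptions: the weighting is the plain tf × log(N/df) variant and may differ from the parameters in Table 4 (for instance, it is not rescaled to the 0 to 1 range), the two lexicons are hypothetical miniatures, and the POS tags are assumed to be Penn Treebank style tags supplied by an external tagger.

```python
import math
from collections import Counter

# Hypothetical miniature sentiment lexicons, for illustration only.
POSITIVE = {"good", "great", "enjoyable"}
NEGATIVE = {"bad", "boring", "dull"}


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: (term count / document length) * log(N / df).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return rows


def additional_features(doc, pos_tags):
    """Count positive/negative terms and adjectives/adverbs (tags JJ*, RB*)."""
    return {
        "freq_positive": sum(t in POSITIVE for t in doc),
        "freq_negative": sum(t in NEGATIVE for t in doc),
        "freq_adjective": sum(tag.startswith("JJ") for tag in pos_tags),
        "freq_adverb": sum(tag.startswith("RB") for tag in pos_tags),
    }


# Toy input: two tokenized reviews with made-up POS tags.
docs = [["good", "plot", "good"], ["boring", "plot"]]
tags = [["JJ", "NN", "JJ"], ["JJ", "NN"]]

weights = tf_idf(docs)
extras = [additional_features(d, t) for d, t in zip(docs, tags)]
print(weights[0])
print(extras[0])
```

A term that occurs in every document (here "plot") receives a weight of zero, which is one reason the full TF-IDF matrix is sparse. The count features would be appended as extra columns to each document's TF-IDF row.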
Step 5 Classification. After Step 4, four classifiers (naïve Bayes, maximum entropy, SVM, and random forest) are trained on the processed data to classify the text into classes. This study set all parameters of the four classifiers to their default values and used ten rounds of random sampling with ten-fold cross validation to verify accuracy. The detailed description and settings are shown in Table 6.
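One possible reading of the validation protocol is ten-fold cross validation repeated over ten random shuffles, with the accuracies averaged. The sketch below follows that reading; it is an assumption, not the study's R code. The `train_and_predict` callback is a hypothetical interface that stands in for any of the four classifiers.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds.

    Assumes n_samples >= k so that no fold is empty.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, train_and_predict, k=10, repeats=10):
    """Average accuracy over `repeats` reshuffled runs of k-fold validation."""
    scores = []
    for r in range(repeats):
        for test_idx in k_fold_indices(len(labels), k, seed=r):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            predicted = train_and_predict(
                [features[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [features[i] for i in test_idx],
            )
            correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
            scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def parity_rule(train_x, train_y, test_x):
    """Toy 'classifier' that ignores training data and predicts x % 2."""
    return [x % 2 for x in test_x]


features = list(range(40))
labels = [x % 2 for x in features]
print(cross_validate(features, labels, parity_rule))  # -> 1.0
```

Each sample is held out exactly once per repeat, so every accuracy estimate is computed on data the classifier did not see during training.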
Step 6 Evaluation. This step uses accuracy to evaluate classification performance. Accuracy is calculated from the confusion matrix (Table 7) for document-level sentiment classification with positive and negative labels [9]. Because the experimental dataset contains reviews labeled as positive or negative, accuracy is computed from the confusion matrix using Eq (1):
Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
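The accuracy computation from a two-class confusion matrix can be sketched as follows. The label names and the example predictions are hypothetical and serve only to make the counts concrete.

```python
def confusion_counts(actual, predicted, positive="pos"):
    """Return (TP, TN, FP, FN) for two-class labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical labels for five reviews.
actual = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]

counts = confusion_counts(actual, predicted)
print(counts)             # -> (2, 1, 1, 1)
print(accuracy(*counts))  # -> 0.6
```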

Results and discussion
Based on the proposed algorithm, this study collects two open datasets and uses different experimental modules to conduct the experiments and compare the results with the listed methods. The datasets, collected from websites, are widely used in text classification. The two datasets are a movie review dataset and a dataset of cardiovascular disease abstracts (OHSUMED). The detailed properties of the Movie and the OHSUMED datasets are shown in Table 8.

Table 3. Feature set.
Feature | Description
TF-IDF | Use unigrams to convert the collected text into TF-IDF weights; the range is 0 to 1.
Freq. of positive | Count the positive terms in a document.
Freq. of negative | Count the negative terms in a document.
Freq. of adjective | Count the terms whose POS is adjective.

Movie review dataset
Based on different parameter settings of the TF-IDF, this study employs stemming to obtain different features, designs five module experiments for comparison with the listed methods, and discusses which factors affect classifier accuracy. The five module experiments have different settings and features, as shown in Table 9. After Step 2 and Step 3 of the proposed algorithm, the feature set has 46467 attributes. To test the effect of the different settings, the "SVD then PCA" method is compared with the listed methods, and performance is tested using ten rounds of random sampling with ten-fold cross validation. Table 10 shows that, in terms of the average accuracy of the five classifiers, the proposed method performs better with additional attributes than without them; the SVM-linear and ME methods are more accurate than the other classifiers. Table 11 compares the results with and without dimension reduction for Module 1 and Module 4 under no stemming. Overall, the proposed method performs better with additional attributes than without them, both with and without dimension reduction. The SVM and ME classifiers are more accurate in most settings. The experimental results yield three findings:
1. Feature extraction: The proposed method performs best on Module 1 and Module 4, as shown in Table 10. Module 4 obtains the highest accuracy in all of the experiments, and the number of features is reduced to 9.4% of the original (4366/46467 under no stemming in Table 9). The effect of stemming is not evident in this experiment, as shown in Table 10.

2. Additional attributes: To test the effect of additional attributes, we use the average accuracy of the five classifiers for Module 1 to Module 5, as shown in Table 10 and Fig 2. After the additional attributes are added to the feature set, the results improve, especially with the SVM-RBF method.

3. Dimension reduction: The results show that the accuracy with dimension reduction approaches the accuracy without dimension reduction, as shown in Table 11. The proposed additional attributes obtain better accuracy on the Movie dataset. Therefore, the additional attributes and the "SVD then PCA" method can enhance the performance of sentimental classification.

OHSUMED dataset
The second dataset is collected from the OHSUMED corpus, which was created by Hersh et al. [35][36]. The dataset has 50216 documents in 23 categories, as shown in Table 2. The task considered here is a multi-class classification that assigns the documents to classes C02, C10, C11, C14, and C20. The full feature set contains 29385 attributes. Table 8 shows the properties of the OHSUMED dataset, and Table 12 shows the settings and features of the five module experiments for the OHSUMED dataset, which are used to test the effect of different feature extraction settings. Next, SVD and PCA are applied to each configuration of settings so that the effect of dimension reduction can be measured. Finally, the data for each setting are randomly divided into 10 groups for cross validation.

Findings
From the experimental results, some findings are summarized as follows:
1. Attribute extraction: From Table 10 and Fig 2, the results for the Movie dataset show that Module 4 and Module 5 are better than the other modules. Module 4 achieves a higher accuracy in the overall experiments, and the number of attributes is decreased to 9.4% of the original (4366/46467 under no stemming in Table 9). Furthermore, Table 10 shows that the effect of stemming is not obvious in the experiments. In the OHSUMED dataset, as shown in Table 13 and [...] Table 12). Therefore, we find that Module 4 shows effects from stemming, which means that stemming can reduce the number of attributes and increase the computational speed.
2. Adding additional attributes: This study proposes adding additional features to improve the accuracy of text classification, i.e., adding the frequencies of positive terms, negative terms, adjectives, and adverbs. For the Movie dataset, to test the impact of adding additional attributes, this study calculates the average accuracy of the five classifiers for Module 1 to Module 5, and the results are shown in Table 10. Table 10 and Fig 3 show that adding additional attributes can increase the accuracy, especially for the SVM-RBF classifier. For the OHSUMED dataset (see Table 13 and Fig 3), the best experimental models are obtained with Module 1 and Module 4, and stemming slightly improves the average accuracy. In addition, Fig 3 shows that, without stemming, Module 1 to Module 5 perform better with additional attributes than without them.

3. Dimension reduction: From the Movie and OHSUMED dataset experiments, Tables 11 and 14 show the results with and without dimension reduction; the accuracy with dimension reduction is close to the accuracy without it. After dimension reduction, the proposed method with additional attributes obtains better results on the Movie dataset. To test whether the "SVD then PCA" method shortens the implementation time in sentimental text mining, the experiments on the two datasets were implemented in R (version 3.2.1) on an Intel i7-3770k 3.5 GHz CPU running Microsoft Windows 10. The total implementation time of the five classifiers is listed in Table 15; of the five modules, all except Module 5 show a reduced total implementation time. Therefore, adding additional attributes and dimension reduction are feasible for the proposed method.
Notes to Table 15: (1) Bold denotes that the "SVD then PCA" method reduces the implementation time; (2) The experiment was implemented in R (R-3.2.1 version) on an Intel i7-3770k, 3.5