SVM and k-Means Hybrid Method for Textual Data Sentiment Analysis

The goal of this paper is to propose a hybrid technique that improves Support Vector Machine classification accuracy using training data sampling and hyperparameter tuning. The proposed technique applies clustering to select training data and parameter tuning to optimize classifier effectiveness. The paper reports that better results were obtained with the proposed method in all experiments, compared to the results of the method presented in our previous work.


Introduction
The emergence of social networks and the spread of Internet-connected smart devices were followed by an explosion in the data available for collection and processing, which posed serious technological and computational challenges together with new attractive possibilities in the research, adoption and application of new and existing data science and machine learning techniques. It soon became obvious that novel techniques are required to effectively adopt and apply existing approaches. Support Vector Machines (abbreviated SVM) is one of the most widely used techniques and has proved its efficiency in different tasks and domains. It is very amenable to parameter tuning, as well as internal modifications, which makes it possible to improve its performance and accuracy. Pang et al. (2002), Amolik et al. (2016) and Tripathy et al. (2016) evaluated different machine learning algorithms, such as Support Vector Machines, Naïve Bayes and Maximum Entropy, on movie review sentiment classification tasks and obtained the best accuracy with SVM. Go et al. (2009), Kolchyna et al. (2015), Kharde and Sonawane (2016) and Hamoud et al. (2018) also showed that a single SVM, or SVM as part of an ensemble method, performs best at automatically classifying the sentiment of Twitter messages. Korovkinas et al. (2017) used SVM, and its combination with Naïve Bayes, for sentiment analysis in different domains: movie reviews, Twitter and Amazon reviews; SVM achieved better results than NB as a standalone method. Rathor et al. (2018) and Haque et al. (2018) showed that SVM can produce better results than other methods in sentiment analysis of Amazon product reviews. Liu and Lee (2018) reported the SVM algorithm to be the best option for email sentiment classification. Al-Smadi et al. (2018) also reported that the SVM approach outperformed a deep RNN approach in aspect-based sentiment analysis of Arabic hotels' reviews. Medhat et al. (2014), Ahmad et al. (2017) and Manikandan and Sivakumar (2018) concluded in their reviews that SVM is one of the most frequently used machine learning algorithms for solving sentiment classification problems.
However, despite all its advantages, SVM is often reported to be slow in terms of training time and general performance on big data arrays. The higher the number of features, the longer the computation time it requires. There have been a number of efforts to speed up SVM, and most of them focus on reduction of the training set (Lee and Mangasarian, 2001; Lei and Govindaraju, 2005; Graf et al., 2005; Nandan et al., 2014; Wang et al., 2014; Mao et al., 2016; Mourad et al., 2017). These authors conclude that properly selected training data can improve execution time with no or similar loss of accuracy. In Korovkinas et al. (2018) we also applied training set reduction to speed up SVM training, with only a slight decrease in accuracy.
Manual hyperparameter selection is still one of the biggest issues in practical SVM research and application. Recent literature still does not provide any heuristic rules or rules of thumb for this task (Steinwart and Christmann, 2008), and it is usually necessary to test multiple classifiers with different sets of hyperparameters to achieve satisfactory performance. Grid search is often applied to solve this problem (Chen et al., 2011; Ahmad et al., 2018); it is also often integrated into SVM-related packages, such as LibSVM (Chang et al., 2011) or scikit-learn (Pedregosa et al., 2011), to simplify research pipelines. Multiple attempts to tackle the hyperparameter optimization problem can be identified in the literature, particularly using simulated annealing (Boardman and Trappenberg, 2006), its adaptation to grid search (Jimenez et al., 2009), evolutionary techniques (Wu et al., 2007; Friedrichs and Igel, 2005), particle swarm optimization (Li-xia et al., 2011; Yongqi, 2012; Garšva and Danėnas, 2014) and the firefly algorithm (Chao and Horng, 2015). Other works focus on the combined selection of both features and hyperparameters (Maali and Al-Jumaily, 2012; Yao et al., 2009; Sunkad, 2016). Osman et al. (2017) showed empirically that optimized hyperparameters significantly improved the performance of the k-nearest neighbours algorithm, while the prediction accuracy of support vector machines either improved or was at least retained.
k-Means (MacQueen, 1967) is one of the most popular and widely known clustering techniques, used standalone or in combination with others. Gu and Han (2013) proposed the Clustered Support Vector Machine (CSVM) method, which uses k-Means to divide the data into clusters and trains a linear SVM in each of them. Yao et al. (2013) used the k-Means clustering algorithm to select the most informative samples from the original training set into a small subset for SVM training. Kurasova et al. (2014) presented an overview of techniques used for big data clustering and also identified k-Means as one of the most popular and efficient techniques. Gan et al. (2017) used k-Means to construct a pre-selection scheme which obtains a subset of important instances as the training set for SVM; they reported that the proposed KA-SVM has outstanding performance in terms of both classification accuracy and computational efficiency. Wang et al. (2018) improved spam filtering speed and accuracy using a fast content-based spam filtering algorithm with fuzzy-SVM and k-Means, where k-Means was used to compress the data while retaining most of the effective information. Its conceptual simplicity and efficiency led us to choose k-Means for evaluation in the instance selection step of our approach.
The main goal of this paper is to present a technique that increases the accuracy of the method presented in our previous paper (Korovkinas et al., 2018) and to evaluate it on textual data sentiment classification. The rest of the paper is organized as follows. Section 2 describes the SVM and k-Means clustering algorithms used in the experiments. In Section 3 our method is introduced, whereas Section 4 gives a description of the datasets and experimental settings used to evaluate the proposed approach, together with the results obtained during experimentation. Finally, Section 5 outlines the conclusions and sets guidelines for future work.

Relevant algorithms
This section describes the techniques relevant to the research presented in this paper: Support Vector Machines (Cortes and Vapnik, 1995) and its highly optimized implementation in the LibLinear library (Fan et al., 2008), as well as the k-Means (MacQueen, 1967) technique.

Support Vector Machines
Support Vector Machines (abbreviated SVM) were initially introduced in (Boser et al., 1992; Cortes and Vapnik, 1995). The linear support vector machine was originally formulated for binary classification. Given training data and its corresponding labels $(x_n, t_n)$, $n = 1, \ldots, N$, $x_n \in \mathbb{R}^D$, $t_n \in \{-1, +1\}$, SVM learning consists of the following constrained optimization:

$$\min_{w,\,\xi_n} \ \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n \quad \text{s.t.} \quad w^T x_n t_n \geq 1 - \xi_n, \ \ \xi_n \geq 0 \ \ \forall n, \qquad (1)$$

where $w$ is the weight vector, $C$ determines the trade-off between the maximum margin and the minimum classification error, and $\xi_n$ are slack variables which penalize data points that violate the margin requirements. Note that the bias can be included by augmenting all data vectors $x_n$ with a scalar value of 1. The corresponding unconstrained optimization problem is the following:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n t_n,\, 0). \qquad (2)$$

The objective of Eq. (2) is known as the primal form of the L1-SVM, with the standard hinge loss. Since the L1-SVM is not differentiable, a popular variation known as the L2-SVM minimizes the squared hinge loss:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n t_n,\, 0)^2. \qquad (3)$$

The L2-SVM is differentiable and imposes a bigger (quadratic vs. linear) loss on points which violate the margin. The class label of a test instance $x$ is predicted using

$$\arg\max_{t} \ (w^T x)\, t. \qquad (4)$$
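As a concrete illustration (ours, not code from the original experiments): the squared hinge loss of Eq. (3) is the default objective of scikit-learn's LinearSVC, which is also the implementation used later in Section 4. A minimal sketch on toy data:

```python
# Minimal sketch: training an L2-SVM (squared hinge loss, Eq. (3)) with
# scikit-learn's LinearSVC, which wraps LibLinear. Toy data for illustration only.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# loss="squared_hinge" is the LinearSVC default and corresponds to the L2-SVM;
# C is the penalty parameter of the error term from Eq. (1).
clf = LinearSVC(C=1.0, loss="squared_hinge")
clf.fit(X, y)

# Prediction follows Eq. (4): for the binary case, the sign of w^T x.
print(clf.predict(X[:5]))
print(clf.decision_function(X[:5]))
```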
k-Means

k-Means (MacQueen, 1967) is one of the oldest and most widely researched clustering algorithms. It is often preferred due to its simplicity and generally very fast performance.

The main idea is to partition the input dataset into k clusters, represented by adaptively changing centroids (also called cluster centres); they are initialized using so-called seed-points. k-Means computes the squared distances between the input data points and the centroids, and assigns each input to the nearest centroid. Formally, to solve the problem of clustering $N$ input data points $x_1, x_2, \ldots, x_N$ into $k$ disjoint subsets $C_i$, $i = 1, \ldots, k$, each containing $n_i$ data points, $0 < n_i < N$, the following mean-square-error (MSE) cost function is minimized:

$$J_{MSE} = \sum_{i=1}^{k} \sum_{x_t \in C_i} \| x_t - c_i \|^2, \qquad (5)$$

where $x_t$ is a vector representing the $t$-th data point in the cluster $C_i$ and $c_i$ is the geometric centroid of the cluster $C_i$. The algorithm seeks to minimize $J_{MSE}$, where $\| x_t - c_i \|^2$ is a chosen distance measure between the data point $x_t$ and the cluster centre $c_i$. An input data point $x_t$ is assigned to cluster $i$ if it satisfies the following condition:

$$I(x_t, i) = 1 \ \ \text{if} \ \ i = \arg\min_{j} \| x_t - c_j \|^2, \quad \text{and} \quad I(x_t, i) = 0 \ \ \text{otherwise}. \qquad (6)$$

Cluster centres $c_1, c_2, \ldots, c_k$ can be obtained with the following steps (Žalik, 2008):

Step 1: Initialize the $k$ cluster centres $c_1, c_2, \ldots, c_k$ with some initial values called seed-points, using random sampling. For each input data point $x_t$ and all $k$ clusters, repeat steps 2 and 3 until all centres converge.
Step 2: Calculate the cluster membership function $I(x_t, i)$ by Eq. (6) and assign each input data point to the one of the $k$ clusters whose centre is closest to that point.
Step 3: For all $k$ cluster centres, set $c_i$ to be the centre of mass of all points in cluster $C_i$.
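To make the three steps concrete, a minimal NumPy sketch of the iteration follows (illustrative only; the experiments in Section 4 use the KMeans module from scikit-learn):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-Means following Steps 1-3 above; illustrative only."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centres with seed-points chosen by random sampling.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the nearest centre (Eq. (6)).
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centre to the centre of mass of its cluster
        # (empty clusters keep their previous centre).
        new_centres = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centres[i] for i in range(k)])
        if np.allclose(new_centres, centres):  # all centres converged
            break
        centres = new_centres
    return centres, labels
```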

The proposed technique
The proposed hybrid method (abbreviated kmLSVM in what follows) combines three techniques: selection of the number of clusters, training data selection (denoted kMeans-Part) and SVM hyperparameter selection. The main goal is to select a representative training dataset and the penalty parameter of the error term (C) for SVM training, in order to increase the accuracy of the method presented in Korovkinas et al. (2018). The k-Means clustering algorithm is used for training dataset selection. In the experimental settings, the testing subset is assumed to be 30% of the data, so the training data amounts to 70%. "Results" is the final result set with the classified sentiment labels "positive" or "negative". Data was converted to a matrix of TF-IDF (term frequency-inverse document frequency) features. The diagram and algorithm of the proposed method are presented in Fig. 1; a detailed formalization of the algorithm, with the notation used, is given in the listing at the end of the paper. There, the function Performance(function(.)) denotes the performance obtained after running a particular function; hence, the goal of the first step is to find the cluster configuration that is optimal in terms of time and separation. Visual inspection was applied for this selection in our experiments, although other measures could be applied as well.
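To illustrate the instance-selection step (kMeans-Part), the sketch below assumes a dense feature matrix X and scikit-learn's KMeans; for each cluster it keeps the point with MAX distance to the cluster centre, the point with MIN distance, and the point closest to the MEAN distance. Function and variable names are ours, not from the original implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_part(X, k_opt, seed=0):
    """Sketch of kMeans-Part: per cluster, keep the points with MAX, MIN and
    closest-to-MEAN distance to the cluster centre."""
    km = KMeans(n_clusters=k_opt, n_init=3, max_iter=100, random_state=seed).fit(X)
    selected = []
    for i in range(k_opt):
        idx = np.where(km.labels_ == i)[0]      # members of cluster i
        if len(idx) == 0:
            continue
        d = np.linalg.norm(X[idx] - km.cluster_centers_[i], axis=1)
        selected.append(idx[d.argmax()])                     # MAX distance
        selected.append(idx[d.argmin()])                     # MIN distance
        selected.append(idx[np.abs(d - d.mean()).argmin()])  # closest to MEAN
    return np.unique(selected)  # row indices of the selected training instances
```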

Dataset
Two existing datasets are used in this paper: the Stanford Twitter sentiment corpus (sentiment140) dataset and the Amazon customer reviews dataset. The Stanford Twitter sentiment corpus dataset was introduced by Go et al. (2009) and contains 1.6 million tweets automatically labeled as positive or negative based on emoticons. The dataset is split into 70% (1.12M tweets) for training and 30% (480K tweets) for testing. The Amazon customer reviews dataset contains 4 million reviews and star ratings; it was also split into 70% (2.8M reviews) for training and 30% (1.2M reviews) for testing. The training and testing data were cleaned and preprocessed before being passed as input for training the classifier. The preprocessing pipeline included removing redundant tokens such as @ symbols, numbers, http links, punctuation symbols, empty strings, etc.
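A minimal sketch of this kind of cleanup (the patterns below are illustrative; the exact rules of the original pipeline may differ):

```python
import re

def preprocess(text):
    """Illustrative cleanup: drop links, @-mentions, hashtag symbols,
    numbers and punctuation, then collapse whitespace."""
    text = re.sub(r"http\S+", " ", text)        # http links
    text = re.sub(r"@\w+", " ", text)           # @-mentions
    text = text.replace("#", " ")               # hashtag symbols
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # numbers, punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

print(preprocess("@user Loved it!!! 10/10 http://t.co/xyz #happy"))
# -> "loved it happy"
```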

Experiments
The main goal of this research is to compare our proposed technique with the results of the method presented in Korovkinas et al. (2018). Two experiments are performed in this paper: one with the Stanford Twitter sentiment corpus dataset (sentiment140) and a second with the Amazon customer reviews dataset (Amazon reviews). We reuse the technique presented in Korovkinas et al. (2018) (abbreviated Subset30K in what follows), in which the testing dataset is divided into subsets containing 30K rows each, and apply our proposed technique (see Section 3) to it. The testing data contains 480K tweets in the first experiment and 1.2M reviews in the second. For kmLSVM, the training data is selected using the proposed method (see Section 3) and contains 70K rows of the sentiment140 training dataset in the first experiment and 70K rows of the Amazon reviews training dataset in the second. For Subset30K, the training data (70K rows of the training datasets) is selected randomly (Korovkinas et al., 2018). Table 1 shows the detailed experimental settings of the methods (see Subsection 4.3).

To see the impact of each part of our proposed method in more detail, we first perform a set of experiments with only the kMeans-Part, using the default SVM parameters of the LinearSVC module (part of the scikit-learn package (Pedregosa et al., 2011)), and then a set of experiments with the whole kmLSVM method, including the "SVM parameter tuning" part. The results are compared with Subset30K.

The Python programming language was used to implement and evaluate the proposed technique. For SVM classification, the LinearSVC module was used, implemented in terms of LibLinear (A Library for Large Linear Classification), considering its flexibility in parameter tuning and better fit for large numbers of samples (Pedregosa et al., 2011). A simple iterative search was applied to select the C parameter in the range [1, 10]. k-Means clustering is implemented using the KMeans module from the scikit-learn package. k-Means was run three times with random initialization and different seeds, with the best output selected as the final result; the number of iterations was set to 100. Data was converted to a matrix of TF-IDF features before being passed to the SVM and k-Means algorithms. To obtain a more varied training dataset, stopwords were removed before the data was passed to the k-Means input. The initial k cluster centres are defined using random sampling (the default setting of the KMeans module). A workstation with an Intel(R) Core(TM) i7-4712MQ CPU @ 2.30 GHz processor and 16.00 GB of installed memory (RAM) was used to run the experiments.
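A condensed, self-contained sketch of this setup with toy data (TF-IDF features, LinearSVC, and a simple iterative search for C in [1, 10]); it mirrors the described pipeline but is not the original experimental code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Toy stand-ins for the cleaned training/testing texts of Subsection 4.1.
train_texts = ["great movie loved it", "terrible waste of time",
               "really enjoyable film", "awful boring plot"]
train_labels = [1, 0, 1, 0]
test_texts = ["loved this film", "boring and awful"]
test_labels = [1, 0]

# Convert data to a matrix of TF-IDF features.
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)

# Simple iterative search for the penalty parameter C in [1, 10],
# keeping the value with the highest accuracy (C_opt <- arg max SVM_ACC(C)).
best_c, best_acc = 1, -1.0
for c in range(1, 11):
    clf = LinearSVC(C=c).fit(X_train, train_labels)
    acc = accuracy_score(test_labels, clf.predict(X_test))
    if acc > best_acc:
        best_c, best_acc = c, acc

print(f"C_opt = {best_c}, accuracy = {best_acc:.4f}")
```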

Effectiveness
Effectiveness is measured using statistical measures often applied to similar tasks: accuracy, precision, recall and the F1 score (Sammut and Webb, 2011). Here TP is the count of correctly classified "positive" sentiments, TN the count of correctly classified "negative" sentiments, FP the count of sentiments incorrectly classified as "positive" and FN the count of sentiments incorrectly classified as "negative":

Accuracy: $ACC = \frac{TP + TN}{TP + TN + FP + FN}$

Precision (positive predictive value): $PPV = \frac{TP}{TP + FP}$

Precision (negative predictive value): $NPV = \frac{TN}{TN + FN}$

Recall (true positive rate): $TPR = \frac{TP}{TP + FN}$

Recall (true negative rate): $TNR = \frac{TN}{TN + FP}$

Harmonic mean of PPV and TPR: $F_1 = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR}$
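For reference, these measures can be computed directly from the four counts; a plain-Python sketch with hypothetical counts:

```python
def metrics(tp, tn, fp, fn):
    """Effectiveness measures from the confusion-matrix counts above."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    ppv = tp / (tp + fp)          # precision, positive predictive value
    npv = tn / (tn + fn)          # precision, negative predictive value
    tpr = tp / (tp + fn)          # recall, true positive rate
    tnr = tn / (tn + fp)          # recall, true negative rate
    f1 = 2 * ppv * tpr / (ppv + tpr)
    return acc, ppv, npv, tpr, tnr, f1

print(metrics(tp=90, tn=80, fp=20, fn=10))
```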

Results
Two experiments were performed to evaluate the effectiveness of the proposed technique in terms of accuracy, precision, recall and F1 score (see Subsection 4.4). Fig. 2 presents the clustering step results. It was assumed that the optimal number of clusters would be selected based on the k-Means execution time (a maximum of 20 s for sentiment140 and 100 s for Amazon reviews) and that the number of clusters should be closest to 100. Visual inspection identified the optimal number of clusters as 100 for the sentiment140 dataset (Fig. 2a) and 120 for Amazon reviews (Fig. 2b). Table 2 shows the averaged results for the sentiment140 dataset, obtained using the kMeans-Part and kmLSVM techniques; they are compared to the results obtained with the 30K-row subset (Subset30K), generated according to Korovkinas et al. (2018). The average accuracy (ACC) of Subset30K is 76.87%, which is slightly higher than kMeans-Part (76.77%) but lower than kmLSVM (77.81%). Moreover, the results of kMeans-Part and kmLSVM are higher in terms of PPV, TNR and F1 score than the Subset30K average, which indicates better identification of the positive and negative classes. The results are visually depicted in Fig. 3. Table 3 shows the averaged results of Subset30K, kMeans-Part and kmLSVM for the Amazon reviews dataset. The average accuracy of Subset30K is 87.63%, which is slightly higher than kMeans-Part; however, it was outperformed by our proposed kmLSVM (88.32%). Moreover, the results of kmLSVM in terms of PPV, NPV, TPR, TNR and F1 score are higher than Subset30K as well; kMeans-Part resulted in a higher NPV (by 0.28%) and TPR (by 0.24%). Again, these results are visualized in Fig. 4.
Since the main goal of this research was to increase the effectiveness of the method proposed in our previous paper (Korovkinas et al., 2018), the comparison was carried out only between the two, to show the effectiveness of kmLSVM. It is rather difficult to directly compare the obtained results with results reported by other authors, considering the heterogeneity of testing platforms and the subtleties of implementations and their configurations. Experimental alignment of our implementation with similar approaches is among our future works.

Conclusions and future work
This paper proposes a method to improve SVM classification accuracy by subsetting the training data using clustering. The experimental results show that our method achieves higher accuracy than the method presented in our previous paper (Korovkinas et al., 2018). The main advantage of the introduced method, compared with the aforementioned one, is the training data selection. Training data for Subset30K is selected randomly, which might negatively affect accuracy in different runs; therefore, multiple runs are required for more objective results. In this paper we advocate a clustering-based instance selection method (kMeans-Part) that uses the data points with MAX, MIN and AVG distances to each cluster centre. The results obtained with kMeans-Part are comparable with the effectiveness of Subset30K (Korovkinas et al., 2018), with slightly lower accuracy but higher PPV, TNR and F1 score values. The effectiveness of the whole technique (kmLSVM) is higher than Subset30K in terms of both average accuracy and the other evaluated metrics on both the sentiment140 and Amazon reviews datasets, which we consider a positive and significant step towards more efficient sentiment analysis and SVM-based classification in general.
There are several directions for further increasing the classification accuracy of the proposed method. Advanced feature engineering techniques might have a significant impact on classifier effectiveness. Moreover, the method would benefit from more extensive application of natural language processing techniques, including part-of-speech (POS) tagging, named entity recognition, lemmatization, abbreviation resolution, relation extraction, etc. Aspect-based sentiment analysis is also one of the fields which could make use of the proposed techniques. Hence, a thorough investigation of novel approaches in the context of our proposed techniques is among our future works.

Notation and algorithm of the proposed method

D – set of data; D_size – dataset size; D_results – set of results of the SelectTrainData function; k_opt – optimal number of clusters; cluster_range – maximum number of clusters; Subset_size – size of the subsets the testing data is divided into; kMeans_res – set of results of the k-Means algorithm; Pos_train – set of positive sentiments; Neg_train – set of negative sentiments; Pos_data – set of sentiments selected from Pos_train; Neg_data – set of sentiments selected from Neg_train; Train – set of training data; Test – set of testing data; Sample_count – number of samples to select randomly; C_opt – optimal value of the SVM penalty parameter of the error term; Effect – effectiveness of the SVM classification (see Subsection 4.4).

1. Select the optimal number of clusters:
   return k_opt ← arg max(Performance(kmeans(k))), k ∈ cluster_range

2. Training data selection, SelectTrainData(D, D_size, Subset_size):
   D_results ← {}; n ← 0; m ← Subset_size
   while len(D_results) ≤ D_size:
       kMeans_res ← EvaluateKMeans(k_opt, random.sample(D, Subset_size)), keeping for each cluster: val1, the value with MAX distance to the cluster centre; val2, the value with MIN distance to the cluster centre; val3, the value closest to the MEAN distance to the cluster centre
       D_results ← D_results ∪ kMeans_res

3. Training and classification:
   Neg_data ← random.sample(Neg_train, Sample_count)
   Pos_data ← random.sample(Pos_train, Sample_count)
   train_neg, test_neg ← train_test_split(Neg_data, test_size)
   train_pos, test_pos ← train_test_split(Pos_data, test_size)
   Train ← train_pos ∪ train_neg
   Test ← test_pos ∪ test_neg
   C_opt ← arg max(SVM_ACC(C)), C_min ≤ C ≤ C_max
   SVM_opt ← trainSVM(C_opt)
   Effect, sentiment ← predict(SVM_opt)
   return Effect, sentiment

Fig. 2: Clustering step results
Fig. 3: sentiment140 results
Fig. 4: Amazon reviews results

Table 2: Results of the proposed method applied on the sentiment140 dataset

Table 3: Results of the proposed method applied on the Amazon customer reviews dataset