Support vector machine parameter tuning based on particle swarm optimization metaheuristic

This paper introduces a method for linear support vector machine parameter tuning based on the particle swarm optimization metaheuristic, which is used to find the best cost (penalty) parameter for a linear support vector machine to increase textual data classification accuracy. Additionally, majority-voting-based ensembling is applied to increase the efficiency of the proposed method. The results were compared with results from our previous research and other authors' works. They indicate that the proposed method can improve classification performance for a sentiment recognition task.


Introduction
Textual data analysis is a very challenging area. We need to understand the whole context of a sentence because even a single word can change its polarity, and this might have a significant impact in particular domains, such as medicine, stock prediction, etc. A support vector machine (SVM) is one of the most frequently used machine learning algorithms for the sentiment classification problem [2,32]. Its efficiency has been proved on difficult tasks in different domains, such as image classification [25], credit risk evaluation [9], sensor multifault diagnosis [10], monitoring metal-oxide surge arrester conditions [21], Parkinsonian disorders classification [14], forecasting stock market movement direction [38], sentiment analysis [27,28,30], etc. The authors in [44] reported that a linear SVM consistently achieves the best results compared to SVMs with different kernels, including SVM-Poly. The authors in [7,16,26] also reported the efficiency of the linear SVM for binary text classification. Unfortunately, manual hyperparameter selection remains one of the practical application issues, while recent literature still does not provide any heuristic rules or rules of thumb for this task [41]. Hence, obtaining satisfactory performance still requires training multiple classifiers with different sets of hyperparameters. Such hyperparameter optimization is mostly guided by some heuristic, like a genetic algorithm [24,6], particle swarm optimization (PSO) [45,17], or ant colony optimization [22,46]. Simple grid search is one of the most common choices for this problem [1], as it is often integrated into machine learning packages, such as LibSVM [5] or scikit-learn [13], which helps to simplify research pipelines. Particle swarm optimization is also a very promising option [19,23,29]. One of its strengths is combination with other evolutionary techniques. In [36], the authors proposed an improved quantum-behaved particle swarm algorithm based on a mutation operator. In [47], the authors presented an SVM parameter optimization technique based on intercluster distance in the feature space and a hybrid of barebones particle swarm optimization and differential evolution. In [21], a differential particle swarm optimization is applied to select parameters for support vector machines. A number of works focus on the combined selection of both features and hyperparameters [31,42].
Ensembles of classifiers are one of the most challenging areas, yet they often yield increased performance compared to single classifiers. In [34], an ensemble method based on static classifier selection, involving majority voting error and forward search, is proposed for text sentiment classification. In [35], an ensemble system based on three classifiers, combined via majority voting, is presented for the sentiment analysis of textual data. In [4], the authors reported that their ensemble voting algorithm in conjunction with three classifiers performed better on a Turkish sentiment classification problem.
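Majority voting of the kind used by these ensembles is straightforward to implement. The sketch below is illustrative and not taken from any of the cited works; the function name and the tie-breaking rule (lowest label wins on ties, via `np.argmax`) are our own choices.

```python
import numpy as np

def majority_vote(predictions):
    """Combine label predictions from several classifiers by majority vote.

    predictions: array-like of shape (n_classifiers, n_samples) with class labels.
    Ties are broken in favour of the smallest label (np.argmax picks the first max).
    """
    preds = np.asarray(predictions)
    n_samples = preds.shape[1]
    voted = np.empty(n_samples, dtype=preds.dtype)
    for j in range(n_samples):
        labels, counts = np.unique(preds[:, j], return_counts=True)
        voted[j] = labels[np.argmax(counts)]
    return voted

# Three classifiers vote on three samples
votes = [[1, 0, 0],
         [1, 0, 1],
         [0, 0, 1]]
print(majority_vote(votes))  # → [1 0 1]
```

With an odd number of classifiers (three or five, as in the experiments below) no ties can occur for binary labels.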
Motivated by these improvements, this paper proposes a simple method to improve linear support vector machine (LSVM) performance for textual data classification. The rest of the paper is organized as follows. Section 2 briefly introduces the algorithms used in the experiment. In Section 3, our method is outlined, with the evaluation thoroughly described in Section 4 together with a description of the datasets, experimental settings, and results. Finally, Section 5 outlines the conclusions and sets guidelines for future work.

Relevant algorithms
This section describes the algorithms relevant to the research presented in this paper: support vector machines [8,15] and particle swarm optimization [11].

Support vector machines
The early foundations for support vector machines were introduced in [3,8] and later extensively described in [43]. Basically, they attempt to find the best possible surface to separate positive and negative training samples in a supervised manner. In this section, we focus on the linear SVM [15], which is optimized for large-scale learning and, therefore, is used in this paper.
Nonlinear Anal. Model. Control, 25(2):266-281

Given training vectors x_i ∈ R^n, i = 1, ..., l, in two classes, and a vector y ∈ R^l such that y_i ∈ {1, −1}, a linear classifier generates a weight vector w as the model. The decision function is sgn(w^T x).
The L2-regularized L1-loss support vector classifier (SVC) solves the following primal problem:

  min_w (1/2) w^T w + C Σ_{i=1}^{l} max(0, 1 − y_i w^T x_i),

whereas the L2-regularized L2-loss SVC solves the following primal problem:

  min_w (1/2) w^T w + C Σ_{i=1}^{l} max(0, 1 − y_i w^T x_i)^2.

Their dual forms are:

  min_α (1/2) α^T Q̄ α − e^T α  subject to  0 ≤ α_i ≤ U, i = 1, ..., l,

where e is the vector of all ones, Q̄ = Q + D, D is a diagonal matrix, and Q_ij = y_i y_j x_i^T x_j. For the L1-loss SVC, U = C and D_ii = 0 for all i. For the L2-loss SVC, U = ∞ and D_ii = 1/(2C) for all i.
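For a concrete view of the two primal objectives, the following sketch evaluates them for a toy problem in NumPy. The function and data are illustrative, not part of any SVM solver; it only computes objective values for a given w, it does not minimize them.

```python
import numpy as np

def primal_objective(w, X, y, C, loss="l1"):
    """Primal objective of the L2-regularized SVC.

    L1-loss: 0.5 * w^T w + C * sum(max(0, 1 - y_i w^T x_i))
    L2-loss: 0.5 * w^T w + C * sum(max(0, 1 - y_i w^T x_i)^2)
    """
    margins = np.maximum(0.0, 1.0 - y * (X @ w))  # hinge terms per sample
    if loss == "l2":
        margins = margins ** 2
    return 0.5 * w @ w + C * margins.sum()

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
w = np.array([0.5, 0.5])
print(primal_objective(w, X, y, C=1.0))              # → 1.25 (L1-loss)
print(primal_objective(w, X, y, C=1.0, loss="l2"))   # → 0.75 (L2-loss)
```

Note how the squared hinge (L2-loss) penalizes small margin violations less severely than the plain hinge.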

Particle swarm optimization
Particle swarm optimization was introduced in [11]. Let a_i(t) denote the position of particle i in the search space at time step t; unless otherwise stated, t denotes discrete time steps. The position of the particle is changed by adding a velocity v_i(t+1) to the current position, i.e.,

  a_i(t+1) = a_i(t) + v_i(t+1),

with a_i(0) ∼ U(a_min, a_max). The velocity vector reflects both the experiential knowledge of the particle and socially exchanged information from the particle's neighborhood. For global best PSO, the velocity of particle i is calculated as

  v_ij(t+1) = v_ij(t) + c_1 r_1j(t) [b_ij(t) − a_ij(t)] + c_2 r_2j(t) [b̂_j(t) − a_ij(t)],

where v_ij(t) is the velocity of particle i in dimension j = 1, ..., n_a at time step t, a_ij(t) is the position of particle i in dimension j at time step t, b_ij(t) and b̂_j(t) are the personal best and global best positions in dimension j, c_1 and c_2 are positive acceleration constants used to scale the contribution of the cognitive and social components, respectively, and r_1j(t), r_2j(t) ∼ U(0, 1) are random values in the range [0, 1], sampled from a uniform distribution. These random values introduce a stochastic element to the algorithm.
The personal best position b_i associated with particle i is the best position the particle has visited since the first time step. Considering minimization problems, the personal best position at the next time step, t + 1, is calculated as [12]

  b_i(t+1) = b_i(t)     if f(a_i(t+1)) ≥ f(b_i(t)),
  b_i(t+1) = a_i(t+1)   if f(a_i(t+1)) < f(b_i(t)),

where f : R^{n_a} → R is the fitness function. As with evolutionary algorithms, the fitness function measures how close the corresponding solution is to the optimum, i.e., it quantifies the performance or quality of a particle (or solution) [12]. The global best position b̂(t) at time step t is defined as

  b̂(t) ∈ {b_0(t), ..., b_{n_s}(t)}  such that  f(b̂(t)) = min{f(b_0(t)), ..., f(b_{n_s}(t))},

where n_s is the total number of particles in the swarm. b̂ is the best position discovered by any of the particles so far; it is usually calculated as the best personal best position.
The global best position can also be selected from the particles of the current swarm, in which case [12,48]

  f(b̂(t)) = min{f(a_0(t)), ..., f(a_{n_s}(t))}.
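The position, velocity, personal best, and global best updates above can be sketched as a minimal global best PSO for a one-dimensional objective. This is an illustrative implementation, not the paper's code; it adds an inertia weight w (a common stabilizing variant not present in the original update) and fixes arbitrary values for the acceleration constants.

```python
import random

def pso_minimize(f, bounds, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal global best PSO for a one-dimensional objective f."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]  # a_i(0) ~ U(lo, hi)
    vel = [0.0] * n_particles
    pbest = pos[:]                            # personal best positions b_i
    pbest_f = [f(p) for p in pos]
    g = pbest[pbest_f.index(min(pbest_f))]    # global best position

    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # cognitive pull toward b_i, social pull toward the global best
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (g - pos[i]))
            pos[i] += vel[i]
            fi = f(pos[i])
            if fi < pbest_f[i]:               # update personal best
                pbest[i], pbest_f[i] = pos[i], fi
                if fi < f(g):                 # update global best
                    g = pos[i]
    return g

best = pso_minimize(lambda x: (x - 3.0) ** 2, bounds=(-10, 10))
print(best)  # converges close to the minimizer x = 3
```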
The proposed method
The main goal of the proposed method (further referred to as LSVM_PSO) is to select the penalty (cost) parameter of the error term, C, for linear SVM training in order to increase the accuracy of the method presented in [27]. The starting C value is defined by using cross-validated grid search over a predefined grid of possible C values. Figure 1 and Algorithm 1 present a modified method, initially introduced in [27], with additional steps (particularly, step 3). Algorithm 2 contains the pseudo code of the proposed LSVM_PSO method. The main idea of the method presented in [27] is based on the selection of the training data size subject to the subset of split testing data. Thus, the testing data is split into equal subsets, and the size of the training data is calculated on the basis of the size of the first subset. This is done with the intent that a smaller training dataset would significantly reduce the time and computational effort needed to train the classifier while providing similar or only slightly lower accuracy. This approach is also extended to a majority-voting-based ensemble (further referred to as CL{n}_LSVM_PSO, where n is the number of classifiers), which will be shown to improve classification performance for LSVM_PSO as well. However, it introduces additional challenges, such as deciding on the number of classifiers (LSVM_PSO) to use. The proposed method should be applicable to both binary and multi-class tasks. The algorithm and diagram of the proposed method are presented in Algorithm 3 and the corresponding figure.
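A minimal sketch of the C-selection step might look as follows. It is not the paper's implementation: the `validation_error` callback stands in for training an LSVM with a candidate C and measuring its error on the tuning split, the search box [C_GS − R, C_GS + R] follows the neighborhood idea described above, and the swarm settings (inertia weight, acceleration constants, swarm size) are arbitrary.

```python
import random

def tune_C(validation_error, c_gs, R=0.1, n_particles=10, n_iters=50, seed=1):
    """PSO search for the penalty parameter C in [c_gs - R, c_gs + R],
    where c_gs is the cross-validated grid-search estimate."""
    rng = random.Random(seed)
    lo, hi = max(1e-6, c_gs - R), c_gs + R      # keep C strictly positive
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest, pbest_f = pos[:], [validation_error(c) for c in pos]
    g = pbest[pbest_f.index(min(pbest_f))]      # global best C so far
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (0.7 * vel[i] + 1.5 * r1 * (pbest[i] - pos[i])
                      + 1.5 * r2 * (g - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))  # clamp to the box
            e = validation_error(pos[i])
            if e < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i], e
                if e < validation_error(g):
                    g = pos[i]
    return g

# Toy stand-in: pretend validation error is minimized at C = 1.05
best_c = tune_C(lambda c: (c - 1.05) ** 2, c_gs=1.0, R=0.1)
print(round(best_c, 2))  # ≈ 1.05
```

In practice the callback would wrap LSVM training on the tuning split, which dominates the runtime; the PSO bookkeeping itself is negligible.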


Dataset
To measure the performance of the introduced method, it is evaluated on the largest labeled datasets available: the Stanford Twitter sentiment corpus dataset introduced in [18], the Amazon customer reviews dataset, and the Amazon product data dataset introduced in [33]. In particular, we consider the Books, Electronics, Kindle Store, and Cell Phones and Accessories datasets from the Amazon product data. A brief description of the datasets is presented in Table 1.
The training and testing data were preprocessed and cleaned before being passed as input to the LSVM algorithm. This included removing redundant tokens, such as hashtag and @ symbols, numbers, "http" links, punctuation symbols, etc. After cleaning was performed, all datasets were checked, and empty strings were removed.
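A possible reading of this cleaning step is sketched below. The exact token rules are not fully specified in the text, so the regular expressions are assumptions (e.g., only the @ and # marker characters are stripped, not whole tokens).

```python
import re

def clean_text(text):
    """Remove tokens listed as redundant above: links, @/# marker symbols,
    numbers, and punctuation (illustrative approximation)."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"[@#]", " ", text)                   # mention/hashtag markers
    text = re.sub(r"\d+", " ", text)                    # numbers
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

docs = ["@user loved it!!! 10/10 http://t.co/abc", "   ", "#great phone"]
cleaned = [clean_text(d) for d in docs]
cleaned = [d for d in cleaned if d]   # drop empty strings after cleaning
print(cleaned)  # → ['user loved it', 'great phone']
```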

Experiments
Two experiments are performed to evaluate the proposed method against our methods (Subset30K and kmLSVM) presented in earlier works [27,28]. Ensembles of three (CL3_LSVM_PSO) and five (CL5_LSVM_PSO) classifiers, the Stanford Twitter sentiment corpus, and the Amazon customer reviews datasets are used in these experiments. During the experiments, the training data is randomly selected from the whole dataset, while the remaining part is used for testing. To obtain more objective results, 10 iterations of these experiments were performed, and the results were averaged.
Next, four experiments were done to compare the results with other authors' works. The datasets (Books, Electronics, Kindle Store, and Cell Phones and Accessories) used in [20,37,40] were selected. The descriptions of these datasets are presented in Table 1 (see Section 4.1). Although the Amazon reviews come with 5-star ratings, the aforementioned authors used only two classes of the presented datasets. According to them, 3-star ratings are considered neutral reviews, meaning neither positive nor negative; hence, instances with this class were discarded from the datasets. The remaining classes were converted to binary as follows: reviews receiving 1- or 2-star ratings were labeled as "0", whereas reviews receiving 4 or 5 stars received the label "1". As in the first two experiments, the training data is randomly selected from the whole dataset, and the remaining part is used for testing. Additionally, for the Electronics and the Cell Phones and Accessories datasets, we perform experiments where a linear SVM is used together with splitting the dataset into training (70%) and testing (30%) subsets. All experiments were performed 10 times to get more accurate results, and the average is taken as the final result.
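The star-to-binary conversion can be sketched in a few lines; the function name and the (text, stars) tuple format are our own illustrative choices.

```python
def stars_to_binary(reviews):
    """Map 1-2 star reviews to label 0, 4-5 stars to label 1,
    and discard neutral 3-star reviews, as described above."""
    out = []
    for text, stars in reviews:
        if stars == 3:
            continue          # neutral reviews are dropped
        out.append((text, 0 if stars <= 2 else 1))
    return out

data = [("awful", 1), ("meh", 3), ("good", 4), ("great", 5), ("bad", 2)]
print(stars_to_binary(data))
# → [('awful', 0), ('good', 1), ('great', 1), ('bad', 0)]
```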
The Python programming language and the scikit-learn [13] machine learning library were used to implement and evaluate the proposed method. The LinearSVC module, implemented in terms of LibLinear (a Library for Large Linear Classification), was used to implement the LSVM functionality. The data was converted to a matrix of TF-IDF (term frequency, inverse document frequency) features.
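A minimal version of this pipeline, using the scikit-learn classes named above on a toy corpus, might look as follows (the corpus, labels, and C value are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus standing in for the preprocessed review texts
train_texts = ["loved this phone", "great battery life", "awful screen",
               "terrible battery", "really great purchase", "awful, broke fast"]
train_labels = [1, 1, 0, 0, 1, 0]

vectorizer = TfidfVectorizer()            # term frequency * inverse doc. frequency
X_train = vectorizer.fit_transform(train_texts)

clf = LinearSVC(C=1.0)                    # C is the penalty parameter tuned by PSO
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["great phone", "awful battery"])
print(clf.predict(X_test))
```

The vectorizer must be fitted on the training texts only and then reused to transform the test texts, so that both share the same vocabulary.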
Table 2 shows the sizes of the training and testing data for the LSVM input. In the experimental settings, it is assumed that the testing subset is 30% and the training data 70%; thus, if the subset size is 30,000 instances (30%), the training data size calculated from the subset size is 70,000 instances (70%). All testing data is then split into subsets containing 30,000 instances each (the last subset is the remainder and may contain fewer than 30,000 instances), and the subsets are run separately, one by one, on the LSVM. The parameter R, the neighborhood of the obtained C_GS in which the PSO search is performed, is set to 0.1.
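The subset splitting described above amounts to simple chunking; the helper below is an illustrative sketch with a reduced subset size.

```python
def split_into_subsets(data, subset_size=30000):
    """Split testing data into consecutive subsets of `subset_size` instances;
    the last subset keeps the remainder and may be smaller."""
    return [data[i:i + subset_size] for i in range(0, len(data), subset_size)]

test_data = list(range(70))          # toy stand-in for 70 test instances
chunks = split_into_subsets(test_data, subset_size=30)
print([len(c) for c in chunks])  # → [30, 30, 10]
```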
The experiments were run on a computer with an Intel(R) Core(TM) i7-4712MQ CPU @ 2.30 GHz and 16.00 GB of installed memory (RAM).

Performance evaluation
Effectiveness is measured using statistical measures often used for similar tasks, particularly accuracy (ACC), precision (positive predictive value PPV and negative predictive value NPV), recall (true positive rate TPR and true negative rate TNR), and the harmonic mean of PPV and TPR (F1 score). The formulas are presented below [39]:

  ACC = (TP + TN) / (TP + TN + FP + FN),
  PPV = TP / (TP + FP),   NPV = TN / (TN + FN),
  TPR = TP / (TP + FN),   TNR = TN / (TN + FP),
  F1 = 2 · PPV · TPR / (PPV + TPR),

where TP is the count of correctly classified "positive" sentiments, TN is the count of correctly classified "negative" sentiments, FP is the count of incorrectly classified "positive" sentiments, and FN is the count of incorrectly classified "negative" sentiments. The area under the receiver operating characteristic curve (AUC) is also used to measure the quality of the model predictions.
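These measures can be computed directly from the four confusion-matrix counts; the helper below is a straightforward sketch (degenerate cases with zero denominators are not handled).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the effectiveness measures used above from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    ppv = tp / (tp + fp)          # precision, positive predictive value
    npv = tn / (tn + fn)          # negative predictive value
    tpr = tp / (tp + fn)          # recall for the positive class
    tnr = tn / (tn + fp)          # recall for the negative class
    f1 = 2 * ppv * tpr / (ppv + tpr)
    return {"ACC": acc, "PPV": ppv, "NPV": npv, "TPR": tpr, "TNR": tnr, "F1": f1}

m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(m["ACC"], round(m["F1"], 4))  # → 0.85 0.8421
```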

Results
All experiments described in Section 4.2 were performed, and the results were compared. Table 3 gives the average results for the proposed method in comparison with Subset30K and kmLSVM when the Stanford Twitter sentiment corpus and the Amazon customer reviews datasets were used. The distribution of results per iteration is visually depicted in Figs. 3 and 4.
The quality measure of the model's predictions (AUC) clearly shows that the proposed method and its ensembles outperform the previously proposed methods (Subset30K and kmLSVM) on both datasets: the Stanford Twitter sentiment corpus dataset and the Amazon customer reviews dataset.
The other metrics (accuracy, PPV, NPV, TPR, TNR, and F1 score) also show the superiority of the proposed method compared with Subset30K on both datasets.
The results of LSVM_PSO were also marginally better than those of kmLSVM in terms of accuracy, PPV, and TNR on the Stanford Twitter sentiment corpus dataset, while LSVM_PSO lost slightly in terms of NPV, TPR, and F1 score. On the Amazon customer review dataset, LSVM_PSO performed slightly better in terms of accuracy, NPV, TNR, and F1 score. In the case of LSVM_PSO ensembles, the results are better in all metrics compared with kmLSVM and the single LSVM_PSO. It is difficult to explicitly compare the results obtained with the results in other papers due to discrepancies in implementations, parameters, or tasks. Therefore, Table 4 presents the results of a comparative analysis based on accuracy, where the proposed method is applied to the same datasets with the same number of classes. The dataset splits into training and testing data are different and demonstrate that sufficient accuracy can be obtained using a smaller training subset.
Table 4 shows that LSVM_PSO and its ensemble of three classifiers, CL3_LSVM_PSO, resulted in higher accuracy compared to [37] and [40] when applied to the largest Books dataset and the Kindle Store dataset [37]. The proposed method and its ensembles also outperform CNN-S(+), CFM, and PFM from [44] when applied to the Cell Phones and Accessories dataset.
However, it performed slightly worse than [20], which reported higher accuracy on the Electronics dataset and the Cell Phones and Accessories dataset. The introduced method was also slightly outperformed by the ordinary linear SVM when testing on the Electronics dataset and the Cell Phones and Accessories dataset. Yet, this approach proved to be efficient at selecting the cost parameter C for the methods used in [27,28], as well as for the ordinary linear SVM classifier, and allowed them to become competitive with the works of other authors.

Conclusions and future work
This paper explores the application of the particle swarm optimization metaheuristic to obtain an optimal cost parameter C for linear support vector machines. The main goal was to identify principles to increase the accuracy of our methods presented in previous papers [27,28]. The results obtained with LSVM_PSO are comparable with the performance of the aforementioned methods and showed improvements in all effectiveness metrics. Further, it was observed that ensembling the results of multiple LSVM_PSO classifiers, obtained with different subsets of the dataset, resulted in improved accuracy compared with a single classifier. It was also shown that the proposed method can be applied separately to an ordinary linear SVM.
The main advantage of the introduced method is that it can easily be applied to any linear SVM instance for textual data classification tasks on large datasets and performs faster than an ordinary linear SVM when used in combination with our method presented in [27]. In this paper, we found that using only 70,000 instances for training, instead of more than 20 million (Books dataset), to develop the classifier still resulted in performance comparable to [37,40,44], and the results obtained are also competitive with state-of-the-art models.
There are several directions for further work on the proposed method. First, additional testing would be required to optimize the proposed method for practical applications. This includes finding the optimal number of classifiers in the ensemble in terms of the trade-off between performance and training/testing time, finding the optimal subset size, and optimizing the implemented particle swarm optimization method for faster convergence. Further, the method will be tested on multiclass classification tasks using the functionality of the built-in SVM implementation.

Figure 4. The Amazon customer reviews dataset.

Notation:
|D_train| — number of instances in D_train;
|D_test| — number of instances in D_test;
R_LSVM — set of LSVM results;
LSVM_sent — class of sentiment;
Train_count — count of text instances of a certain class to be selected from the dataset, Train_count = (1/k) · Subset_size · |D_train| / |D_test|;
k — number of different classes.

1. Run data preprocessing on the training and testing data.
2. Randomly select data of each presented class in the training data:
   class_1 ← random.sample(class_1, Train_count)
   class_2 ← random.sample(class_2, Train_count)
   ...
   class_k ← random.sample(class_k, Train_count)
   D_train ← class_1 ∪ class_2 ∪ ... ∪ class_k
3. Split D_train into training data for tuning and testing data for tuning; C ← LSVM_PSO.
4. Train LSVM(C) with D_train.
5. Split the testing data into subsets and run them on the LSVM.
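Step 2, the class-balanced subsampling with Train_count = (1/k) · Subset_size · |D_train| / |D_test|, can be sketched as follows; the function name and the (text, label) data layout are our own illustrative choices.

```python
import random

def balanced_train_sample(instances, subset_size, n_test, seed=0):
    """Draw Train_count instances per class from the training pool,
    where Train_count = (1/k) * subset_size * |D_train| / |D_test|."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in instances:                 # group the pool by class
        by_class.setdefault(label, []).append((text, label))
    k = len(by_class)                              # number of different classes
    train_count = int(subset_size * len(instances) / n_test / k)
    d_train = []
    for members in by_class.values():
        d_train.extend(rng.sample(members, train_count))
    return d_train

pool = [(f"doc{i}", i % 2) for i in range(200)]   # toy pool: 100 docs per class
d_train = balanced_train_sample(pool, subset_size=30, n_test=100)
print(len(d_train))  # → 60, i.e. 2 classes * int(30 * 200 / 100 / 2) instances
```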

Table 1. The description of datasets.

Table 2. Dataset splits in experiments.

Table 3. Results of the proposed method.

Table 4. Results comparison.