SVM Model Selection Using PSO for Learning Handwritten Arabic Characters

Abstract: Using a Support Vector Machine (SVM) requires the selection of several parameters, such as the multi-class strategy (one-against-all or one-against-one), the regularization parameter C, and the kernel function and its parameters. The choice of these parameters has a great influence on the performance of the final classifier. This paper considers the grid search method and the particle swarm optimization (PSO) technique, which make it possible to quickly scan a large space of SVM parameters. A comparative study of the SVM models is also presented, examining the convergence speed and the results of each model. SVM is applied to learning handwritten Arabic characters, with a database containing 4840 Arabic characters in their different positions (isolated, beginning, middle and end). Some very promising results have been achieved.


Introduction
Research on Arabic character recognition is a rapidly expanding field whose relevance is undisputed by the research community, which has devoted considerable effort to reducing its constraints and widening its scope. Among the techniques used for Arabic handwriting recognition is the SVM, introduced in the early 1990s by Boser et al. [Boser, Guyon and Vapnik (1992); Cortes and Vapnik (1995)], which has been very successful in many areas of machine learning. Today, it can be said without exaggeration that these machines have replaced neural networks and other learning techniques in many applications. The adjustment of the hyper-parameters of the SVM classifier is a crucial step in building an effective recognition system. For a long time, model selection was carried out by a "grid search" method, in which a systematic search is implemented by discretizing the parameter space with a fixed step [Xiao, Ren, Lei et al. (2014); Wojciech, Sabina and Andrzej (2015)]. More recently, model selection has been treated as an optimization task: an optimization algorithm is run to find the hyper-parameter values that achieve the best classification performance. Among the existing optimization algorithms, the gradient descent method has often been used for SVM model selection [Ayat, Cheriet and Suen (2005); Jiang and Siddiqui (2019)]. Metaheuristic techniques have also been applied: genetic algorithms [Sun, Guo, Wang et al. (2017); Phan, Nguyen and Bui (2017)], evolution strategies [Liu, Liu, Yang et al. (2006); Phienthrakul and Kijsirikul (2010)] and tabu search [Zennaki, Mamouni and Sadouni (2013); Corazza, Di Martino, Ferrucci et al. (2013)] have been used to find the best configuration of SVM parameters. In this work, the PSO technique is adapted for parameter selection in order to maximize the cross-validation accuracy, and a comparative study between different SVM models is presented.
The rest of the article is organized as follows. Section 2 presents the SVM and the two multiclass approaches one-against-all and one-against-one. The PSO method is described in Section 3. Section 4 provides a brief description of the proposed recognition system. Section 5 describes the experimental results. Finally, Section 6 draws the conclusions of this study.

Support vector machine
Originally, SVM handles binary classification (two classes). Consider a learning base S composed of m input vectors x_i whose classes are known in advance, represented by the outputs y_i ∈ {−1, 1}. It is therefore sufficient to know the sign of the classifier output to determine the class of an example. The output value of the binary classifier is given by:

f(x) = sign(∑_{i=1}^{m} α_i y_i K(x_i, x) + b) (1)

with ∀i ∈ {1, …, m}, 0 ≤ α_i ≤ C and ∑_{i=1}^{m} α_i y_i = 0,

where K is the kernel function, C the regularization coefficient and α_i the Lagrange coefficients. The SVM learning algorithm aims to find the hyperplane of maximum geometric margin that separates the data in variable space, as shown in Fig. 1. Vapnik [Vapnik (1998)] was the first to introduce hyperplane concepts into support vector algorithms. To determine the hyperplane equation, the problem is modeled as a mathematical program that maximizes the geometric margin between the data while requiring the correct classification of the training set. The effectiveness of the SVM algorithm comes from the combination of two relevant ideas. The first is the mapping of the input variables into another feature space; this change simplifies the construction of nonlinear classifiers, since only hyperplanes in the feature space are used. The second consists in constructing separating hyperplanes in the feature space with the widest possible geometric margin [Vapnik (1998)]. Moreover, the SVM approach rests on a statistical foundation, a theory that readily justifies its statements. The choice of the kernel function K is very important; it must respect certain conditions and correspond to a scalar product in a high-dimensional space. The conditions that K must satisfy to be a kernel function are as follows: it must be symmetric and positive semi-definite. The simplest kernel function is the linear kernel:

K(x, x') = x · x' (2)

In this case, the linear classifier is used without changing the space.
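As an illustration, the linear kernel, together with the polynomial and Gaussian kernels discussed next, can be written down directly. The following is a minimal NumPy sketch; the parameter names gamma, c and d are notational assumptions, not taken from the paper:

```python
import numpy as np

def linear_kernel(x, z):
    # Eq. (2): K(x, z) = x . z -- no change of space
    return float(np.dot(x, z))

def polynomial_kernel(x, z, gamma=1.0, c=1.0, d=3):
    # Polynomial kernel: K(x, z) = (gamma * x . z + c)^d
    return float((gamma * np.dot(x, z) + c) ** d)

def rbf_kernel(x, z, gamma=0.1):
    # Gaussian (RBF) kernel: K(x, z) = exp(-gamma * ||x - z||^2)
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(linear_kernel(x, z))  # 11.0
```

Each function corresponds to a scalar product in some (possibly high-dimensional) feature space, which is what makes it a valid kernel.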
The kernel approach generalizes the linear approach; the linear kernel is sometimes used to evaluate the difficulty of a problem. The kernels commonly used with SVM are expressed as follows:

Polynomial: K(x, x') = (γ x · x' + c)^d (3)
Gaussian (RBF): K(x, x') = exp(−γ ||x − x'||²) (4)

Multi-class extensions
SVMs are binary in their origin. However, real-world problems are in most cases multi-class. Therefore, multi-class SVMs reduce the problem to a composition of several two-class hyperplanes that draw the decision boundaries between the different classes. The principle is to decompose the examples into several subsets, each of which represents a binary classification problem; a separating hyperplane is determined for each problem by the binary SVM classifier. There are several decomposition methods in the literature; the most commonly used are:

One-against-all
This is the simplest and oldest method. Following Vapnik's formulation [Vapnik (1998)], the idea is to determine for each class k a hyperplane (w_k, b_k) separating it from all the other classes. Class k is considered the positive class (+1) and the other classes the negative class (−1); for a problem with K classes, K binary SVMs are thus obtained. Fig. 2 shows a case of separation of three classes.

One-against-one
This approach consists in using a classifier for each pair of classes. It discriminates each class from every other class; thus K(K−1)/2 decision functions are learned. For each pair of classes (k, s), the method defines a binary decision function. The assignment of a new example is done by majority voting: the example is tested by computing its decision function for each hyperplane, and each test yields a vote for the class to which the example is assigned (the winning class).
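Both decomposition schemes can be tried out with scikit-learn, which implements them as wrappers around a binary classifier (this library choice is an assumption of the sketch; the paper does not specify its implementation). On a 10-class problem, one-against-all builds 10 binary SVMs while one-against-one builds 10(10−1)/2 = 45:

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # K = 10 classes

# One-against-all: one binary SVM per class -> K classifiers
ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
# One-against-one: one binary SVM per pair of classes -> K(K-1)/2 classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)

print(len(ova.estimators_), len(ovo.estimators_))  # 10 45
```

Although one-against-one trains more classifiers, each is trained on only two classes' worth of data, which is why it is often faster in practice, as the experiments below confirm.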

Particle swarm optimization (PSO)
PSO is a stochastic optimization method developed by Eberhart and Kennedy in 1995 [Kennedy and Eberhart (1995)]. It was originally inspired by the living world, more specifically the social behavior of swarming animals such as schools of fish and flocks of birds [Li and Li (2014)]. The method is based on a set of individuals called particles, initially placed at random, which move in the search space. Each particle represents a solution to the problem and has a position x_i(t) and a velocity v_i(t). In addition, each particle has a memory that contains its best visited position p_i and the best position G among the positions of all particles. The evolution equations of the algorithm are as follows:

v_i(t+1) = ω v_i(t) + c1 r1 (p_i − x_i(t)) + c2 r2 (G − x_i(t)) (7)
x_i(t+1) = x_i(t) + v_i(t+1) (8)

where ω represents the inertia coefficient, the coefficients c1 and c2 are constants defined empirically according to the relationship c1 + c2 ≤ 4, and r1 and r2 are random positive numbers following a uniform distribution over [0, 1] [Kennedy and Eberhart (1995)]. The movement of a particle is influenced by the following three components:
1. Inertia (ω v_i(t)): the particle tends to follow its current direction;
2. Cognitive component (c1 r1 (p_i − x_i(t))): the particle tends to move towards the best position it has already visited;
3. Social component (c2 r2 (G − x_i(t))): the particle tends to move towards the best position already reached by all the particles of the swarm.
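The update rules above can be sketched in a few lines. The following toy example (a NumPy illustration, not the paper's code) minimizes the sphere function f(x) = ||x||² using the coefficient values ω = 0.7298 and c1 = c2 = 1.4962 employed later in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, pbest, gbest, w=0.7298, c1=1.4962, c2=1.4962):
    # One PSO iteration following Eqs. (7) and (8):
    # velocity = inertia + cognitive component + social component
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v

# Toy objective: minimize f(x) = ||x||^2 over a 2-D search space
N, D = 5, 2
x = rng.uniform(-5, 5, (N, D))
v = np.zeros((N, D))
pbest, pbest_f = x.copy(), np.sum(x ** 2, axis=1)
gbest = pbest[np.argmin(pbest_f)].copy()
for _ in range(300):
    x, v = pso_step(x, v, pbest, gbest)
    f = np.sum(x ** 2, axis=1)
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = x[improved], f[improved]
    gbest = pbest[np.argmin(pbest_f)].copy()
print(gbest)  # close to the optimum (0, 0)
```

The same loop structure carries over directly to SVM model selection; only the objective function changes.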

PSO for SVM model selection
The number of SVM model parameters depends on the type of kernel and its parameters. For example, the polynomial kernel has three parameters (γ, c, d); therefore, the SVM model has four parameters once the regularization parameter C is added, and in this case the search space is of dimension D = 4. The PSO algorithm maintains a set of particles; each particle represents a candidate solution to the studied problem and is considered as an object with the following characteristics.
• The current position x_i = (x_i1, …, x_iD), where x_i1, …, x_iD represent the values of the SVM model parameters and D the dimension of the search space.
• The current velocity v_i = (v_i1, …, v_iD) is used to gradually modify the values of the SVM model parameters according to Eqs. (7) and (8).
• The best position visited by the particle, p_i = (p_i1, …, p_iD), stores the best parameter values found by the particle (a previous value of x_i).
• The best position among all particle positions, G = (g_1, …, g_D), represents the best values of the SVM parameters found so far.

The algorithm is expressed as follows:
Initialize the number of particles N.
Randomly initialize the position and velocity of each particle.
Set the maximum number of iterations T_max.
Repeat
  For each particle i
    Evaluate the fitness of x_i (cross-validation accuracy).
    Update p_i and G if better positions are found.
    Update v_i and x_i according to Eqs. (7) and (8).
  End for
  t = t + 1
Until (t > T_max)
Return G
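The algorithm above can be sketched as follows for an RBF model with parameters (C, γ), using 5-fold cross-validation accuracy as the fitness. This is an illustrative scikit-learn implementation on a stand-in dataset, not the paper's system; the upper bound on log10(C) is reduced here so the sketch runs quickly:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in dataset (the paper's Arabic character database is not public here)
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

rng = np.random.default_rng(1)

# Search space for the RBF model: (log10 C, log10 gamma).
# The paper searches log10(C) in [0, 15]; reduced here for speed.
LOW, HIGH = np.array([0.0, -15.0]), np.array([6.0, 2.0])

def fitness(p):
    # 5-fold cross-validation accuracy: the quantity PSO maximizes
    C, gamma = 10.0 ** p
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()

N, T = 5, 3  # 5 particles as in the experiments; few iterations for the sketch
x = rng.uniform(LOW, HIGH, (N, 2))
v = np.zeros((N, 2))
pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
g = pbest[np.argmax(pbest_f)].copy()
for _ in range(T):
    r1, r2 = rng.random((N, 2)), rng.random((N, 2))
    v = 0.7298 * v + 1.4962 * r1 * (pbest - x) + 1.4962 * r2 * (g - x)
    x = np.clip(x + v, LOW, HIGH)      # reposition escaping particles at the boundary...
    v[(x == LOW) | (x == HIGH)] = 0.0  # ...and set their velocity to zero
    f = np.array([fitness(p) for p in x])
    better = f > pbest_f
    pbest[better], pbest_f[better] = x[better], f[better]
    g = pbest[np.argmax(pbest_f)].copy()
print("best (log10 C, log10 gamma):", g, "CV accuracy:", round(pbest_f.max(), 3))
```

Searching in log10 space is a design choice assumed here: the useful values of C and γ span many orders of magnitude, so a linear-scale search would waste most particles on one end of the range.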

Recognition system of handwritten Arabic characters
In the context of this study, a character recognition system has been developed. This section provides a brief description of the database and the technique used to extract the features. A zoning method [… (2017)] was implemented: the principle is to divide the image into 49 zones, i.e., blocks of 10×10 pixels, and then count the number of black pixels in each zone. Fig. 5 shows, for example, the letter 'jim', written in Arabic ‫'ج'‬ in its isolated form. The selection of parameters is typically performed by minimizing the generalization error, so k-fold cross-validation was used: the learning set is divided into k disjoint subsets of the same size, the classifier is trained on k−1 subsets and tested on the k-th part, this process is repeated k times, and the cross-validation accuracy is obtained by averaging the k resulting accuracies. In these experiments, k = 5 (5-fold) was used. First, the grid search method for SVM model selection was applied, in order to compare its results with those of the PSO method. Since grid search is only practical for models that use few parameters (≤ 2), it was implemented with the RBF and Laplacian kernels. There are two parameters, C and γ; the range of C used is [10^0, 10^1, …, 10^14, 10^15] and the range of γ is [10^-15, 10^-14, …, 10^1, 10^2], for 16×18 = 288 iterations. The results obtained for the one-against-one approach show that the region of good results (accuracy > 90%) of the Laplacian kernel is large compared to that of the RBF kernel, yet the latter gave the best results. The same can be said of the two SVM approaches used: the region of good results of the one-against-all approach is large compared to that of one-against-one, yet the results of the latter are the best. During these experiments, the CPU time (duration) was measured at each iteration; the results obtained are presented in Fig. 10.
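The zoning feature extraction described at the start of this section can be sketched as follows, assuming 70×70 binary images (implied by 49 zones of 10×10 pixels) with black pixels coded as 1:

```python
import numpy as np

def zoning_features(img, block=10):
    """Count black pixels (coded 1) in each block x block zone of a
    binary character image; a 70x70 image gives 49 features."""
    h, w = img.shape
    return np.array([int(img[i:i + block, j:j + block].sum())
                     for i in range(0, h, block)
                     for j in range(0, w, block)])

# Toy image: only the top-left zone is entirely black
img = np.zeros((70, 70), dtype=int)
img[0:10, 0:10] = 1
f = zoning_features(img)
print(f.shape, f[0])  # (49,) 100
```

Each character image is thus reduced to a 49-dimensional vector of pixel counts, which serves as the input to the SVM classifiers.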
The results show that the one-against-one approach (1×1) is faster than the other approach (1×N), and the Laplacian kernel is faster than the RBF. The one-against-all approach becomes very slow when the γ parameter takes large values such as 1, 10 and 100; this is evident in Fig. 10 at iteration 18 and its multiples. Then, for the selection of the SVM model parameters, the PSO method was used, in particular the gbest model [Frans (2006)]. For these experiments, 5 particles were used with the coefficient values ω = 0.7298 and c1 = c2 = 1.4962, which ensure convergence [Dang and Luong (2011)]. Each particle encodes the SVM model parameters; the number of these parameters varies according to the type of kernel. The values considered are C ∈ [10^0, 10^15], γ ∈ [10^-15, 10^2], c ∈ [−10, 10^2] and d ∈ [1, 10]. If a particle exceeds the search space boundary, it is repositioned at the boundary and its velocity is set to zero. According to the results obtained for the one-against-one approach, the linear kernel converges rapidly towards the optimum, because the SVM model has only one parameter (C) and the value of this parameter does not significantly affect the model's behavior. On the other hand, the sigmoid kernel is the slowest because in this case there are three parameters and the model is very sensitive to changes in these parameters. The best result in terms of recognition rate is obtained with the RBF kernel in the one-against-one approach. Finally, it can be noted that grid search is an effective technique for providing an overview of the search space (it allows promising regions to be extracted), but to find good results it is necessary to use a local search or to launch the grid search a second time in the promising area; this method requires a lot of time to explore the search space. The PSO, on the other hand, produces good results quickly.
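For comparison with the PSO experiments, a grid search over (C, γ) of the kind described above can be written with scikit-learn's GridSearchCV. This is an assumed implementation on a stand-in dataset, with a reduced grid so the sketch runs quickly:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in dataset; the paper's database is not public here
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Reduced version of the paper's 16x18 grid over (C, gamma)
param_grid = {"C": (10.0 ** np.arange(0, 4)).tolist(),
              "gamma": (10.0 ** np.arange(-6, 0)).tolist()}

# 5-fold cross-validation at every grid point, as in the experiments
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The cost of grid search grows multiplicatively with each added parameter, which is why the text restricts it to models with at most two; PSO has no such restriction.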

Conclusions
Based on the experimental results obtained, the use of the SVM approach for Arabic character recognition is strongly recommended due to its superior generalization ability when classifying high-dimensional data, even with a large number of classes (in this case study: 96 classes). It is also important to note that the RBF kernel is the most suitable for the recognition of handwritten Arabic characters; indeed, this kernel gave better results than the other kernels, with a cross-validation accuracy of 95.49% (SVM one-against-one, γ = 1.75E−05). Finally, the PSO proved more effective than grid search in selecting SVM models.