Alternative Strategies in Learning Nonlinear Soft Margin Support Vector Machines

The aims of the paper are threefold: to propose a new method for determining a suitable value of the bias of the soft margin SVM classifier, to experimentally evaluate the quality of the resulting value against one of the standard expressions of the bias computed in terms of the support vectors, and to propose a variant of Platt's SMO algorithm for computing an approximation of the optimal solution of the SVM QP-problem. The new method for computing a more suitable value of the bias is based on genetic search. In order to evaluate the quality of the proposed method from the point of view of recognition and generalization rates, several tests were performed; some of the results are reported in the final section of the paper.


Introduction
Assume that the aim is to discriminate among the instantiations of m concepts or classes, conventionally represented by labels. Each instantiation comes from one and only one concept, referred to as the true provenance class. The usual representation of instantiations is in terms of the values of a pre-selected finite set of n descriptors or attributes. The aim is to find a set of boundaries separating each pair of classes based on the information supplied by a finite set of examples coming from these classes, conventionally referred to as a training set. The framework and the corresponding methodology strongly depend on the additional information concerning the set of examples. In a supervised framework, for each example the label corresponding to the provenance class is supplied. Therefore, in this case, the basis for deriving a set of separating boundaries is represented by a finite set of labeled examples (x, y), where x is the representation of the particular example and y is a code representing the label of the provenance class of x.
Let us assume that we somehow found a suitable set of boundaries that correctly separates the available set of examples. Since the information concerning the classes is exclusively contained in the finite training set, there is no guarantee that the boundaries are "good enough" in the sense that, using them, the provenance class of a new, yet unseen example can be inferred. In other words, the problem of generalization capacities arises in a very natural way. Usually, there is no hint concerning the functional expression one should consider for the separating boundaries. Consequently, we could try to propose some parameterized expressions and fit the parameters against the particular training set. Obviously, the simplest expression of the boundaries is of linear type but, unfortunately, it seldom happens that the provenance classes can be separated by linear boundaries; moreover, often even the available training set cannot be correctly separated this way. In the following we consider the binary case, that is, the task is to find a suitable separating boundary for two classes. Moreover, the aim is to find a separating boundary of linear type, that is, a hyperplane in the space of examples that correctly separates the positive examples from the negative ones when this is possible, and minimizes the number of misclassifications otherwise. Besides, in order to assure good generalization capacities, we would like to find a separating hyperplane placed at almost equal distance to the positive and the negative examples respectively. In order to assure this property, a modern and powerful methodology has been developed lately, yielding the theory of Support Vector Machines (SVM). The fundamentals of this theory were established by Vapnik ([1], [2]). Several refinements have been proposed by many authors, as for instance the use of kernels as a tool to maximize the quantity of information extracted from the training data ([3], [4], [5], [6], [7]). The aims of the
paper are threefold: to propose a new method for determining a suitable value of the bias of the soft margin SVM classifier, to experimentally evaluate the quality of the resulting value against one of the standard expressions of the bias computed in terms of the support vectors, and to propose a variant of Platt's SMO algorithm for computing an approximation of the optimal solution of the SVM QP-problem. The new method for computing a more suitable value of the bias is based on genetic search. In order to evaluate the quality of the proposed method from the point of view of recognition and generalization rates, several tests were performed; some of the results are reported in the final section of the paper.

Soft Margin SVM

Let S = {(x_i, y_i), 1 ≤ i ≤ N}, x_i ∈ R^n, y_i ∈ {-1, 1}, be the training dataset. We say that S is linearly separable if there exists a hyperplane in the space of inputs separating the positive from the negative examples. There are few known methods that allow to establish whether S is linearly separable or not, one of them being the celebrated Ho-Kashyap procedure [8], and the computational complexity involved by these methods is substantially high. Consequently, methods for estimating the parameters of the separating hyperplane that face the possibility that S is not linearly separable are looked for. The use of kernels is one such method, a kernel "hiding" a not explicitly given nonlinear transform that projects the input data onto a new, higher dimensional space, in the hope that, on one hand, more information can be extracted from the data this way and, on the other hand, the subsets of positive and negative examples, and possibly the representations of the classes, become linearly separable.
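As a lightweight illustration of such a separability check (not the Ho-Kashyap procedure itself), the classical perceptron rule can be run on a small dataset: if an epoch passes with no updates, the data are linearly separable. The epoch budget makes it only a heuristic, since non-convergence within the budget proves nothing; the sketch below, including all names, is ours.

```python
def perceptron_separable(X, y, epochs=1000, lr=1.0):
    """Heuristic linear-separability check: run the perceptron rule on
    labels y in {-1, +1}; an epoch with no updates means a separating
    hyperplane (w, b) has been found."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            act = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * act <= 0:                      # misclassified: update
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
                updated = True
        if not updated:
            return True, (w, b)                    # converged: separable
    return False, (w, b)                           # inconclusive

# A separable toy set: two points per class in R^2
X = [[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 2.0]]
y = [-1, -1, 1, 1]
sep, (w, b) = perceptron_separable(X, y)
```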

The Kernel-Based Learning Theory
One of the fundamental mathematical results underlying the kernel-based learning theory is the celebrated Mercer theorem.

Definition. Let A be a compact subset of R^n, for some n ∈ N*, and K: A × A → R. K is said to be a positive defined kernel on A if the following conditions hold:
C1. for any finite number N, any finite set of points x_1, ..., x_N ∈ A and any c_1, ..., c_N ∈ R, Σ_{i,j=1}^{N} c_i c_j K(x_i, x_j) ≥ 0;
C2. K is symmetric, that is K(x, x') = K(x', x) for any x, x' ∈ A.

If K is a positive defined kernel on A, then it induces the integral operator L_K given by (L_K f)(x) = ∫_A K(x, x') f(x') dx'. The integral operator L_K is called the Hilbert-Schmidt operator induced by the kernel K. It can be proved (Mercer, 1908) that L_K is a self-adjoint, positive, compact operator having a countable system of non-negative eigenvalues (λ_k), with corresponding eigenfunctions (φ_k).

Theorem (Mercer). Let A be a compact subset of R^n and K a continuous symmetric function such that C1 and C2 hold. Then K(x, x') = Σ_k λ_k φ_k(x) φ_k(x'), where the series converges absolutely for each pair (x, x') and uniformly on each compact subset of A.

The expansion gives rise to the feature extractor g: A → F, g(x) = (√λ_1 φ_1(x), √λ_2 φ_2(x), ...), where F is called the feature space; g(x) is the representation of the example x in the feature space, each eigenfunction φ_k being conventionally referred to as a selected feature. By construction, the dimensionality of F is determined by the number, finite or denumerably infinite, of positive eigenvalues of the kernel K. In the particular case when the number of positive eigenvalues of K is finite, say m, the dimensionality of F equals m; conventionally g(x) is represented as an m-dimensional column vector and, for any x, x' ∈ A, K(x, x') = g(x)^T g(x'). For simplicity's sake, in the more general case when the dimensionality of F is infinite, we extend the notation ⟨g(x), g(x')⟩ to represent the inner product defined on F by the series Σ_k λ_k φ_k(x) φ_k(x').

Conversely, if g: A → F is a particular selected feature extractor, then the function K(x, x') = ⟨g(x), g(x')⟩ is a semi-positive defined kernel. The kernel "trick" consists in assuming a particular expression for a positive defined kernel K, as for instance a polynomial or exponential expression. According to the Mercer theorem, there exists a feature extractor g such that K(x, x') = ⟨g(x), g(x')⟩ holds, where neither the explicit functional expression of g nor the dimensionality of F is known. However, this information is not really needed, because the computations involving the kernel K are carried out in the initial n-dimensional space.
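The identity K(x, x') = ⟨g(x), g(x')⟩ can be verified numerically for a kernel whose feature map is known in closed form, e.g. the degree-2 polynomial kernel on R^2; a toy check, not part of the paper:

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel on R^2: K(x, z) = (<x, z> + 1)^2."""
    return (x[0] * z[0] + x[1] * z[1] + 1.0) ** 2

def feature_map(x):
    """Explicit feature extractor g for this kernel:
    g(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1)."""
    r2 = math.sqrt(2.0)
    return [x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1], r2 * x[0], r2 * x[1], 1.0]

x, z = [1.0, 2.0], [3.0, -1.0]
lhs = poly_kernel(x, z)                                           # input space
rhs = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))  # feature space
# lhs and rhs coincide: K(x, z) = <g(x), g(z)>
```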
Note that the value of K(x, x') increases as x and x' become "closer", that is, the kernels given in Table 1 correspond to some similarity measures on R^n.
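Condition C1 above (positive semi-definiteness of every Gram matrix) can also be probed numerically, by evaluating the quadratic form Σ_{i,j} c_i c_j K(x_i, x_j) for random coefficient vectors; a small sketch with the Gauss kernel (our own illustration):

```python
import math, random

def grbf(x, z, sigma=1.0):
    """Gauss RBF kernel, positive defined on R^n."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2 * sigma ** 2))

rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(6)]
G = [[grbf(xi, xj) for xj in X] for xi in X]   # Gram matrix

# C1: the quadratic form sum_{i,j} c_i c_j K(x_i, x_j) is never negative
for _ in range(100):
    c = [rng.gauss(0, 1) for _ in X]
    q = sum(c[i] * c[j] * G[i][j] for i in range(len(X)) for j in range(len(X)))
    assert q >= -1e-9
```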

Table 1. The comparative analysis of the recognition rates

  Method                                                                    Recognition rate
  Linear discriminant function                                                    83%
  Quadratic discriminant function                                                 84.50%
  Mahalanobis-type discriminant function                                          79%
  Soft margin SVM - SMO algorithm, Gauss kernel, bias given by (10)               87.25%
  Soft margin SVM - SMO algorithm, Gauss kernel, bias computed by
    the genetic algorithm                                                         87.75%
  Soft margin SVM - SMO algorithm, exponential kernel, bias given by (10)         84.25%
  Soft margin SVM - SMO algorithm, exponential kernel, bias computed by
    the genetic algorithm                                                         87.50%

The non-linear soft margin SVM
The non-linear transform is a vector-valued function g: R^n → F, the image of the training dataset in the feature space F being given by {(g(x_i), y_i), 1 ≤ i ≤ N}, y_i ∈ {-1, 1}. The transform g is a feature extractor and F is the feature space. When the dimension of F is finite, say m, a pair (w, b), w ∈ R^m, b ∈ R, defines the classifier h_{w,b}(x) = sign(⟨w, g(x)⟩ + b) for any input data x.
The feature extractor g is designed as follows. Let K be a kernel function that "hides" the explicit functional expression of g, in the sense that K(x, x') = ⟨g(x), g(x')⟩. Then the evaluation of the classifier is performed exclusively in terms of K, and the resulting feature space need not be explicitly known. The core result in approaches of this type is the celebrated theorem due to Mercer [9]. The most frequently used kernel functions are the polynomial kernel, the Gauss Radial Basis Function (RBF) and the exponential RBF, their expressions being

  K(x, x') = (⟨x, x'⟩ + 1)^d,
  K(x, x') = exp(-||x - x'||² / (2σ²)),
  K(x, x') = exp(-||x - x'|| / (2σ²)),

respectively.
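The three kernels above can be sketched as follows (σ and d are the usual kernel parameters; note that the exponential RBF uses the distance itself rather than its square):

```python
import math

def poly(x, z, d=2):
    """Polynomial kernel: K(x, z) = (<x, z> + 1)^d."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** d

def grbf(x, z, sigma=1.0):
    """Gauss RBF: K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * sigma ** 2))

def erbf(x, z, sigma=1.0):
    """Exponential RBF: K(x, z) = exp(-||x - z|| / (2 sigma^2))."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
    return math.exp(-dist / (2.0 * sigma ** 2))
```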
Let us assume that for a selected kernel K, the training dataset is non-linearly separable. Let us denote by g the feature extractor such that K(x, x') = ⟨g(x), g(x')⟩ and by h_{w,b} a linear classifier in the feature space of parameters (w, b), that is, for the input x, h_{w,b}(x) = 1 if and only if ⟨w, g(x)⟩ + b ≥ 0. The model of the soft margin SVM assumes a set of slack variables ξ_i ≥ 0, 1 ≤ i ≤ N, where ξ_i expresses the magnitude of the error committed by the classifier for the observation (x_i, y_i), that is y_i(⟨w, g(x_i)⟩ + b) ≥ 1 - ξ_i. For any misclassified example (x_i, y_i), the value of ξ_i expresses the magnitude of the error committed by the classifier with respect to (x_i, y_i). The overall importance of the cumulated errors can usually be expressed as F(Σ_i ξ_i^t), where F is a convex and monotone increasing function and t is a weight parameter. We obtain the QP-problem [10]

  minimize (1/2)||w||² + C F(Σ_{i=1}^{N} ξ_i^t)
  subject to y_i(⟨w, g(x_i)⟩ + b) ≥ 1 - ξ_i, ξ_i ≥ 0, 1 ≤ i ≤ N,   (2)

where C is a conventionally selected constant used to weight the effect of the cumulated errors. Given its complexity, the problem (2) cannot be solved in this general form, but only for particular functional expressions of F and of the weight parameter t. The simplest model uses F(z) = z and t = 1; in this case the problem (2) becomes the constrained QP-problem

  minimize (1/2)||w||² + C Σ_{i=1}^{N} ξ_i
  subject to y_i(⟨w, g(x_i)⟩ + b) ≥ 1 - ξ_i, ξ_i ≥ 0, 1 ≤ i ≤ N,   (3)

whose dual QP-problem is

  maximize Σ_{i=1}^{N} α_i - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j)
  subject to 0 ≤ α_i ≤ C, 1 ≤ i ≤ N, Σ_{i=1}^{N} α_i y_i = 0.   (4)

To conclude, in order to design a soft margin SVM, a particular expression of the kernel function K and the magnitude of the constant C have to be selected in advance, and then the optimization QP-problem (4) has to be solved. If α* = (α*_1, ..., α*_N) is a solution of (4), then the parameter w* is

  w* = Σ_{i=1}^{N} α*_i y_i g(x_i).   (5)

Since the solutions of (4) do not involve the parameter b, its value should be determined such that the margin constraints hold for the support vectors; therefore more options concerning b are allowed ([3], [4]). One of the most used expressions of b is

  b* = (1/|SV'|) Σ_{i ∈ SV'} ( y_i - Σ_{j=1}^{N} α*_j y_j K(x_j, x_i) ),   (6)

where SV' is the set of indices i of the unbounded support vectors, that is 0 < α*_i < C.
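Given a solution α* of the dual problem (4), the decision function and a support-vector-based bias of the type just described can be sketched as follows; the toy dual solution below is worked out by hand for two 1-D points with a linear kernel, and all names are ours:

```python
def decision(alpha, y, X, b, K, x):
    """SVM decision value f(x) = sum_i alpha_i y_i K(x_i, x) + b."""
    return sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

def bias_from_support_vectors(alpha, y, X, C, K, eps=1e-8):
    """Support-vector bias estimate: average, over the unbounded support
    vectors (0 < alpha_i < C), of y_i - sum_j alpha_j y_j K(x_j, x_i)."""
    sv = [i for i, a in enumerate(alpha) if eps < a < C - eps]
    vals = [y[i] - sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(len(X)))
            for i in sv]
    return sum(vals) / len(vals)

# Hand-worked toy dual solution: two 1-D points, linear kernel, C = 1
linear_K = lambda u, v: u[0] * v[0]
X = [[-1.0], [1.0]]
y = [-1, 1]
alpha = [0.5, 0.5]          # optimal alphas for this toy problem
b = bias_from_support_vectors(alpha, y, X, 1.0, linear_K)   # here b = 0
```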

Adaptive Algorithms to Approximate the Parameters of a Soft Margin SVM Classifier
According to the arguments supplied in Section 2, the core problem in the computation of the soft margin SVM separating hyperplane is the QP-problem (4). Several methods for approximating a solution of (4) have been proposed so far.

A Variant of Platt's SMO Algorithm
The sequential minimal optimization (SMO) algorithm, introduced by Platt [11] and further extended by several authors ([12], [13]), is a simple algorithm that allows solving the SVM QP-problem without extra matrix storage, by decomposing the overall QP-problem into simple QP sub-problems, in a way similar to Osuna's method [14]. The idea of the SMO algorithm [11] is to solve the smallest possible optimization problem at each step; in the case of the QP-problem corresponding to the soft margin SVM, the smallest optimization sub-problem involves only two Lagrange multipliers. In this section we present a variant of Platt's SMO algorithm to approximate a solution of (4). Let K be a kernel satisfying the conditions of the Mercer theorem, where the corresponding parameter is given by (9); the value of the parameter should be adjusted to fit the particular datasets.
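The two-multiplier step at the heart of SMO can be sketched as follows. This is the widely taught simplified variant with a random choice of the second multiplier, not the authors' variant; Platt's heuristics for picking the pair, and all names below, are our own simplifications.

```python
import random

def f(x, alpha, y, X, b, K):
    """Current decision value f(x) = sum_k alpha_k y_k K(x_k, x) + b."""
    return sum(a * yk * K(xk, x) for a, yk, xk in zip(alpha, y, X)) + b

def smo_train(X, y, C, K, tol=1e-4, max_passes=5, max_iter=200):
    """Simplified SMO: pick a KKT-violating i, a random j, solve the
    two-multiplier sub-problem analytically and clip to [L, H] so that
    0 <= alpha <= C and sum_k alpha_k y_k = 0 are preserved."""
    n = len(X)
    alpha, b = [0.0] * n, 0.0
    passes = it = 0
    while passes < max_passes and it < max_iter:
        it += 1
        changed = 0
        for i in range(n):
            Ei = f(X[i], alpha, y, X, b, K) - y[i]
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = random.choice([k for k in range(n) if k != i])
                Ej = f(X[j], alpha, y, X, b, K) - y[j]
                ai, aj = alpha[i], alpha[j]
                if y[i] != y[j]:
                    L, H = max(0.0, aj - ai), min(C, C + aj - ai)
                else:
                    L, H = max(0.0, ai + aj - C), min(C, ai + aj)
                eta = 2 * K(X[i], X[j]) - K(X[i], X[i]) - K(X[j], X[j])
                if L == H or eta >= 0:
                    continue
                alpha[j] = min(H, max(L, aj - y[j] * (Ei - Ej) / eta))
                if abs(alpha[j] - aj) < 1e-7:
                    continue
                alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])
                # Re-estimate b from the KKT conditions of the updated pair
                b1 = (b - Ei - y[i] * (alpha[i] - ai) * K(X[i], X[i])
                      - y[j] * (alpha[j] - aj) * K(X[i], X[j]))
                b2 = (b - Ej - y[i] * (alpha[i] - ai) * K(X[i], X[j])
                      - y[j] * (alpha[j] - aj) * K(X[j], X[j]))
                b = b1 if 0 < alpha[i] < C else b2 if 0 < alpha[j] < C else (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

random.seed(0)
Xt = [[-2.0], [-1.0], [1.0], [2.0]]
yt = [-1, -1, 1, 1]
lin = lambda u, v: u[0] * v[0]
alphas, bias = smo_train(Xt, yt, C=1.0, K=lin)
```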

Comparative Analysis
The developments in this section analyze the effects of different choices of the bias parameter b on the generalization capacities, evaluated in terms of the mean recognition rate of the resulting classifier. It is well known that the value of the bias parameter cannot be computed by solving the QP-problem (4), and several computation rules expressed in terms of the support vectors have been proposed, as for instance (6). In our developments we used the expression (10) proposed in [17], where SV is the set of support vectors, in order to refine the bias by taking into account the relative importance of the support vectors.
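The genetic search for the bias can be sketched as follows; the population encoding, truncation selection, averaging crossover and Gaussian mutation below are our own illustrative assumptions, not the operators actually used in the paper.

```python
import random

def accuracy(b, alpha, y, X, K, data, labels):
    """Recognition rate of the classifier sign(sum_i alpha_i y_i K(x_i, x) + b)."""
    ok = 0
    for x, t in zip(data, labels):
        s = sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
        ok += (s >= 0) == (t > 0)
    return ok / len(data)

def genetic_bias(alpha, y, X, K, data, labels,
                 pop_size=20, gens=30, span=2.0, sigma=0.2, seed=1):
    """Toy genetic search over the scalar bias b: truncation selection,
    averaging crossover and Gaussian mutation, maximizing the
    recognition rate on a validation set."""
    rng = random.Random(seed)
    fit = lambda b: accuracy(b, alpha, y, X, K, data, labels)
    pop = [rng.uniform(-span, span) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fit, reverse=True)
        parents = pop[: pop_size // 2]                 # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            children.append((p + q) / 2 + rng.gauss(0.0, sigma))  # crossover + mutation
        pop = parents + children
    return max(pop, key=fit)

# Toy setting: the hand-worked dual solution for two 1-D points
lin = lambda u, v: u[0] * v[0]
Xs, ys, alphas = [[-1.0], [1.0]], [-1, 1], [0.5, 0.5]
best_b = genetic_bias(alphas, ys, Xs, lin, Xs, ys)
```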
Our tests were performed on artificially generated data coming from Gaussian distributions. We used two types of kernels, the Gauss Radial Basis Function (GRBF) and the Exponential Radial Basis Function (ERBF), where the value of the kernel parameter σ was determined such that the recognition rate is optimized. For instance, the results of the comparative analysis in the case when the data were generated from two Gaussian classes are presented in Figure 1, Figure 2 and Figure 3 and summarized in Table 1. The training data consisted of 150 examples coming from each class and the test data contained 400 examples coming from each class. The training and test data are represented in Figure 1 and Figure 2 respectively; the computed support vectors are depicted in Figure 3. The best recognition rate, 87.75%, was obtained in the case of the variant of the SMO algorithm described in Section 3, with the GRBF kernel and the bias computed by the genetic algorithm. The tests pointed out that the variation of the recognition rates depends on the inner structure of the classes from which the learning data come, as well as on their separability degree. Consequently, the results are encouraging and entail future work toward extending these refinements to multiclass classification problems and to approaches in a fuzzy-based framework.