
Expert Systems with Applications

Volume 41, Issue 18, 15 December 2014, Pages 8211-8224

Two ellipsoid Support Vector Machines

https://doi.org/10.1016/j.eswa.2014.07.015

Highlights

  • Easy to implement.

  • Works with any SVM library.

  • Speeds up the SVM training process.

  • Slightly improves classification quality.

  • Reduces number of support vectors.

Abstract

In classification problems, classes usually have different geometrical structures, and therefore it seems natural for each class to have its own margin type. Existing methods based on this principle lead to optimization problems different from the standard SVM one. Although they outperform the standard model, they also prevent the use of existing SVM libraries. We propose an approach, named 2eSVM, which allows such a method to be used within the classical SVM framework.

This enables a detailed comparison with the standard SVM. It turns out that classes in the resulting feature space are geometrically easier to separate and that the trained model has better generalization properties. Moreover, based on evaluation on standard datasets, 2eSVM brings considerable profit to the linear classification process in terms of training time and quality.

We also construct the kernelized 2eSVM and evaluate it on the 5-HT2A ligand activity prediction problem (real, fingerprint-based data from the cheminformatics domain), which shows increased classification quality, reduced training time and reduced complexity of the resulting model.

Introduction

Binary classification is a core problem in machine learning, relevant from both theoretical and practical perspectives. Over the past decade, the Support Vector Machine (SVM) model (Cortes & Vapnik, 1995) has gained great interest due to its sound mathematical formulation accompanied by a great number of empirical results. Several modifications have been proposed, ranging from changes of the norms used in the main SVM equation (Zhang, 2004), through a Bayesian treatment of the problem (Tipping, 2001), to generalizations from the original separating hyperplanes to hyperspheres and beyond (Le, Tran, Hoang, Ma, & Sharma, 2011). From the perspective of this paper, the crucial generalizations are those implementing the idea that every class should have its own margin type: Twin Mahalanobis SVM (Peng & Xu, 2012) and the Maxi-Min Margin Machine (M4) (Huang, Yang, King, & Lyu, 2008).

Most existing SVM modifications have shown their superiority over the classical method in some contexts and applications; however, in practice, Vapnik’s model (with later kernelization) is still the most commonly used. This is caused in particular by the fact that most SVM modifications require a considerable amount of time and specialist knowledge to use, while Vapnik’s model is implemented in most machine learning packages.

This is why in this paper we introduce the Two ellipsoid SVM (2eSVM) model, which uses two distinct margin types and allows easy implementation within the classical SVM framework. In fact, we treat the SVM as a Black Box and perform only pre- and post-processing of the data, see Fig. 1. Our approach not only allows the use of existing SVM libraries, but also enables a careful comparison with classical SVM modifications such as Mahalanobis SVM.

The main idea behind the Maxi-Min Margin Machine (M4) (Huang et al., 2008), on which we base our approach, is to seek the hyperplane which simultaneously maximizes the size of different margins for the classes $X_-$ and $X_+$. The process of finding the maximal separating margin in the standard SVM algorithm can be seen as searching for the biggest radius $r$ such that the sets $X_- + B(0,r)$ and $X_+ + B(0,r)$ are linearly separable, where $B(0,r)$ denotes the standard ball with radius $r$ centered at zero. From the geometrical and practical point of view it is better to use two different hyperellipsoids $B_-(0,r)$ and $B_+(0,r)$ (balls in different metrics) fitted to each class separately. Consequently, one seeks the maximal $r$ such that the sets $X_- + B_-(0,r)$ and $X_+ + B_+(0,r)$ are linearly separable, see Fig. 2.

As a result, the separating hyperplane is located nearer to the “vertical” class. This is a better solution, as the big horizontal variance of the other class suggests that more points drawn randomly from the underlying distributions, which lie on the x axis “between” these two ellipsoids, are actually members of the “horizontal” class.

This leads to a Second Order Cone Optimization Problem, which cannot be solved by the standard SVM procedure (Huang et al., 2008). We show, however, that we can implement a similar principle with a Black Box use of SVM by applying the following steps (a minimal code sketch is given after this list):

  • 1. transform the data (or the kernel) using a matrix computed from the sum of the classes’ covariances,

  • 2. train a classical SVM,

  • 3. shift the decision boundary.
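
The sketch below is a minimal illustration of these three steps. It relies on assumptions not stated above: toy random data, labels in {-1, +1}, scikit-learn’s LinearSVC as the Black Box SVM, and the transform taken as T = (Σ- + Σ+)^(-1/2) (consistent with the optimization formulation given later); the exact boundary-shift rule of step 3 is described in Section 4 and is only marked by a comment here.

    import numpy as np
    from scipy.linalg import sqrtm
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X_train = np.vstack([rng.normal(size=(100, 5)),         # negative class (toy data)
                         rng.normal(size=(100, 5)) + 2.0])  # positive class, shifted
    y_train = np.array([-1] * 100 + [1] * 100)

    def two_ellipsoid_transform(X, y):
        """T = (Sigma_minus + Sigma_plus)^(-1/2) built from the per-class covariances."""
        sigma = np.cov(X[y == -1], rowvar=False) + np.cov(X[y == 1], rowvar=False)
        sigma += 1e-8 * np.eye(sigma.shape[0])  # guard against a singular sum
        return np.real(sqrtm(np.linalg.inv(sigma)))

    # Step 1: transform the data with the matrix computed from the summed covariances.
    T = two_ellipsoid_transform(X_train, y_train)
    # Step 2: train a classical SVM (any SVM library can be used) on the transformed data.
    clf = LinearSVC(C=1.0).fit(X_train @ T, y_train)
    # Step 3: shift the decision boundary, i.e. post-process the learned offset
    # (clf.intercept_); the exact shift rule is given in Section 4.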

Evaluation on typical datasets shows that the first operation allows the SVM to separate the data more easily and faster, while the third helps to obtain better classification results. In particular, we obtain (see Section 6):

  • a two- to four-fold speed-up of the learning process for both the linear and the kernelized SVM,

  • a reduction of the mean number of support vectors in the resulting model by up to 10% for the kernelized SVM,

  • statistically significantly better generalization results than the classical approach.

The most important differences between 2eSVM and M4 are:

  • 2eSVM is much simpler to implement, as it requires just a few lines of algebraic operations,

  • 2eSVM is much more robust, as it uses the SVM as the underlying optimization problem, which is a quadratic program with linear constraints, while M4 requires second order cone optimization, a much more complex problem,

  • M4 requires custom optimization, while 2eSVM can be easily integrated with almost any existing SVM library,

  • however, even though 2eSVM implements an idea similar to M4, due to its approximate nature it achieves a smaller accuracy gain.

The idea behind our method is related to the problem of finding the best metric as well as to complexity reduction techniques. Metric learning concerns finding the best metric for a given model and data as an independent optimization problem. Methods of this type have been used to build a hybrid model combining the k-nearest neighbors and SVM concepts – LM-KNN (Weinberger & Saul, 2009). Multiple modifications of Support Vector Machines have also been presented (Do, Kalousis, Wang, & Woznica, 2012), including incorporating the metric optimization into the core SVM optimization itself (Zhu, Gong, Zhao, & Zhang, 2012). The Ellipsoid SVM model (Momma, Hatano, & Nakayama, 2010) is a particular example of such an approach, where one looks for the best-fitted hyper-ellipsoid around the data to construct the correct metric. These methods help the model better fit the underlying geometry of the data at the cost of additional computational requirements and, in general, increased complexity of the problem.

On the other hand, proximal SVM (Fung & Mangasarian, 2001) changes the basic formulation of the SVM to obtain a much simpler optimization problem. In this setting one searches for two parallel hyperplanes, around which the points of the particular classes are clustered, and which are as far from each other as possible. Twin SVM (Jayadeva, Khemchandani, & Chandra, 2007) generalizes this idea, allowing the two hyperplanes to be non-parallel and giving the model a better ability to fit the geometry of the data. The main strength of these methods lies in reducing the complexity of the optimization problem, either by weakening the parameter constraints (Fung & Mangasarian, 2001) or by solving multiple smaller problems (Jayadeva et al., 2007).

The proposed method differs from the above approaches, as it is based on closed-form pre- and postprocessing of the traditional SVM. It does not require solving any additional optimization problems or fitting any extra parameters. It tries to accommodate both of these concepts – better fitting the data geometry on one hand, and reducing the problem’s complexity on the other. One of the most important aspects of such an approach is that it can be easily incorporated into existing research methods and tools, without the need to change the model.

Concluding, the natural applications of the proposed model are problems where classes are very diverse in terms of internal geometry. These include in particular:

  • cheminformatic data – described in detail in the evaluation section,

  • Natural Language Processing (Dhanalakshmi et al., 2009, Moraes et al., 2013, Rushdi Saleh et al., 2011) – the vocabulary used in different contexts tends to be very diverse,

  • financial data (Brown & Mues, 2012) – in economic applications it is common to have a very small, geometrically condensed class (positive samples) surrounded by a big (negative) class.

Our paper is structured as follows: in the next section we present the geometrical intuition behind our method; next we focus on the theoretical justification of the proposed approach and show that, under reasonable assumptions, our algorithm finds a better solution to the classification problem than classical methods. In Section 4 we provide the pseudo-code of 2eSVM, with practical remarks regarding its usage. Next, the proposed approach is compared with the common covariance-matrix data transformation on both standard and real data (5-HT2A ligand activity prediction with respect to various fingerprint types (Smusz, Kurczab, & Bojarski, 2013)). We conclude with a short discussion.


Geometrical intuition

Let us present the geometrical motivation behind our idea. In the classical preprocessing approach in data analysis one transforms the input space through $\Sigma^{-1/2}$, where $\Sigma$ is the covariance matrix of the whole dataset. This results in a feature space where points are more radially distributed. However, such an approach completely ignores the fact that data in separate classes often have different covariances. Moreover, even if the inner class covariances are identical, the covariance of the whole dataset
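
To make this concrete, here is a small synthetic illustration (the data and numbers are purely illustrative and not taken from the paper): whitening with the covariance of the whole dataset leaves the per-class covariances both non-spherical and different from each other when the classes have different shapes.

    import numpy as np
    from scipy.linalg import sqrtm

    rng = np.random.default_rng(0)
    X_minus = rng.normal(size=(500, 2)) * [3.0, 0.5]            # elongated "horizontal" class
    X_plus = rng.normal(size=(500, 2)) * [0.5, 3.0] + [8.0, 0]  # elongated "vertical" class
    X = np.vstack([X_minus, X_plus])

    # Classical preprocessing: whiten with the covariance of the whole dataset.
    T = np.real(sqrtm(np.linalg.inv(np.cov(X, rowvar=False))))
    print(np.cov(X_minus @ T, rowvar=False))  # still far from the identity...
    print(np.cov(X_plus @ T, rowvar=False))   # ...and different from each other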

Theory

SVM aims at the construction of an (optimal) linear classifier between sets $X_+$ (positive examples) and $X_-$ (negative examples) in a Euclidean space $E$. Recall that sets $Y_-$ and $Y_+$ are linearly separable if there exist a nonzero $w \in E$ and $b \in \mathbb{R}$ such that $Y_- \subset \{x : \langle w, x\rangle < b\}$ and $Y_+ \subset \{x : \langle w, x\rangle > b\}$. Thus the largest margin problem, which gives the basis of SVM, can be stated in the following manner:

Geometrical formulation (classical SVM): Given sets $X_+, X_- \subset E$, find the maximum $r > 0$ such that the sets

Two ellipsoid SVM formulation

In the previous section we have shown that the sum of the classes’ covariances defines a scalar product which is a good approximation of the corresponding algebraic sum of ellipsoids. As a result, we can define the linear form of our problem in the language of the SVM optimization by simply substituting the Euclidean distance with the Mahalanobis one based on the summed covariance.

Optimization formulation (linear 2eSVM):

$$\min_{w,b}\ \tfrac{1}{2}\,\|w\|^2_{\Sigma_- + \Sigma_+} + C \sum_{i=1}^{N} \xi_i$$

subject to

$$y_i\big(\langle w, x_i\rangle_{\Sigma_- + \Sigma_+} - b\big) \ge 1 - \xi_i, \quad i = 1, \ldots, N.$$
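
To make explicit why this formulation can be handled by a standard SVM solver, the following short derivation sketch assumes (consistently with the formulation above) that the weighted scalar product is the Mahalanobis one, $\langle u, v\rangle_{\Sigma} = u^{\top}\Sigma^{-1}v$ with $\Sigma = \Sigma_- + \Sigma_+$:

$$\tilde{x}_i = \Sigma^{-1/2} x_i, \qquad \tilde{w} = \Sigma^{-1/2} w \quad\Longrightarrow\quad \langle w, x_i\rangle_{\Sigma} = \tilde{w}^{\top}\tilde{x}_i, \qquad \|w\|^2_{\Sigma} = \|\tilde{w}\|^2.$$

Hence, in the variables $(\tilde{w}, b)$ the problem above becomes the classical Euclidean SVM applied to the transformed points $\tilde{x}_i$, which is exactly what allows the Black Box use of any SVM library.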

For an efficient

Scalability

First, we show that the proposed method scales well in the linear case. One has to perform the following operations:

  • 1. computation of the data covariance matrix,

  • 2. computation of the inverse and the square root of the covariance matrix,

  • 3. matrix product between the transformed covariance matrix and the data.

Let us assume that we are given data $X \in \mathbb{R}^{N \times d}$. Using a naive approach, the first operation can be performed in $O(Nd^2)$ operations. As a result, it scales linearly with the number of points. The quadratic complexity in

Evaluation

The proposed approach consists of pre- and postprocessing methods which can work in the Black Box scenario with any existing SVM implementation. For this reason we do not compare our method to models which alter the optimization problem and require a custom implementation (such as Twin Mahalanobis SVM (Peng & Xu, 2012), the Ellipsoidal Kernel Machine (Momma et al., 2010) or the Maxi-Min Margin Machine (Huang et al., 2008)). In fact, 2eSVM (analogously to SVM or its basic adaptations) would be

Conclusions

In this paper we proposed a data pre- and postprocessing method for binary classification which is similar to the classical Mahalanobis-based transformation but exploits the differences between the distinct classes. Fitting our method into the standard SVM optimization formulation allows us to use it in the Black Box SVM scenario, which makes it widely applicable with a vast number of existing libraries. We give strong theoretical foundations for such a method based on the optimal

Acknowledgments

Research of Jacek Tabor was supported by the National Center of Science (Poland) grant no. 2011/01/B/ST6/01887 and work of Wojciech Marian Czarnecki was partially supported by the National Center of Science (Poland) grant no. 2013/09/N/ST6/03015. We would like to thank Sabina Smusz from the Institute of Pharmacology, Polish Academy of Sciences for providing us with cheminformatical data and the insights into the role of fingerprints in the ligand classification.

References

  • Zhou, L., et al. (2010). Least squares support vector machines ensemble models for credit scoring. Expert Systems with Applications.

  • Bache, K., & Lichman, M. (2013). UCI machine learning repository. URL:...

  • Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning.

  • Deadman, E., et al. Blocked Schur algorithms for computing the matrix square root.

  • Do, H., Kalousis, A., Wang, J., & Woznica, A. (2012). A metric learning perspective of SVM: On the relation of LMNN and...

  • Fung, G., & Mangasarian, O. L. (2001). Proximal support vector machine classifiers.
