Strathprints Institutional Repository Combining MLC and SVM Classifiers for Learning based Decision Making: Analysis and Evaluations

Abstract: Maximum likelihood classifier (MLC) and support vector machines (SVM) are two commonly used approaches in machine learning. MLC is based on Bayesian theory and estimates the parameters of a probabilistic model, whilst SVM is an optimization-based non-parametric method. Recently, it has been found that SVM is in some cases equivalent to MLC in probabilistic modeling of the learning process. In this paper, MLC and SVM are combined in learning and classification, which helps to yield a probabilistic output for SVM and facilitates soft decision making. In total, four groups of data are used for evaluation, covering sonar, vehicle, breast cancer and DNA sequences. The data samples are characterized as Gaussian or non-Gaussian distributed and as balanced or unbalanced, and these characteristics are then used in comparing the performance of the SVM with that of the combined SVM-MLC classifier. The reported results indicate how the combined classifier may work under various conditions.


Introduction
Maximum likelihood classification (MLC) is one of the most commonly used approaches in signal classification and identification, and has been successfully applied in a wide range of engineering applications including classification for digital amplitude-phase modulations [1], remote sensing [2], gene selection for tissue classification [3], nonnative speech recognition [4], chemical analysis in archaeological applications [5] and speaker recognition [6]. On the other hand, support vector machines (SVM) have attracted increasing attention and can be found in almost all areas where prediction and classification of signals is required, such as scour prediction on grade-control structures [7], fault diagnosis [8], EEG signal classification [9], and fire detection [10] as well as road sign detection and recognition [11].
Based on the principles of Bayesian statistics, MLC provides a parametric approach to decision making, where the model parameters need to be estimated before they are applied for classification. In contrast, SVM is a non-parametric approach whose theoretical background is supervised machine learning. Due to the differences between these two classifiers, their performance can differ considerably. Taking the application in remote sensing as an example, Pal and Mather [12] and Huang et al [13] found that SVM outperforms MLC and several other classifiers. In Waske and Benediktsson [14], SVM produces better results from SAR images, yet in most cases it generates worse results than MLC from TM images. In Szuster et al [15], SVM only yields slightly better results than MLC for land cover analysis. As a result, a detailed assessment of the conditions under which SVM outperforms or appears inferior to MLC is worth further investigation.
Furthermore, there is a trend to combine the principle of MLC, Bayesian theory, with SVM for improved classification. In Ren et al [16], Bayesian minimum error classification is applied to the predicted outputs of SVM for error-reduced optimal decision-making. Similarly, in Hsu et al [17], Bayesian decision theory is applied in SVM for imbalance measurement and feature optimization for improved performance. In Vega et al [18], Bayesian statistics is combined with SVM for parameter optimization. In Vong et al [19], Bayesian inference is applied to estimate the hyper-parameters used in SVM learning to speed up the training process. In Foody [20], the relevance vector machine (RVM), a Bayesian extension of SVM, is proposed, which enables an estimate of the posterior probability of class membership where conventional SVMs fail to do so. Consequently, in-depth analysis of the two classifiers is desirable to discover their pros and cons in machine learning.
In this paper, analysis and evaluation of SVM and MLC are emphasized, using data from various applications. Since the selected data satisfy certain conditions in terms of specific sample distributions, we aim to find out how the performance of the classifiers is connected to the particular data distributions. As a consequence, the work and the results shown in the paper help us to understand how these classifiers work, which can then provide insightful guidance on how to select and combine them in real applications.
The remaining parts of the paper are organized as follows. Section 2 introduces the principles of the two classifiers. Section 3 describes the data and methods that have been used, and the experimental results and evaluations are analyzed and discussed in Section 4. Concluding remarks are given in Section 5.

MLC and SVM Revisited
In this section, the principles of the two classifiers, SVM and MLC, are discussed. By comparing their theoretical background and implementation details, the two classifiers are characterized in terms of their performance during the training and testing processes. This in turn has motivated our work in the following sections.

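Let $\mathbf{x}_i$, $i \in [1, N]$, denote the $i$-th sample, and let $y_i$ denote the class label associated with $\mathbf{x}_i$, i.e. in total we have $C$ classes denoted as $\omega_c$, $c \in [1, C]$. The basic assumption of MLC is that for each class of data the feature space satisfies a specified distribution, usually Gaussian, and that the samples are independent of each other. To this end, the likelihood (probability) for samples within the $c$-th class, $\omega_c$, is given as follows.

$$p(\mathbf{x} \mid \omega_c) = \frac{1}{(2\pi)^{d/2} |S_c|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_c)^T S_c^{-1} (\mathbf{x} - \boldsymbol{\mu}_c) \right) \qquad (1)$$

where $d$ is the dimension of the feature space, and $\boldsymbol{\mu}_c$ and $S_c$ respectively denote the mean vector and co-variance matrix of all $N_c$ samples within $\omega_c$, which can be determined using maximum likelihood estimation as

$$\boldsymbol{\mu}_c = \frac{1}{N_c} \sum_{\mathbf{x}_i \in \omega_c} \mathbf{x}_i \qquad (2)$$

$$S_c = \frac{1}{N_c} \sum_{\mathbf{x}_i \in \omega_c} (\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^T \qquad (3)$$

For a given sample $\mathbf{x}_i$, the probability that it belongs to class $\omega_c$ can be denoted as $p(\omega_c \mid \mathbf{x}_i)$, and the sample is assigned to the class for which this probability is the highest:

$$c^* = \arg\max_{c} \; p(\omega_c \mid \mathbf{x}_i) \qquad (4)$$

Based on Bayesian theory, we have

$$p(\omega_c \mid \mathbf{x}_i) = \frac{p(\mathbf{x}_i \mid \omega_c)\, p(\omega_c)}{p(\mathbf{x}_i)} \qquad (5)$$

Since $p(\mathbf{x}_i)$ is a constant in Eq. (5) when $\mathbf{x}_i$ is given, Eq. (4) can be re-written as

$$c^* = \arg\max_{c} \; p(\mathbf{x}_i \mid \omega_c)\, p(\omega_c) \qquad (6)$$

Applying the logarithm to the right side of Eq. (6), and letting $g_c(\mathbf{x}) = \ln \left[ p(\mathbf{x} \mid \omega_c)\, p(\omega_c) \right]$, we have

$$c^* = \arg\max_{c} \; g_c(\mathbf{x}_i) \qquad (7)$$

$$g_c(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_c)^T S_c^{-1} (\mathbf{x} - \boldsymbol{\mu}_c) - \frac{1}{2} \ln |S_c| + \ln p(\omega_c) - \frac{d}{2} \ln (2\pi) \qquad (8)$$

Again we can ignore the constant in Eq. (8) and simplify the discriminating function as

$$g_c(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_c)^T S_c^{-1} (\mathbf{x} - \boldsymbol{\mu}_c) - \frac{1}{2} \ln |S_c| + \ln p(\omega_c) \qquad (9)$$

which is a quadratic function of $\mathbf{x}$ depending on three parameters, i.e. $\boldsymbol{\mu}_c$, $S_c$ and $p(\omega_c)$. When the class $c$ is specified, these parameters are determined, hence the quadratic function only depends on the class $c$ and the input sample $\mathbf{x}$. It is also worth noting that the third item, $\ln p(\omega_c)$, is actually a constant.
In the particular case when $p(\omega_c)$ is a constant for all $c$, i.e. the prior probability that a sample belongs to each of the classes is equal, $\ln p(\omega_c)$ in Eq. (9) can be ignored, hence the discriminating function is re-written as

$$g_c(\mathbf{x}) = -(\mathbf{x} - \boldsymbol{\mu}_c)^T S_c^{-1} (\mathbf{x} - \boldsymbol{\mu}_c) - \ln |S_c| \qquad (10)$$

where the scalar 1/2 is also ignored as it makes no difference when Eq. (7) is applied for decision-making. However, such a simplification cannot be made unless we have clear knowledge about the equal distribution of the samples over the $C$ classes.
Based on Eq. (10), the decision function can be further simplified if the total number of classes is reduced to two, where the two classes are denoted as -1 and 1 and the sign function is introduced for simplicity.

$$g(\mathbf{x}) = \mathrm{sgn}\left( (\mathbf{x} - \boldsymbol{\mu}_{-1})^T S_{-1}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{-1}) - (\mathbf{x} - \boldsymbol{\mu}_{1})^T S_{1}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{1}) + \ln |S_{-1}| - \ln |S_{1}| \right) \qquad (11)$$

Moreover, in the special case when $S_1 = S_{-1} = S$, the quadratic decision function in Eq. (11) becomes a linear one as

$$g(\mathbf{x}) = \mathrm{sgn}\left( 2 (\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{-1})^T S^{-1} \mathbf{x} + \boldsymbol{\mu}_{-1}^T S^{-1} \boldsymbol{\mu}_{-1} - \boldsymbol{\mu}_{1}^T S^{-1} \boldsymbol{\mu}_{1} \right) \qquad (12)$$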
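As an illustration of the Gaussian MLC described above, the following is a minimal sketch, assuming NumPy is available. It estimates the per-class mean vector, co-variance matrix and prior from labelled data, and assigns each sample to the class with the largest quadratic discriminant value. The function names (`fit_mlc`, `discriminant`, `predict_mlc`) and the synthetic two-class data are assumptions made for illustration only and are not taken from the paper.

```python
import numpy as np

def fit_mlc(X, y):
    """Estimate mean vector, co-variance matrix and prior for every class."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        diff = Xc - mu
        S = diff.T @ diff / len(Xc)      # maximum likelihood estimate (divides by N_c)
        model[c] = (mu, S, len(Xc) / len(X))
    return model

def discriminant(x, mu, S, prior):
    """Quadratic discriminant: -1/2 (x-mu)^T S^-1 (x-mu) - 1/2 ln|S| + ln prior."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * diff @ np.linalg.solve(S, diff) - 0.5 * logdet + np.log(prior)

def predict_mlc(model, X):
    """Assign every row of X to the class with the largest discriminant value."""
    classes = list(model)
    scores = np.array([[discriminant(x, *model[c]) for c in classes] for x in X])
    return np.array(classes)[scores.argmax(axis=1)]

# Synthetic, well separated Gaussian classes (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

model = fit_mlc(X, y)
pred = predict_mlc(model, X)
accuracy = float((pred == y).mean())
```

Using `slogdet` and `solve` instead of an explicit determinant and inverse is a numerical convenience; the sketch still fails when a class co-variance matrix is singular.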

The Support Vector Machine (SVM)
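SVM was originally developed for two-class classification problems. In Cortes and Vapnik [22], the principles of SVM are comprehensively discussed. Let the two classes be denoted as 1 and -1. Similar to the decision function for MLC in Eq. (11), the decision function for linear SVM is given by

$$g(\mathbf{x}_i) = \mathbf{w}^T \mathbf{x}_i + b \; \begin{cases} \geq 1, & y_i = 1 \\ \leq -1, & y_i = -1 \end{cases} \qquad (13)$$

where $y_i$ denotes the labeled value for the input sample $\mathbf{x}_i$; $\mathbf{w}$ and $b$ are parameters to be determined in the training process. Multiplying both sides of the discriminating function $g$ by $y_i$, this can be further simplified as

$$y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \qquad (14)$$

Note that the decision function in Eq. (13) is actually equivalent to the one in Eq. (11) if we adjust the scalar for $b$, yet Eq. (13) is more feasible as it has increased the decision margin between the two classes from near zero to $2 / \|\mathbf{w}\|$. The optimal hyperplane to separate the training data with a maximal margin is defined by

$$\mathbf{w}_o^T \mathbf{x} + b_o = 0 \qquad (15)$$

where $\mathbf{w}_o$ and $b_o$ are the determined parameters, and the maximal distance becomes $2 / \|\mathbf{w}_o\|$. To determine this optimal hyperplane, we need to maximize $2 / \|\mathbf{w}\|$, i.e. to minimize $\frac{1}{2} \|\mathbf{w}\|^2$ subject to Eq. (14). Using Lagrangian multipliers $\alpha_i$, this optimization problem can be solved by

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i} \alpha_i \left[ y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 \right] \qquad (16)$$

Eventually, the parameters $\mathbf{w}_o$ and $b_o$ are decided as

$$\mathbf{w}_o = \sum_{i} \alpha_i y_i \mathbf{x}_i, \quad \alpha_i \geq 0, \qquad b_o = y_s - \mathbf{w}_o^T \mathbf{x}_s \qquad (17)$$

where $\mathbf{x}_s$ is any sample with non-zero $\alpha_s$. For any non-zero $\alpha_i$, the corresponding $\mathbf{x}_i$ is denoted as a support vector, which naturally satisfies $y_i (\mathbf{w}_o^T \mathbf{x}_i + b_o) = 1$. Therefore, $\mathbf{w}_o$ is actually a linear combination of all support vectors. Also we have $\sum_i \alpha_i y_i = 0$. Eventually, if we combine Eq. (17) with Eq. (13), the discrimination function for any test sample $\mathbf{x}$ becomes

$$g(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b_o \right) \qquad (18)$$

which solely relies on the inner product of the support vectors and the test sample.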
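For nonlinear problems which are not linearly separable, the discrimination function is extended as

$$g(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i} \alpha_i y_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x}) + b_o \right) \qquad (19)$$

where $\phi$ aims to map the input samples to another space and thus make them linearly separable. Another important step is to introduce the kernel trick to calculate the inner product of mapped samples, i.e. $K(\mathbf{x}_i, \mathbf{x}) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$, which avoids the difficulty in determining the mapping function $\phi$ and also the cost of calculating the mapped samples and their inner product. Several typical kernels, including linear, polynomial and radial basis function (RBF), are summarized below.

$$K(\mathbf{x}_i, \mathbf{x}_j) = \begin{cases} \mathbf{x}_i^T \mathbf{x}_j & \text{linear} \\ (\mathbf{x}_i^T \mathbf{x}_j + 1)^p & \text{polynomial} \\ \exp\left( -\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2 \right) & \text{RBF} \end{cases} \qquad (20)$$

where optimal values for the associated parameters $p$ and $\gamma$ are determined automatically during the training process.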
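To make the role of the kernel in the discrimination function concrete, here is a minimal sketch in plain Python. It evaluates the sign of the kernel-weighted sum over support vectors for a test sample. The kernel parameterization (`p`, `gamma`), the function names and the two hand-picked support vectors with their multipliers are assumptions for illustration; they are not values produced by any training run in the paper.

```python
import math

def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, p=2):
    return (dot(u, v) + 1.0) ** p

def rbf_kernel(u, v, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, labels, alphas, b, kernel):
    """Return sign(sum_i alpha_i * y_i * K(x_i, x) + b) as +1 or -1."""
    s = sum(a * y * kernel(sv, x)
            for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if s >= 0 else -1

# Hand-picked toy values (hypothetical): two support vectors placed
# symmetrically around the origin, with multipliers that sum to zero
# when weighted by the labels.
svs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

label_pos = svm_decision((2.0, 0.5), svs, ys, alphas, b, linear_kernel)
label_neg = svm_decision((-0.3, -2.0), svs, ys, alphas, b, rbf_kernel)
```

Passing the kernel as a function argument keeps the decision rule identical for the linear and non-linear cases, which mirrors how the kernel trick replaces the inner product without changing the rest of the formulation.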
Though SVM is initially developed for two-class problems, it has been extended to deal with multi-class classification based on either combination of decision results from multiple two-class classification or optimization on multi-class based learning.Some useful further readings can be found in [23], [24] and [25].

Analysis and Comparisons
MLC and SVM are two useful tools for classification problems, and both rely on supervised learning in determining the model and parameters. However, they are different in several ways, as summarized below.
Firstly, MLC is a parametric approach with the basic assumption that the data satisfy a Gaussian distribution. In contrast, SVM is a non-parametric approach and has no requirement on the prior distribution of the data, yet various kernels can be empirically selected to deal with different problems.
Secondly, for MLC the model parameters, $\boldsymbol{\mu}_c$ and $S_c$, can be directly estimated using the training data before they are applied for testing and prediction. However, SVM relies on supervised machine learning, in an iterative way, to determine a large number of parameters including $\mathbf{w}_o$, $b_o$, all non-zero $\alpha_i$ and their corresponding support vectors.
Thirdly, MLC can be straightforwardly applied to two-class and multi-class problems, yet an additional extension is needed for SVM to deal with multi-class problems as it was initially developed for two-class classification.
Finally, a posterior class probabilistic output for the predicted results can be intuitively generated from MLC, which is a valuable indicator for classification to show how likely a sample belongs to a given class. For SVM, however, this is not an easy task, though some extensions have been introduced to provide such an output based on the predicted value from SVM. In Platt [26], a posterior class probability $p_i$ is estimated by the sigmoid function below.

$$p_i = P(y_i = 1 \mid \mathbf{x}_i) \approx \frac{1}{1 + \exp(A f_i + B)} \qquad (21)$$

where $f_i$ denotes the predicted (uncalibrated) output value of the SVM for $\mathbf{x}_i$. The parameters $A$ and $B$ are determined by solving a regularized maximum likelihood problem as follows.

$$\min_{A, B} \; -\sum_{i} \left[ t_i \ln p_i + (1 - t_i) \ln (1 - p_i) \right] \qquad (22)$$

$$t_i = \begin{cases} \dfrac{N_{1} + 1}{N_{1} + 2}, & y_i = 1 \\ \dfrac{1}{N_{-1} + 2}, & y_i = -1 \end{cases} \qquad (23)$$

where $N_{1}$ and $N_{-1}$ denote the number of support vectors labeled in class 1 and -1, respectively. In addition, in Lin et al [27] Platt's approach is further improved to avoid any numerical difficulty, i.e. overflow or underflow, in determining $p_i$.
Although there are significant differences between SVM and MLC, the probabilistic model above has uncovered the connection between these two classifiers. Actually, in Franc et al [21] MLC and SVM are found to be equivalent to each other in linear cases, and this is also supported by the similar decision functions in Eq. (11) and Eq. (13).
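The sigmoid mapping of Platt [26] can be sketched as follows in plain Python. This is a hedged illustration rather than the procedure used in the paper: the regularized targets are computed from the counts of the labels passed in, and the parameters A and B are fitted by plain gradient descent instead of the more robust optimizers described by Platt [26] and Lin et al [27]. The function names, the learning rate, the step count and the toy scores are all assumptions.

```python
import math

def platt_targets(labels):
    """Regularized targets t_i computed from the class counts of `labels`."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    return [t_pos if y == 1 else t_neg for y in labels]

def platt_probability(f, A, B):
    """Sigmoid estimate of P(y = 1 | f) for an SVM output value f."""
    return 1.0 / (1.0 + math.exp(A * f + B))

def negative_log_likelihood(scores, targets, A, B):
    """Cross-entropy between the targets and the sigmoid outputs."""
    total = 0.0
    for f, t in zip(scores, targets):
        p = platt_probability(f, A, B)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total

def fit_platt(scores, labels, lr=0.01, steps=5000):
    """Fit A and B by gradient descent (a simplification for illustration)."""
    targets = platt_targets(labels)
    A, B = 0.0, 0.0
    for _ in range(steps):
        grad_A = grad_B = 0.0
        for f, t in zip(scores, targets):
            p = platt_probability(f, A, B)
            grad_A += (t - p) * f      # d(NLL)/dA
            grad_B += (t - p)          # d(NLL)/dB
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B

# Toy SVM output values and labels (illustrative only).
scores = [-2.0, -1.0, 0.3, -0.2, 1.0, 2.0]
labels = [-1, -1, -1, 1, 1, 1]
A, B = fit_platt(scores, labels)
p_high = platt_probability(2.0, A, B)
p_low = platt_probability(-2.0, A, B)
```

A fitted A is expected to be negative, so that larger SVM output values map to larger probabilities of class 1. The direct use of `math.exp` can overflow for very large arguments, which is the kind of numerical difficulty addressed in Lin et al [27].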

Data and Methods
The datasets
In our experiments, four different datasets, SamplesNew, svmguide3, sonar and splice, are used. Among these four datasets, SamplesNew is a dataset of suspicious microcalcification clusters extracted from [16] and svmguide3 is a demo dataset of the practical SVM guide [28], whilst the sonar and splice datasets come from the UCI repository of machine learning databases [29]. Two principles are applied in selecting these datasets: the first is how evenly the samples are distributed over the two classes, and the second is whether the feature distributions are Gaussian-like. As can be seen, the first two datasets are severely imbalanced, especially the first one, as there are many more data samples in one class than in the other. On the other hand, the last two datasets are quite balanced. Regarding feature distributions, SamplesNew and svmguide3 are apparently non-Gaussian distributed, yet the other two, sonar and splice, show approximately Gaussian characteristics when the variables are observed separately. This is also validated by the Pearson's moment coefficient of skewness determined below [30],

$$s_i = E\left[ \left( \frac{X_i - \mu_i}{\sigma_i} \right)^3 \right] \qquad (24)$$

where $X_i$ is the $i$-th dimension of the dataset, $\mu_i$ and $\sigma_i$ are its mean and standard deviation, and $E(\cdot)$ refers to mathematical expectation. When the skewness coefficients are determined for each data dimension, the maximum, the minimum and the average skewness coefficients are obtained and shown in Table 1 for comparison.
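The per-dimension skewness check described above can be sketched in plain Python as follows. The helper names and the small example table are assumptions for illustration; the values are not from any of the four datasets.

```python
import math

def skewness(values):
    """Pearson's moment coefficient of skewness, E[((x - mu) / sigma)^3].

    Uses the population standard deviation; raises ZeroDivisionError
    for a constant column.
    """
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return sum(((v - mu) / sigma) ** 3 for v in values) / n

def skewness_summary(rows):
    """Return (max, min, mean) of the skewness over all feature columns."""
    columns = list(zip(*rows))
    coeffs = [skewness(col) for col in columns]
    return max(coeffs), min(coeffs), sum(coeffs) / len(coeffs)

# Illustrative data: the first column is symmetric, the second is right-skewed.
rows = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (4.0, 1.0), (5.0, 10.0)]
max_skew, min_skew, mean_skew = skewness_summary(rows)
```

A coefficient near zero is consistent with a symmetric (for example Gaussian-like) distribution, while a large magnitude indicates a strongly skewed feature.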

The approach
In our approach, a combined classifier using SVM and MLC is applied, which contains the following three stages. In Stage 1, SVM is used for initial training and classification. The samples correctly classified by SVM are then employed in Stage 2, where MLC is applied for probability-based modeling. The probability-based models are utilized in Stage 3 for improved decision making and better classification. Details of these three stages are discussed as follows.

Stage 1: SVM for initial training and classification
The open source library libSVM [28] is used for initial training and classification of the aforementioned four datasets, and both the linear and the Gaussian radial basis function (RBF) kernels are tested. For each dataset, all the data are normalized to [-1, 1] before SVM is applied. Through 5-fold cross validation, the best group of parameters, including the cost and the gamma value, is determined. Eventually, the optimal parameters are used for classification of our datasets.
In our experiments, the training ratios are set at three different levels, i.e. 80%, 65% and 50%. There is no overlap between training data and testing data. At a given training ratio, the random selection of training data is repeated five times, which generates 5 groups of test results. Finally, the average performance over these five experiments is used for comparison.
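The data preparation described in this stage can be sketched as below in plain Python: each feature is linearly scaled to [-1, 1], and non-overlapping random training/testing index splits are drawn five times at a given training ratio. This is only an illustrative sketch with assumed helper names and a fixed random seed; the SVM training and the 5-fold cross validation for the cost and gamma values are carried out with libSVM [28] in the paper and are not reproduced here.

```python
import random

def scale_to_unit_range(rows):
    """Linearly map every feature (column) of the dataset to [-1, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            0.0 if hi == lo else 2.0 * (v - lo) / (hi - lo) - 1.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled

def random_split(n_samples, train_ratio, rng):
    """Return disjoint lists of training and testing indices."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(train_ratio * n_samples))
    return indices[:n_train], indices[n_train:]

def repeated_splits(n_samples, train_ratio, repeats=5, seed=0):
    """Draw `repeats` independent random splits at the given training ratio."""
    rng = random.Random(seed)
    return [random_split(n_samples, train_ratio, rng) for _ in range(repeats)]

scaled = scale_to_unit_range([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
splits = {ratio: repeated_splits(100, ratio) for ratio in (0.8, 0.65, 0.5)}
```

The performance reported for a training ratio would then be the average of the accuracies obtained on the five splits in `splits[ratio]`.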

Stage 2: Using MLC for probability-based modeling
The correctly classified samples, which lie in two classes, i.e. class 0 and class 1, are used to determine two probability-based models in the way discussed for MLC. In other words, the samples correctly classified in class 0 are used to determine the mean vector and the corresponding co-variance matrix of class 0, whereas the samples correctly classified in class 1 are used to determine the mean vector and the corresponding co-variance matrix of class 1. Note that not all samples in class 0 or class 1 are used in calculating the related MLC models, as those which cannot be correctly classified by SVM are treated as outliers and ignored in MLC modeling for robustness.
After MLC modeling, for each sample $\mathbf{x}$, the associated likelihoods that it belongs to the two classes are re-calculated and denoted as $p_0(\mathbf{x})$ and $p_1(\mathbf{x})$. As a result, the decision for classification is simplified to the comparison of these two likelihoods against a threshold in Eq. (25), where the threshold is optimally determined to generate the best classified results. Please note that the likelihoods (or probability values) here can also be taken as a probabilistic output of the SVM.

Stage 3: Improved classification
With the estimated MLC models and the optimal threshold, all samples are then re-checked for improved classification, using Eq. (25) and the determined likelihoods $p_0(\mathbf{x})$ and $p_1(\mathbf{x})$, accordingly. Results on the four datasets are given and analyzed in detail in the next section.
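Stages 2 and 3 can be sketched as follows, assuming NumPy is available. Several choices here are assumptions and not the paper's stated method: the decision rule is taken to be a threshold on the log-likelihood ratio between the two class models, the threshold is picked by an exhaustive search for the best accuracy on labelled data, and the initial "SVM" predictions are replaced by a simple stand-in rule so that the block is self-contained. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(X, mu, S):
    """Log of the Gaussian density for every row of X."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(S)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(S), diff)
    return -0.5 * (maha + logdet + d * np.log(2.0 * np.pi))

def fit_class_models(X, y, first_stage_pred):
    """Stage 2: one Gaussian per class, fitted only on samples that the
    first-stage classifier labelled correctly (others are treated as outliers)."""
    models = {}
    for c in (0, 1):
        Xc = X[(y == c) & (first_stage_pred == c)]
        models[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False, bias=True))
    return models

def log_ratio(X, models):
    """ln p_1(x) - ln p_0(x) for every row of X."""
    return (gaussian_log_likelihood(X, *models[1])
            - gaussian_log_likelihood(X, *models[0]))

def best_threshold(scores, y):
    """Assumed selection rule: the candidate giving the highest accuracy."""
    candidates = np.sort(scores)
    accs = [np.mean((scores >= t).astype(int) == y) for t in candidates]
    return candidates[int(np.argmax(accs))]

def reclassify(X, models, threshold):
    """Stage 3: re-check all samples with the class models and the threshold."""
    return (log_ratio(X, models) >= threshold).astype(int)

# Synthetic two-class data and a stand-in for the first-stage SVM output.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 1.0, size=(100, 2)),
               rng.normal(1.5, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
first_stage_pred = (X[:, 0] > 0).astype(int)   # hypothetical stand-in, not an SVM

models = fit_class_models(X, y, first_stage_pred)
threshold = best_threshold(log_ratio(X, models), y)
combined_pred = reclassify(X, models, threshold)
combined_accuracy = float(np.mean(combined_pred == y))
```

The explicit matrix inverse fails or becomes unreliable when a class co-variance matrix is singular, in which case the models cannot be used to refine the first-stage results.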

Results and Evaluations
For the four datasets discussed in Section 3, the experimental results are reported and analyzed in this section. Firstly, we discuss results from a combined classifier of MLC and a linear SVM. Then, results from MLC and an RBF-based SVM are compared. In addition, how different re-balancing strategies affect the performance on unbalanced datasets is also discussed.

Results from a linear SVM and the MLC
In this group of experiments, a combined classifier using a linear SVM and the MLC is employed, and the relevant results are presented in Fig. 1. In Fig. 1, we plot the classification rate, i.e. the prediction accuracy, against the training ratio, i.e. the percentage of data used for training. Three training ratios, 80%, 65% and 50%, are used. Please note that due to degradation of the co-variance matrix, the MLC cannot be used to improve the results for the SamplesNew dataset. Consequently, the results from the SVM are taken as the output of the combined classifier. For the other three datasets, the results are summarized and compared as follows.
Firstly, for the three datasets, sonar, splice and svmguide3, the combined solution yields significantly improved results in training, especially for the first two datasets. This demonstrates that the combined classifier can indeed achieve more accurate modeling of the datasets. In addition, possibly due to over-fitting, a larger training ratio does not necessarily improve the training performance. However, the testing results are somewhat different. For the sonar dataset, which is balanced and appears near Gaussian distributed, the combined classifier yields much improved results in testing, especially when the training ratios are 80% and 50%. Such results are not surprising as the MLC is ideal for modeling Gaussian-like distributed datasets. For the splice dataset, which is balanced and also nearly Gaussian distributed, slightly improved testing results are also produced by the combined classifier at training ratios of 80% and 50%, but the testing results at the training ratio of 65% become slightly worse than those from the SVM. For the more challenging svmguide3 dataset, which is unbalanced and non-Gaussian distributed, although the combined classifier yields improved testing results at the training ratio of 50%, the results at the other two training ratios, perhaps due to over-fitting, seem inferior to the results from the SVM. By nature, the MLC has difficulty in modeling non-Gaussian distributed datasets, and this explains why the combined classifier contributes less for such datasets.

Results from a RBF-kernelled SVM and the MLC
In this group of experiments, the RBF kernel is used for the SVM in the combined classifier as it is popularly used in various classification problems [16,23]. For the four datasets we used, the training results and the testing results under three different training ratios are again summarized and given in Fig. 2 for comparison.
First of all, the RBF-kernelled SVM (R-SVM) produces much better results than the linear SVM, especially for the training results. In fact, the combined classifier generates better results than the SVM only for the SamplesNew dataset, slightly worse results for the sonar and splice datasets, and much degraded results for the svmguide3 dataset.
Regarding testing results, the combined classifier generates comparable or slightly worse results for the SamplesNew and svmguide3 datasets, whereas R-SVM yields better results for the splice and sonar datasets. The reason is that results from the non-linear kernel in R-SVM cannot be directly refined using MLC. Also, the results from the combined classifier occasionally seem more sensitive to the training ratio, especially for the splice dataset, perhaps because the threshold to be determined depends more or less on the training data used.

Testing on Re-balanced Data
In this group of experiments, using the challenging dataset svmguide3, we analyze how various strategies to rebalance the unbalanced data may affect the classification performance. For an unbalanced dataset, samples from one class are over-represented relative to those in the other class. As a result, we can either over-sample the minority class or sub-sample the majority class to balance the number of samples represented in the training set for better modeling of the data. On the other hand, the test samples remain unbalanced as it is assumed we have no label information for the test samples.
For over-sampling, data samples in the minority class are randomly duplicated and inserted into the dataset. The replication of data items continues until the entire training set becomes balanced. In contrast to over-sampling, sub-sampling randomly discards samples from the majority class until the training set becomes balanced. Since the performance may be affected by which samples are duplicated or discarded, this process is repeated over 10 times and the average performance is then recorded for comparison.
Using three different training ratios of 80%, 65% and 50%, results of balanced learning for the svmguide3 dataset are summarized in Fig. 3. Under a given training ratio, both training results and testing results are presented in groups, where each group contains results from 6 different experimental scenarios. In addition, the results from the linear SVM and the RBF-kernelled SVM are shown for comparison as well.
When the linear SVM is used, as shown in the first row of Fig. 3, surprisingly, the results from unbalanced data are much better than those from balanced data. Also, in the majority of cases, the combined classifier outperforms the SVM classifier in both training and testing, even with balanced learning introduced. The testing results from SVM for balanced learning via over-sampling seem better than those from sub-sampling, yet it seems that the combined classifier produces better results from sub-sampling based balanced learning.
For the RBF-kernelled SVM, the training results from SVM via over-sampling are among the best, though the testing results are inferior to those from unbalanced training. This indicates that the training process has over-fitted in this context. In fact, testing results from the combined classifier are slightly worse than those from the SVM classifier, i.e. there is some degradation. Again, this is caused by the inconsistency between the non-linear SVM and the linear nature of the MLC.
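The two re-balancing strategies can be sketched in plain Python as follows. Over-sampling duplicates randomly chosen minority samples until the classes are equal in size, and sub-sampling keeps a random subset of the majority class. The function names, the seed and the toy data are assumptions for illustration.

```python
import random
from collections import defaultdict

def _group_by_class(samples, labels):
    """Collect the samples of each class label in insertion order."""
    groups = defaultdict(list)
    for s, y in zip(samples, labels):
        groups[y].append(s)
    return groups

def oversample(samples, labels, rng):
    """Randomly duplicate minority samples until all classes are equal in size."""
    groups = _group_by_class(samples, labels)
    target = max(len(items) for items in groups.values())
    out = []
    for y, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out.extend((s, y) for s in items + extra)
    rng.shuffle(out)
    return [s for s, _ in out], [y for _, y in out]

def subsample(samples, labels, rng):
    """Randomly discard majority samples until all classes are equal in size."""
    groups = _group_by_class(samples, labels)
    target = min(len(items) for items in groups.values())
    out = []
    for y, items in groups.items():
        out.extend((s, y) for s in rng.sample(items, target))
    rng.shuffle(out)
    return [s for s, _ in out], [y for _, y in out]

# Toy unbalanced training set: 8 samples in class 1, 2 samples in class 0.
samples = list(range(10))
labels = [1] * 8 + [0] * 2
rng = random.Random(0)

over_samples, over_labels = oversample(samples, labels, rng)
sub_samples, sub_labels = subsample(samples, labels, rng)
```

Because both strategies are random, repeating them with different seeds and averaging the resulting accuracies, as described above, reduces the dependence on which samples happen to be duplicated or discarded. Only the training set is re-balanced; the test set is left untouched.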

Conclusions
SVM and MLC are two typical classifiers commonly used in many engineering applications. Although there is a trend to combine MLC with SVM to provide a probabilistic output for SVM, the conditions under which the combined classifier works effectively need to be explored. In this paper, comprehensive results are presented to answer this question, using four different datasets. First of all, it is found that the combined classifier works under certain constraints, such as a linear SVM, a balanced dataset and near Gaussian-distributed data. When an RBF-kernelled SVM is used, the combined classifier may produce degraded results due to the inconsistency between the non-linear kernel in SVM and the linear nature of MLC. In addition, for a challenging dataset, balanced learning may improve the training results but not necessarily the testing results. The reason is that the combined SVM-MLC classifier relies on three assumptions, i.e. Gaussian distributed data, inter-class separability, and model consistency between training data and testing data. Although the third assumption is true in most cases, the precondition of separable Gaussian distributed data is a rather strict constraint and is rarely satisfied. As a result, this introduces a fundamental difficulty in combining these two classifiers. However, under certain circumstances, the combined classifier can indeed significantly improve the classification performance. It is worth noting that when more groups are introduced in modelling a given dataset, the efficacy can be severely degraded due to the inconsistency of statistical distribution between groups. Future work will focus on combining other classifiers such as neural networks for applications in medical imaging [31][32][33] and recognition and classification tasks [34][35].

Figure 1. Comparing training (top) and testing results (bottom) using linear SVM and the combined classifier for the four datasets under three different training ratios.

Figure 2. Comparing training (top) and testing results (bottom) using RBF kernelled SVM and the combined classifier for the four datasets under three different training ratios.