COMPARATIVE STUDY OF MACHINE LEARNING TECHNIQUES FOR SUPERVISED CLASSIFICATION OF BIOMEDICAL DATA

Several classification methods have been widely used in the literature for identification of diseases or differential diagnosis of various types of disorders. Classification methods such as support vector machines, random forests, AdaBoost, deep belief networks, K-nearest neighbors, linear discriminant analysis or the perceptron are probably the most popular ones. Even though these methods are frequently used, there is a lack of comparison between them that would help to find a better framework for classification. In this study, we compared the performance of the above-mentioned classification methods. Ten-fold cross-validation was used to calculate the accuracy and the Matthews correlation coefficient of the classifiers. In each case, the methods were applied to eight binary biomedical datasets. The same evaluation was also carried out in conjunction with a feature selection technique that passed only the hundred most relevant features to the classifier. Even though there is no single classification method that dominates in terms of performance, we found that some methods provide more consistent performance than others.


INTRODUCTION
With the expansion of computer methods in bioinformatics and other fields, researchers are more and more frequently faced with machine learning techniques. Machine learning (ML) technology has gained significant attention in the biomedical community, mainly because of its potential to improve the process of disease detection [24]. The most frequent task of any diagnostic system is determining, or attempting to determine, a disease or disorder based on the observation of some signal. Machine learning techniques used for medical diagnosis have to provide high prediction performance, transparency of diagnostic knowledge, and traceable decisions [26].
Recent years witnessed the development of a wide variety of classification algorithms, ranging from relatively simple methods for classification of linearly separable data, such as LDA, the perceptron algorithm [23] or the naive Bayes classifier [13], [1], to more sophisticated and complex methods such as support vector machines (SVM) [29] or deep learning [16]. Moreover, ongoing research produces further extensions of existing techniques and methods that are better fitted to particular problems [6], [19]. Unfortunately, there are not many direct comparisons of the classification methods, let alone for biomedical applications. An exhaustive comparison of SVM and random forests is given in [25]. Studies including more types of classifiers can be found in [18] or [22].
Here we compare eight state-of-the-art classifiers on eight biomedical datasets: Support Vector Machines (SVM), the AdaBoost classifier, Random Forests, Deep Belief Networks, K-Nearest Neighbours, the Naive Bayes classifier, Linear Discriminant Analysis and the perceptron classifier. We investigate whether there are one or more techniques that clearly dominate in terms of performance. In particular, we focus on two evaluation scenarios: first, when all available features are fed to the classifier, and second, when only the hundred features selected by an ensemble FS technique are considered for classification. This paper is organized as follows. Firstly, the datasets used in this study are described together with details on dataset preprocessing. We continue with a brief description of the classification algorithms that were applied for class prediction. Finally, classification results in terms of prediction accuracy and Matthews correlation coefficient are given. We conclude the paper with a short discussion.

DATA
The datasets used in this study can be freely downloaded from the internet or are available upon request from their authors. We evaluated four high-dimensional, small-sample-size datasets and four smaller biomedical datasets. All datasets are real-world datasets consisting of two classes. A basic overview of the datasets is provided in Tab. 1.
The B2006 [5] dataset is used for molecular classification of Crohn's disease and ulcerative colitis based on microarray data. We follow the approach taken in the original paper and pool Crohn's disease and ulcerative colitis together, resulting in one class.
C2006 [7] contains gene signatures of 104 subjects. The dataset is used for prediction of breast cancer.
The acute lymphoblastic leukemia (ALL) / acute myeloid leukemia (AML) dataset G1999 [10] is one of the first datasets used for molecular classification of cancers based on microarray studies. The set consists of 72 patients (47 ALL + 25 AML) and 7129 genes. G1999 is considered a two-class dataset obtained by merging ALL-T and ALL-B together. The last high-dimensional dataset, G2002 [11], contains data used for diagnosis of lung cancer.
Datasets D2013 [8] and T2014 [27] are new datasets for differential diagnosis of Parkinson's disease from handwriting and speech, respectively. The baseline classification accuracy of these two datasets is significantly lower than for the four microarray datasets, being below 80 % for D2013 and below 90 % for T2014.
Finally, the last two datasets, Z2014 [30] and K1988 [31], contain, in contrast to the previous datasets, more samples than features. The K1988 dataset is originally a four-class dataset; however, the third and fourth classes are composed of only six subjects, which is too few to have any statistical power. Therefore we left out these samples, reducing the size of the dataset from 148 to 142 samples. Z2014 represents a highly imbalanced dataset.
Our intention was to include various types of datasets that can be encountered in the area of biomedical or bioinformatics research and to examine the performance of classification algorithms on them.

Data Preprocessing and Feature Selection
The data were normalized before classification on a per-feature basis to have zero mean and a standard deviation of one.
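The per-feature normalization described above can be sketched as a simple z-score transform; the guard for constant features is our own addition, not something stated in the paper:

```python
import numpy as np

def normalize_per_feature(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard constant features against division by zero
    return (X - mu) / sigma

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Xn = normalize_per_feature(X)
```

In practice the mean and standard deviation should be estimated on the training folds only and then applied to the test fold, to avoid information leakage during cross-validation.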
The goal of feature selection is to identify the most relevant features or to remove noisy features in order to avoid potential degradation of the predictive power [12]. Here, we implement filter feature selection; this means that the FS process is applied prior to classification and only features that are evaluated as relevant are fed to the input of the classifier.
Instead of implementing a single FS algorithm, ensemble techniques apply several weak learners that contribute to the final decision. Ensemble FS is especially recommended for small-sample domains, since it is quite robust to over-fitting and provides stable solutions [28]. Extremely randomized trees were used as base learners in this case [9]. The number of features selected with FS was N_fs = 100, a value chosen in line with other works [14], [21].
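A minimal sketch of this ensemble FS step, assuming scikit-learn's ExtraTreesClassifier as the extremely-randomized-trees learner (the paper does not state the exact implementation or number of trees; 200 estimators here is an illustrative choice):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def top_features(X, y, n_fs=100, seed=0):
    """Rank features by extremely-randomized-trees importance and keep the top n_fs."""
    forest = ExtraTreesClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    return np.argsort(forest.feature_importances_)[::-1][:n_fs]

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))          # high-dimensional, small-sample-size toy data
y = (X[:, 0] > 0).astype(int)           # only feature 0 carries the label
selected = top_features(X, y, n_fs=100)
```

On this toy data the informative feature 0 should land among the selected indices; the classifier is then trained on `X[:, selected]` only.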

SUPERVISED CLASSIFICATION
Based on the types of training data available to the learner, one can distinguish between several learning or classification scenarios. We consider supervised learning, where the learner receives a set of labeled examples as training data and makes predictions for all unseen points. However, in practice the amount of labeled data is relatively small, and it is inconvenient to set aside a validation sample since this would leave an insufficient amount of training data. Instead, an approach known as n-fold cross-validation is used [20]. N-fold cross-validation consists of randomly partitioning the dataset into n subsamples (folds). Then, n − 1 folds are used for training and the n-th fold is used as the testing dataset.
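The n-fold scheme above can be sketched with scikit-learn's stratified splitter (stratification is what the experiments later use; the perceptron and synthetic data here are only placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_acc = []
for train_idx, test_idx in skf.split(X, y):
    # train on n-1 folds, evaluate on the held-out fold
    clf = Perceptron(random_state=0).fit(X[train_idx], y[train_idx])
    fold_acc.append(clf.score(X[test_idx], y[test_idx]))

mean_acc = float(np.mean(fold_acc))
```

Averaging the per-fold scores yields the cross-validated performance estimate reported in the result tables.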
We consider eight state-of-the-art classifiers: Support Vector Machines (SVM), the AdaBoost classifier, Random Forests, Deep Belief Networks, K-Nearest Neighbours, the Naive Bayes classifier, Linear Discriminant Analysis and the perceptron classifier.

Support Vector Machines
The underlying idea of SVM classifiers is to calculate a maximal-margin hyperplane separating two classes of the data. To learn non-linearly separable functions, the data are implicitly mapped to a higher-dimensional space by means of a kernel function. New samples are classified according to the side of the hyperplane they belong to. We used the Radial Basis Function (RBF) kernel [29], defined as

K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²),

where γ controls the width of the RBF function.
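An RBF-kernel SVM with a hyper-parameter search can be sketched as follows; the C/γ candidate values are hypothetical, since the paper does not list its SVM grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=10, random_state=1)

# Hypothetical grid; gamma corresponds to the RBF width parameter above.
search = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]},
                      cv=5)
search.fit(X, y)
best = search.best_params_
```

`best` then holds the C/γ pair with the highest cross-validated accuracy.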

AdaBoost
AdaBoost belongs to the important family of ensemble methods known as boosting. The key idea behind boosting techniques is to combine weak classifiers in order to build a strong learner. AdaBoost is an iterative boosting algorithm constructing a strong classifier as a linear combination of weak classifiers, each performing at least above chance level. As weak classifiers we used decision trees [4]. Similarly to SVM, we searched a grid of possible classifier settings to find optimal performance. The grid was determined by the product of the sets n_e = [50, 100, 200] (maximum number of estimators at which boosting is terminated), n_split = [1, 2, 3, 5, 10] (the number of features to consider when looking for the best split) and n_depth = [1, 2, 3, 5, 10] (the maximum depth of the tree).
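The grid search over AdaBoost settings can be sketched as below; for brevity the loop uses a reduced subset of the full grid stated in the text, and the data are synthetic:

```python
from itertools import product
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=2)

# Reduced grid for brevity; the full grid in the text is
# n_e = [50, 100, 200], n_split = [1, 2, 3, 5, 10], n_depth = [1, 2, 3, 5, 10].
best_score, best_cfg = -1.0, None
for n_e, n_split, n_depth in product([50], [1, 3], [1, 3]):
    tree = DecisionTreeClassifier(max_features=n_split, max_depth=n_depth,
                                  random_state=0)
    clf = AdaBoostClassifier(tree, n_estimators=n_e, random_state=0)
    score = cross_val_score(clf, X, y, cv=3).mean()
    if score > best_score:
        best_score, best_cfg = score, (n_e, n_split, n_depth)
```

The configuration with the highest cross-validated score is then retrained on the full training data.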

Random Forests
A drawback associated with decision tree classifiers is their high variance. To improve stability, the decision forest methodology was proposed [17] and later further improved by Breiman [3], resulting in the integrated random forest classifier. The random forest classifier is an ensemble technique that uses an ensemble of unpruned decision trees, each of which is built on a bootstrap sample of the training data using a randomly selected subset of variables. We considered different parameter configurations for the values n_tree = [200, 500, 1000] (number of trees to build), m_depth = [1, 2, 3, 5, 10] (the maximum depth of the tree) and m_split = [1, 2, 3, 5, 10] (minimum number of samples required to split an internal node).
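A random forest grid search could look as follows; the grid is deliberately smaller than the one in the text, and note that scikit-learn requires min_samples_split ≥ 2, so the value 1 from the paper's grid would need to be dropped or mapped in this implementation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, n_features=10, random_state=3)

# Reduced grid for brevity; the text uses n_tree = [200, 500, 1000],
# m_depth = [1, 2, 3, 5, 10] and m_split = [1, 2, 3, 5, 10].
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100],
                       "max_depth": [2, 5],
                       "min_samples_split": [2, 5]},
                      cv=3)
search.fit(X, y)
best = search.best_params_
```

Each tree inside the forest is grown on a bootstrap sample with a random feature subset considered at every split, which is what reduces the variance of the single-tree model.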

Deep Belief Network
It has been shown that deep architectures have the potential to represent functions better than shallow ones [15], [16]. Deep belief networks (DBN) are formed by stacking restricted Boltzmann machines (RBM) on top of each other and training them in a greedy, layer-wise manner. This training strategy holds great promise as a principled solution to the problem of training deep networks. Upper layers of a DBN represent more abstract concepts that explain the input observation, whereas lower layers extract low-level features from the data and learn simpler concepts.
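scikit-learn has no full DBN implementation, but the greedy layer-wise idea can be roughly illustrated by stacking BernoulliRBM layers in a Pipeline (each RBM is trained on the previous layer's output) with a logistic-regression readout; this sketch omits the supervised fine-tuning a real DBN would use, and all layer sizes are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.random((80, 64))            # RBMs expect inputs scaled to [0, 1]
y = (X[:, 0] > 0.5).astype(int)

# Greedy layer-wise stack: Pipeline fits each RBM on the transformed output
# of the layer below, mimicking unsupervised DBN pre-training.
dbn = Pipeline([
    ("rbm1", BernoulliRBM(n_components=32, n_iter=10, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=16, n_iter=10, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
dbn.fit(X, y)
preds = dbn.predict(X)
```

The lower RBM learns low-level features of the raw input while the upper RBM composes them into more abstract representations, matching the intuition described above.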

K-Nearest Neighbors
In the K-NN algorithm, the K nearest samples in the reference set are found and a majority vote is taken among the classes of these K samples. The goal is to determine the true class of an unseen test pattern by finding its nearest neighbors within a hyper-sphere of a predefined radius. For the K-NN classifier, the optimal parameters were searched through the grid of values K = [3, 5, 10, 20, 50] and n_leaf = [10, 30, 50, 100], where K is the number of neighbors and n_leaf is the leaf size.
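The stated K-NN grid maps directly onto scikit-learn's parameters (n_neighbors for K, leaf_size for n_leaf); the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=120, n_features=10, random_state=4)

# Grid from the text: K neighbours and the search-tree leaf size.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [3, 5, 10, 20, 50],
                       "leaf_size": [10, 30, 50, 100]},
                      cv=5)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```

Note that leaf_size only affects the speed of the neighbor search, not the prediction itself, so the accuracy-driven choice effectively concerns K.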

Naive Bayes Classifier
A naive Bayes classifier is a relatively simple probabilistic classifier applying Bayes' theorem with a strong independence assumption. Basically, it assumes that the value of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. Even though very simple, it is frequently used and achieves satisfying results in many classification tasks [13]. We used the Gaussian Naive Bayes algorithm, where the likelihood of the features is assumed to be Gaussian.
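Gaussian naive Bayes needs no hyper-parameter grid, so its use reduces to a plain fit/predict cycle (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=100, n_features=10, random_state=5)

# Per class, a Gaussian likelihood is fitted independently to each feature.
nb = GaussianNB().fit(X, y)
train_acc = nb.score(X, y)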

Linear Discriminant Analysis
Using linear discrimination, we assume that samples of a class are linearly separable from instances of the other classes. The LDA classifier is used frequently due to its simplicity: it does not have high computational requirements and the linear model is easy to understand. The final output is a weighted sum of the input features. The magnitude of a feature weight shows the importance of the particular feature, and its sign indicates whether the effect is positive or negative.
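The weight interpretation described above can be read directly off the fitted model's coefficients; a minimal sketch with synthetic two-class data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=100, n_features=10, random_state=6)

lda = LinearDiscriminantAnalysis().fit(X, y)
weights = lda.coef_.ravel()        # one weight per feature for two classes
# |weight| reflects feature importance; the sign gives the direction of the effect.
top_feature = int(np.argmax(np.abs(weights)))
```

Inspecting `weights` in this way is what makes the linear model easy to interpret compared to the ensemble methods.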

Perceptron Learning Algorithm
The perceptron is another algorithm for learning feature weights that tries to find a linear decision boundary. In fact, if the data are linearly separable, i.e. there exists some hyperplane that puts all positive samples on one side and all negative samples on the other, then the perceptron will converge to a weight vector separating the data [23].
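The convergence guarantee can be demonstrated on data that are separable by construction (the margin injected below is an artificial device for the demonstration, not part of the paper's setup):

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
X[:, 0] += np.where(y == 1, 0.5, -0.5)   # enforce a clear margin -> linearly separable

clf = Perceptron(random_state=0).fit(X, y)
train_acc = clf.score(X, y)
```

Because a separating hyperplane exists, the perceptron updates stop making mistakes and the training accuracy reaches (essentially) 100 %; on non-separable data no such guarantee holds.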

EXPERIMENTAL RESULTS
Classifier validation was conducted using stratified ten-fold cross-validation. The process was repeated a total of five times, where in each repetition the original dataset was randomly permuted prior to splitting into training and testing subsets. As evaluation metrics we used the conventional classification accuracy and the Matthews correlation coefficient (MCC). We decided to use the MCC since classification accuracy alone is not a suitable measure for imbalanced datasets. The MCC takes into account true and false positives and negatives and is generally considered a balanced measure [2]. The MCC is defined as

MCC = (TP · TN − FP · FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives. The MCC lies within the range ⟨−1, 1⟩; in the tables it is reported multiplied by 100 %. The larger the MCC value, the better the classifier performance.
Firstly, we consider the case where all available features are taken into account during classification. Results for classification accuracy and MCC are presented in Tab. 2 and Tab. 3, respectively. One interesting observation is that SVM is clearly outperformed by the other methods for classification of the high-dimensional datasets B2006, C2006, G1999 and G2002, providing significantly worse results. When comparing the classification accuracy of the selected methods, there is no single method that clearly dominates; however, the highest accuracy is achieved using the Ada, RF and DBN classifiers. This is quite expected, since Ada, RF and DBN are methods that can also cope with data that are hard to separate by a linear function alone. The results are different when MCC is considered as the performance measure, even though the MCC values of the studied methods do not differ significantly.
The feature selection was applied prior to classification in the second evaluation scenario. Only the hundred most relevant features are included in the classification process. Again, classification accuracy and MCC are given in Tab. 4 and Tab. 5, respectively. As can be seen, the best results are again provided by the more complex methods such as Ada, RF or DBN. In contrast to the previous case, where all features were fed to the classifier, SVM performs much better and scores the best results on three datasets. This indicates that even if SVM is perceived as robust to overfitting in classification of high-dimensional data, our results suggest the opposite.
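The MCC formula above translates directly into code; the zero-denominator convention below (returning 0 when any marginal is empty) is a common choice rather than something the paper specifies:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when any marginal total is zero
    return (tp * tn - fp * fn) / denom

# A perfect classifier yields 1, chance-level yields 0, total inversion yields -1.
perfect = mcc(10, 10, 0, 0)
chance = mcc(5, 5, 5, 5)
inverted = mcc(0, 0, 10, 10)
```

Unlike accuracy, this score stays near zero for a trivial majority-class predictor on an imbalanced dataset, which is precisely why it complements accuracy in the tables.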

DISCUSSION
Eight classifiers were compared by means of classification accuracy and MCC. When the number of features is significantly higher than the number of samples, the best results are achieved by the AdaBoost classifier. However, the performance of RF and DBN is comparable to that of AdaBoost. Similarly, when the number of features is reduced to one hundred, RF, SVM, Ada and DBN provide the most consistent performance.
The other studied techniques (KNN, NB, LDA, perceptron), which are relatively simpler, can also achieve competitive performance. In fact, the perceptron or LDA had the highest scores for some datasets. However, the performance of these methods is not so consistent, and on some datasets they are clearly outperformed by Ada or SVM.
When comparing the MCC, which is the more balanced measure, of classifiers with and without feature selection, there is no significant improvement when employing feature selection. However, it is still true that utilization of FS can reduce the computational complexity of the subsequent classifier and help to better interpret the results by choosing a small number of relevant features.

Table 2
Classification accuracy of different ML methods. All features.

Table 3
MCC of different ML methods. All features.

Table 4
Classification accuracy of different ML methods. 100 best features according to ensemble FS.

Table 5
MCC of different ML methods. 100 best features according to ensemble FS.