Ensemble of Rotation Trees for Imbalanced Medical Datasets

Medical datasets are often predominantly composed of “normal” examples with only a small percentage of “abnormal” ones, and correctly recognizing the abnormal examples is of great practical importance. However, conventional classification learning methods pursue high accuracy by assuming that all classes contain a similar number of examples, which leads to abnormal class examples being ignored and misclassified as normal ones. In this paper, we propose a simple but effective ensemble method called ensemble of rotation trees (ERT) to handle this problem in imbalanced medical datasets. ERT learns an ensemble through the following four stages: (1) undersampling subsets from the normal class, (2) obtaining new balanced training sets by combining each subset with the abnormal class, (3) inducing a rotation matrix on a randomly sampled subset of each new balanced set, and (4) learning a decision tree on each balanced training set in the space defined by its rotation matrix. Here, the rotation matrix mainly serves to improve the diversity between ensemble members, and the undersampling technique aims to improve the performance of the learned models on the abnormal class. Experimental results show that, compared with other state-of-the-art methods, ERT achieves significantly better performance on imbalanced medical datasets.


Introduction
In the real world, medical data often exhibit class imbalance, where the number of examples in one class is larger than in the other classes [1,2]. For two classes, the examples are usually categorized into normal (negative or majority) and abnormal (positive or minority) classes. The cost of misclassifying abnormal class examples is often higher than that of misclassifying normal class ones. For instance, the "mammography" dataset contains 10,923 "healthy" patients and 260 "cancerous" patients, and recognizing the "cancerous" patients is the clinically important task. However, traditional learning methods try to achieve high accuracy by assuming that all classes contain a similar number of examples, which causes the abnormal class examples to be overlooked and incorrectly classified as normal [3,4]. Therefore, many approaches have been proposed to tackle the problem.
Sampling, including undersampling [5], oversampling [6], and SMOTE [7], is one of the most popular ways to address class imbalance in medical datasets. Undersampling learns models on a rebalanced dataset obtained by sampling a subset of the normal class, whereas oversampling rebalances the training dataset by repeating abnormal class examples [1]. SMOTE [7] is a variant of oversampling that generates new synthetic abnormal class examples by randomly interpolating between pairs of nearest neighbors in the abnormal class.
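The two basic rebalancing strategies can be illustrated with a minimal Python sketch. The function names and the toy data are hypothetical and serve only to show the mechanics; they are not taken from the cited implementations.

```python
import random

def random_undersample(normal, abnormal, rng):
    """Keep all abnormal examples plus an equally sized random subset of normal ones."""
    kept = rng.sample(normal, k=len(abnormal))  # sampling without replacement
    return kept + abnormal

def random_oversample(normal, abnormal, rng):
    """Keep all examples and repeat random abnormal ones until both classes match in size."""
    extra = [rng.choice(abnormal) for _ in range(len(normal) - len(abnormal))]
    return normal + abnormal + extra

rng = random.Random(0)
normal = [("normal", i) for i in range(100)]    # toy majority class
abnormal = [("abnormal", i) for i in range(8)]  # toy minority class

under = random_undersample(normal, abnormal, rng)
over = random_oversample(normal, abnormal, rng)
print(len(under), len(over))  # 16 200
```

Undersampling discards 92 of the 100 normal examples here, which is why ensemble methods that draw several different subsets are attractive: each member sees a different part of the normal class.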
Ensemble learning, which has often been used to solve challenging problems for which traditional classification models are insufficient, such as image detection [8][9][10][11], is another popular way to deal with imbalanced datasets. The proposed class imbalance-oriented ensemble learning methods can be mainly grouped into three categories: (1) bagging-, (2) boosting-, and (3) hybrid-based approaches. Both bagging- and boosting-based approaches often apply a sampling technique in the ensemble learning process, such as OverBagging, UnderBagging, UnderOverBagging [12], SMOTEBoost [13], and RUSBoost [14]. The former three methods combine bagging with a sampling technique, and the latter two embed a sampling technique into the process of learning each member. EasyEnsemble and BalanceCascade are two specific examples of hybrid-based approaches [5]. EasyEnsemble undersamples several subsets from the normal class, trains a model on each of them, and combines the outputs of those models. The learning process of BalanceCascade is similar to that of EasyEnsemble, except that at each training step the normal class examples that are correctly classified by the current trained models are removed from further consideration.
In this paper, we propose a novel ensemble method called ensemble of rotation trees (ERT) to build accurate and diverse classifiers for class-imbalanced medical datasets. The main heuristics consist of (1) undersampling subsets from the normal class, (2) obtaining new balanced training sets by combining each subset with the abnormal class, (3) inducing a rotation matrix on a randomly sampled subset of each new balanced set, and (4) learning a decision tree on each balanced training set in the space defined by its rotation matrix. Here, the rotation matrix serves to improve ensemble diversity, and the undersampling technique mainly aims to improve the performance of the learned models on the abnormal class. The decision tree is chosen as the base model because it is sensitive to the rotation of the feature axes, hence the name "rotation trees". Compared with other state-of-the-art classification methods, ERT shows much better performance on class-imbalanced medical datasets.
This paper extends our previous work [15] in the following respects. First, it empirically compares a variety of ensemble methods on medical datasets, which has led to new conclusions, such as the finding that the proposed ensemble significantly outperforms other ensemble methods on imbalanced medical datasets. Second, the comparison is based on more medical datasets. Finally, this paper includes more discussion of why the proposed method works.
The rest of this paper is organized as follows: after presenting related work in Section 2, Section 3 describes the proposed learning method for medical datasets, Section 4 presents the experimental results, and finally, Section 5 concludes this work.

Strategies for Imbalanced Medical Datasets
In medical data analysis, examples are often categorized into an abnormal (minority or positive) group and a normal (majority or negative) group, and the cost of misclassifying an abnormal example as a normal one is very high. Take the "mammography" dataset as an example. This dataset contains 10,923 "healthy" patients and 260 "cancerous" patients, and a naive approach that classifies every example as "healthy" would achieve an accuracy of about 97.68%. Although the naive approach achieves high accuracy, it misclassifies all the "cancerous" patients.
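The figures in this example can be checked with a few lines of Python. The only inputs are the two class sizes quoted above; the variable names are illustrative.

```python
healthy, cancerous = 10_923, 260

# A classifier that always predicts "healthy" gets every healthy patient right
# and every cancerous patient wrong.
accuracy = healthy / (healthy + cancerous)
recall_abnormal = 0 / cancerous

print(f"accuracy = {accuracy:.2%}, abnormal-class recall = {recall_abnormal:.0%}")
# accuracy = 97.68%, abnormal-class recall = 0%
```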
Many techniques have been proposed to handle the imbalance problem in medical datasets, and the efforts mainly focus on data manipulation methods and ensemble learning methods.
Data manipulation methods rebalance imbalanced medical data by changing the data distribution so that traditional methods are biased toward the abnormal class. Reported studies can be further subdivided into two types: resampling and weighting the data space. Resampling techniques aim to alleviate the effect of the class-imbalanced distribution by sampling the data space to rebalance the corresponding imbalanced dataset. Commonly used sampling techniques fall into three categories: oversampling methods, undersampling methods, and hybrid methods. Oversampling techniques create new minority class examples to reduce the harm of the imbalance problem. Randomly duplicating minority samples and the synthetic minority oversampling technique (SMOTE) [7] are the two most popular examples. Undersampling techniques, such as random undersampling (RUS) [5], a simple yet effective method, reduce the harm of the class-imbalanced distribution by removing examples of the majority class. The hybrid method combines oversampling and undersampling. Strategies that weight the data space use information about misclassification costs to adjust the training set distribution; examples include cost-sensitive methods [16] and an ensemble of SVMs with asymmetric misclassification costs [1].
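The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration rather than the reference algorithm of [7]: the function name `smote_like`, the neighbourhood size `k`, and the toy points are assumptions made for this example.

```python
import math
import random

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points, each on a segment between a minority
    point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        z = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to z
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, z)))
    return synthetic

rng = random.Random(1)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5), (1.8, 1.6)]
new_points = smote_like(minority, n_new=10, k=3, rng=rng)
print(len(new_points))  # 10
```

Because every synthetic point is a convex combination of two existing minority points, the new examples stay inside the region already occupied by the abnormal class instead of being exact copies.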
Ensemble learning generally outperforms single classifiers in class-imbalanced problems [17], and decision trees are popular choices for the base classifiers in an ensemble [18]. According to Galar et al. [19], ensembles for the class-imbalanced problem can be grouped into three categories: (1) bagging-, (2) boosting-, and (3) hybrid-based approaches. Bagging-based ensemble methods, such as UnderBagging, OverBagging, and UnderOverBagging [12], integrate bagging with a resampling technique to improve the model's performance on the class-imbalanced problem. UnderBagging uses undersampling to preprocess the training set before learning each member. In contrast, OverBagging uses oversampling instead of undersampling to preprocess the training set, and UnderOverBagging uses both oversampling and undersampling to adjust the data distribution for training individual members. Boosting-based ensembles embed sampling techniques into the learning process of boosting algorithms: at every iteration, they alter the weight distribution used to train the next classifier so that it is biased toward the abnormal class. For example, SMOTEBoost [13] uses SMOTE [7] to generate synthetic examples of the abnormal class to alter the data distribution, and RUSBoost [14], which works similarly to SMOTEBoost, uses RUS [5] to remove examples from the normal class before training base classifiers. Hybrid-based ensembles, such as EasyEnsemble and BalanceCascade [5], combine bagging with boosting (and also with a sampling technique). Both EasyEnsemble and BalanceCascade use bagging as the main ensemble learning method and AdaBoost as the base classifier learning method. The difference between them is the way in which they treat the normal class examples after each iteration. EasyEnsemble does not perform any operation after each AdaBoost iteration. In contrast, after learning an AdaBoost classifier, BalanceCascade removes from further consideration the normal class examples that are correctly classified with high confidence.
Rotation forest, an ensemble learning approach, often performs better than bagging and boosting because it builds accurate and diverse classifiers by using feature subsets and rotating the feature space [20]. This method has also been applied to imbalanced problems. For example, Su et al. [21] employed a class imbalance-oriented learner, namely, the Hellinger distance decision tree (HDDT), as the base classifier of rotation forest to handle the class-imbalanced problem, and each base classifier is constructed on the whole training set. Hosseinzadeh and Eftekharia [22] learned rotation forest on data obtained by preprocessing the training set with the synthetic minority oversampling technique (SMOTE) and fuzzy clustering. Fang et al. [23] learned the rotation matrices on datasets obtained by random undersampling or oversampling (SMOTE) of the training set, and each base classifier is constructed on the whole training set.
This paper proposes a novel ensemble method for imbalanced medical datasets. Unlike bagging-, boosting-, and hybrid-based approaches, the proposed method learns each base classifier in the space defined by a rotation matrix. Unlike conventional rotation forest-based approaches, the proposed method learns both the rotation matrices and the base classifiers on diverse balanced datasets instead of on imbalanced data or on the same data. More details are discussed in Section 3.

Ensemble of Rotation Trees for Imbalanced Medical Datasets
3.1. Ensemble of Rotation Trees. The class imbalance problem often exists in medical datasets and causes traditional classifier learning methods to perform poorly. This section proposes a novel ensemble method called ensemble of rotation trees (ERT) to handle imbalanced medical datasets. ERT learns an ensemble through the following two steps: (1) sampling subsets from the normal class and learning a rotation matrix on each subset and (2) training a tree on the balanced dataset obtained by combining each subset with the abnormal class set, in the new feature space defined by the current rotation matrix. Let $x = [x_1, x_2, \ldots, x_n]^T$ be an example of a medical dataset described by $n$ features, and let $X_a$ be the abnormal class set in the form of an $N_a \times n$ matrix and $X_n$ be the normal class set in the form of an $N_n \times n$ matrix. Denote by $h \in H$ a classifier in the ensemble $H$ and by $F$ the feature set. As in bagging, all classifiers can be trained in parallel. ERT constructs the current classifier $h \in H$ using the following steps:
(i) Construct a balanced training set $D = D_n \cup X_a$, where $D_n$ is a subset of $X_n$ obtained by randomly undersampling $X_n$ without replacement, and $|D_n| = |X_a|$.
(ii) Split $F$ randomly into $n/L$ disjoint subsets $\{F_j \mid j = 1, 2, \ldots, n/L\}$. The disjoint subsets are chosen to maximize the chance of high diversity.
(iii) For each $F_j$, draw a subset of size 50 percent from $D$. Run the feature extraction method on $F_j$ and this subset to obtain the feature projection components $a_j^{(1)}, a_j^{(2)}, \ldots, a_j^{(L)}$, each of size $L \times 1$.
(iv) Organize the components in a sparse "rotation" matrix $R$, in which the components obtained for $F_j$ form the $j$-th block on the diagonal and all other entries are zero:

$$R = \begin{bmatrix} [a_1^{(1)}, \ldots, a_1^{(L)}] & 0 & \cdots & 0 \\ 0 & [a_2^{(1)}, \ldots, a_2^{(L)}] & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & [a_{n/L}^{(1)}, \ldots, a_{n/L}^{(L)}] \end{bmatrix}$$

Pseudocode 1 shows the pseudocode of the ERT algorithm. The differences from rotation forest-based class-imbalanced methods (refer to Section 2) are mainly reflected in lines 4~5 and lines 14~15. Lines 4~5 construct a new balanced training set $D_i$ by undersampling from the normal set $X_n$ a subset $D_n$ whose size equals that of the abnormal set $X_a$. Lines 14~15 learn the base classifier $h_i$ on the balanced data $D_i$ (obtained in lines 4~5) in the space defined by $R_i$: $D_i$ is projected using $R_i$ to obtain a new balanced training set $D_{i,\text{train}} = D_i R_i$. Therefore, both the rotation matrix $R_i$ and the base classifier $h_i$ are learned from a balanced dataset. Besides, unlike conventional rotation forest-based methods, which select and eliminate a random nonempty subset of classes, ERT does not manipulate classes because only two classes are considered in this paper.
In this paper, we chose decision trees as the base classifiers because they are sensitive to the rotation of the feature axes and can still be very accurate. The feature extraction is based on principal component analysis (PCA) [24], following rotation forest [20]. The running time of ERT is mainly dominated by constructing decision trees, running PCA, and rotating the datasets. Therefore, the computational complexity of ERT is the same as that of rotation forest [20].
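The construction of one balanced, rotated training set can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the data are synthetic, $n$ is assumed to be a multiple of $L$, and PCA is carried out by an eigendecomposition of the covariance matrix. Training the decision tree on the projected data is omitted.

```python
import numpy as np

def build_rotation_matrix(D, L, rng):
    """Learn an n x n sparse rotation matrix from data D (rows = examples)."""
    n_rows, n = D.shape
    assert n % L == 0, "this sketch assumes n is a multiple of L"
    features = rng.permutation(n)  # random split of F into n/L disjoint subsets
    R = np.zeros((n, n))
    for start in range(0, n, L):
        F_j = features[start:start + L]
        # Bootstrap sample with 50% of the rows, restricted to the features in F_j.
        rows = rng.integers(0, n_rows, size=max(2, n_rows // 2))
        sub = D[np.ix_(rows, F_j)]
        # PCA: the eigenvectors of the covariance matrix are the components.
        cov = np.atleast_2d(np.cov(sub, rowvar=False))
        _, components = np.linalg.eigh(cov)
        # Place the block at the rows and columns of F_j, so that the columns
        # of R stay aligned with the original feature order.
        R[np.ix_(F_j, F_j)] = components
    return R

def balanced_rotated_set(X_a, X_n, L, rng):
    """Undersample X_n to |X_a|, learn R on the balanced set, and project it."""
    keep = rng.choice(len(X_n), size=len(X_a), replace=False)
    D = np.vstack([X_n[keep], X_a])
    y = np.array([0] * len(X_a) + [1] * len(X_a))  # 0 = normal, 1 = abnormal
    R = build_rotation_matrix(D, L, rng)
    return D @ R, y, R

rng = np.random.default_rng(0)
X_n = rng.normal(size=(200, 6))          # synthetic "normal" examples
X_a = rng.normal(loc=1.0, size=(20, 6))  # synthetic "abnormal" examples
D_train, y, R = balanced_rotated_set(X_a, X_n, L=3, rng=rng)
print(D_train.shape, np.allclose(R.T @ R, np.eye(6)))  # (40, 6) True
```

Each ensemble member would call `balanced_rotated_set` with fresh randomness and fit its own tree on `D_train` and `y`; at prediction time a test example is multiplied by the same `R` before being passed to that tree. Because every block is an orthonormal eigenvector matrix, `R` is orthogonal and the projection only rotates the feature axes.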
3.2. Discussion. Two issues should be addressed by an ensemble for imbalanced medical datasets: the performance of individual ensemble members on the abnormal class and the diversity between the members. Undersampling is applied to the normal class so that individual base classifiers focus more on the abnormal class. Specifically, ERT (the proposed method) undersamples the normal class set so that the learned rotation matrices capture more of the distribution of the abnormal class set, which enhances the performance of individual classifiers on the abnormal class (line 4, Pseudocode 1). Besides, ERT learns each individual classifier on a rebalanced dataset obtained by undersampling the training set (line 15, Pseudocode 1).
Diversity is a major factor in the success of an ensemble, and the intended diversity in the proposed model comes from the following two sources: (1) the undersampling technique used to sample the normal class (refer to line 4 in Pseudocode 1) and (2) the rotation matrix learned for each ensemble member. The random split of the feature set alone yields only a limited number of distinct rotations. For example, the probability that all classifiers in an ensemble of 50 members are different for n = 9 is less than 0.01, and thus, an extra randomization of the ensemble is meaningful, especially for balanced datasets. Following rotation forest [20], we draw a bootstrap sample of objects and apply PCA to this subset.
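The 0.01 figure can be reproduced with a short calculation, under an assumption that is not stated in this section: the n = 9 features are split into three subsets of three features each, and every ensemble member draws its split uniformly at random. The function names below are illustrative.

```python
from math import factorial

def distinct_partitions(n, k):
    """Number of ways to split n features into k unordered subsets of size n // k."""
    m = n // k
    return factorial(n) // (factorial(m) ** k * factorial(k))

def prob_all_different(n_choices, n_members):
    """Probability that n_members independent uniform draws are all distinct."""
    p = 1.0
    for i in range(n_members):
        p *= (n_choices - i) / n_choices
    return p

partitions = distinct_partitions(n=9, k=3)        # assumed split: 3 subsets of 3 features
p = prob_all_different(partitions, n_members=50)
print(partitions, p < 0.01)  # 280 True
```

With only 280 possible feature splits, an ensemble of 50 members almost certainly repeats at least one split, which is why the additional bootstrap sampling step is useful.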

Evaluation Metrics.
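Evaluation metrics are essential for assessing the effectiveness of an algorithm, and traditionally, accuracy is the most frequently used one. The examples classified by a classifier can be grouped into four categories as shown in Table 1: abnormal examples classified as abnormal (TA), normal examples classified as normal (TN), normal examples classified as abnormal (FA), and abnormal examples classified as normal (FN). Accuracy is thus defined as

$$\text{Accuracy} = \frac{TA + TN}{TA + TN + FA + FN}. \tag{4}$$

However, accuracy is inadequate for imbalanced medical problems, and other metrics have been proposed, including precision, recall, f-measure, g-mean, and AUC. Precision and recall are, respectively, defined as

$$\text{Precision} = \frac{TA}{TA + FA}, \qquad \text{Recall} = \frac{TA}{TA + FN}. \tag{5}$$

F-measure is a harmonic mean of recall and precision. Specifically, f-measure is defined as

$$\text{F-measure} = \frac{(1 + \delta^2) \times \text{Recall} \times \text{Precision}}{\delta^2 \times \text{Precision} + \text{Recall}}, \tag{6}$$

where $\delta$, often set to 1, is a coefficient that adjusts the relative importance of precision versus recall. Like f-measure, g-mean is another metric that considers both the normal class and the abnormal class. Specifically, g-mean measures the balanced performance of a classifier using the geometric mean of the recall of the abnormal class and that of the normal class. Formally, g-mean is as follows:

$$\text{G-mean} = \sqrt{\frac{TA}{TA + FN} \times \frac{TN}{TN + FA}}. \tag{7}$$

Besides, AUC is a commonly used measure to evaluate models' performance. According to [25], AUC can be estimated by

$$\text{AUC} = \frac{1}{2}\left(\frac{TA}{TA + FN} + \frac{TN}{TN + FA}\right). \tag{8}$$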
In this paper, we employ recall, f-measure, g-mean, and AUC to evaluate the classification performance on imbalanced datasets.
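As an illustration, the metrics above can be computed from the four counts of Table 1 with a small helper. The sketch assumes the standard definitions of precision, recall, f-measure, and g-mean; the function name and the example counts are hypothetical, and the helper does not guard against empty classes.

```python
import math

def imbalance_metrics(ta, fn, tn, fa, delta=1.0):
    """Metrics from counts: ta/fn are abnormal examples classified
    correctly/incorrectly, tn/fa are normal examples classified
    correctly/incorrectly."""
    recall = ta / (ta + fn)          # recall of the abnormal class
    specificity = tn / (tn + fa)     # recall of the normal class
    precision = ta / (ta + fa)
    f_measure = ((1 + delta**2) * recall * precision
                 / (delta**2 * precision + recall))
    return {
        "accuracy": (ta + tn) / (ta + tn + fa + fn),
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "g_mean": math.sqrt(recall * specificity),
        "auc": (recall + specificity) / 2,
    }

# Hypothetical confusion counts for 50 abnormal and 950 normal examples.
m = imbalance_metrics(ta=40, fn=10, tn=900, fa=50)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

In this example the accuracy is 0.94 while the f-measure is only about 0.57, which shows how accuracy can hide weak performance on the abnormal class.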

Pseudocode 1: The ERT algorithm.
Training:
Input: $X_a$, the abnormal set; $X_n$, the normal set; $M$, the number of classifiers in the ensemble
Output: the ensemble $H$ with $M$ classifiers
Begin:
1. $i = 0$;
2. $H = \emptyset$;
3. repeat
4.   sample a subset $D_n$ from $X_n$, $|D_n| = |X_a|$;
5.   $D_i = D_n \cup X_a$; // balanced dataset
6.   Split $F$ into subsets: $F_{i,j}$ for $j = 1, \ldots, n/L$;
7.   $j = 0$;
8.   repeat
9.     Let $D_{i,j}$ be the dataset of $D_i$ restricted to the features in $F_{i,j}$;
10.    Select a bootstrap sample $D'_{i,j}$ from $D_{i,j}$ of size 50% of the number of objects in $D_{i,j}$, and denote it as the new set;
11.    Apply PCA on $F_{i,j}$ and $D'_{i,j}$ to obtain the coefficients in a matrix $R_{i,j}$;
12.  until $j = n/L$
13.  Arrange the $R_{i,j}$ in a rotation matrix $R_i$ as in equation (1);
14.  $D_{i,\text{train}} = D_i R_i$;
15.  learn the base classifier $h_i$ on $D_{i,\text{train}}$;

Datasets and Experimental Setup.
Eight medical datasets are selected in this paper. All of them are two-class imbalanced medical datasets [26]. The imbalance degree of these datasets varies from 0.061 (highly imbalanced) to 0.349 (only slightly imbalanced), where the imbalance degree is defined as the ratio of the size of the abnormal class to that of the normal class. The details of the datasets are shown in Table 2, where #Degree is the imbalance degree, #Size is the size of the dataset, and #Attrs is the number of attributes. A 10-fold cross-validation [27] is performed to test model performance: each dataset is randomly divided into ten folds. For each fold, the other nine folds are used to train a model, and the current fold is used to test it. We repeat the 10-fold cross-validation ten times, and therefore, 100 models are constructed for each dataset.
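The evaluation protocol (ten repetitions of 10-fold cross-validation, giving 100 train/test splits per dataset) and the imbalance degree can be sketched as follows. The helper names are hypothetical, the dataset size and class counts are made up for illustration, and the folds are drawn by plain random shuffling because the text does not say whether stratified folds were used.

```python
import random

def imbalance_degree(n_abnormal, n_normal):
    """Ratio of the abnormal class size to the normal class size."""
    return n_abnormal / n_normal

def repeated_kfold(n_examples, k=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for `repeats` runs of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_examples))
        rng.shuffle(order)
        folds = [order[f::k] for f in range(k)]  # k nearly equal folds
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test

splits = list(repeated_kfold(n_examples=250))
print(len(splits))                            # 100
print(round(imbalance_degree(61, 1000), 3))   # 0.061
```

Each of the 100 splits would be used to train one model on `train` and to compute recall, f-measure, g-mean, and AUC on `test`.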
To evaluate the performance of ERT (the proposed method), we compare it with RURF [23], EasyEnsemble [5], BalanceCascade [5], Bagging [28], and C4.5 [29]:
(i) RURF is a class imbalance-oriented version of rotation forest (RF) that learns projection matrices on randomly undersampled (RU) datasets. C4.5 is used as the base learner, and the number of base classifiers is set to 100.
(ii) EasyEnsemble samples T subsets from the normal class and uses AdaBoost with C4.5 as the weak learner to learn M base classifiers on each subset. We set T = M = 10 and therefore 100 trees are learned.
(iii) BalanceCascade is similar to EasyEnsemble except that it removes majority class examples that are correctly classified by the trained learners from further consideration. T and M are both set to 10, and therefore 100 trees are learned.
(iv) Bagging learns each base classifier on a resampled dataset. C4.5 is set to be the weak classifier and the number of base classifiers is set to be 100.
(v) ERT is the proposed method in this paper. Here, we set M = 100; namely, the number of base classifiers is 100. C4.5 is used to train the base classifiers (refer to Pseudocode 1).

Experimental Results.
To evaluate the performance of ERT (the proposed method), ERT is compared with RURF, EasyEnsemble, BalanceCascade, Bagging, and C4.5 (for more details, refer to Section 4.2). The results are reported in four tables and one figure: the tables report the results of the six compared methods on recall, f-measure, g-mean, and AUC, and the figure reports the ranks of the methods on the same measures. In these tables, a bullet (an open circle) next to a result indicates that ERT significantly outperforms (is outperformed by) the respective method (column) on the respective dataset (row) in a pairwise t-test at the 0.05 significance level. The last row of each table gives the average results. The ranks shown in Figure 1 are calculated as follows [30,31]: on a dataset, the best performing algorithm gets the rank of 1.0, the second best gets the rank of 2.0, and so on. In case of ties, average ranks are assigned.
Table 3 and Figure 1(a) show the summarized results and the ranks of the six compared methods on recall, respectively. From Table 3, ERT significantly outperforms both Bagging and C4.5 on all eight medical datasets, and the average recall of ERT is 0.2087 higher than that of C4.5 (recall ∈ [0, 1]). Also, ERT statistically outperforms RURF, EasyEnsemble, and BalanceCascade on eight, seven, and six of the datasets, respectively, and outperforms them on all datasets. Besides, from Figure 1
Table 4 and Figure 1(b) illustrate the summarized results and the ranks of ERT, RURF, EasyEnsemble, BalanceCascade, Bagging, and C4.5 on f-measure, respectively. From Table 4, ERT shows much better performance than the other methods. Specifically, ERT statistically outperforms RURF, EasyEnsemble, BalanceCascade, Bagging, and C4.5 on four, eight, eight, seven, and seven of the eight datasets, respectively. Figure 1(b) shows that ERT wins on six, eight, eight, seven, and seven of the eight datasets, respectively. Besides, ERT is statistically outperformed by RURF, Bagging, and C4.5 on "sick." Combining the results of Table 3 and Figure 1(a), we conclude that ERT obtains high recall on "sick" by sacrificing the precision of the models.
G-mean summaries and the corresponding ranks of ERT, RURF, EasyEnsemble, BalanceCascade, Bagging, and C4.5 are reported in Table 5 and Figure 1(c), respectively. Table 5 shows that ERT significantly outperforms RURF, EasyEnsemble, BalanceCascade, Bagging, and C4.5 on all eight datasets, and Figure 1(c) shows that ERT ranks first with an average rank of 1.0, followed by BalanceCascade (2.9), EasyEnsemble (3.4), RURF (3.5), Bagging (4.5), and C4.5 (5.13).
Table 6 and Figure 1(d) depict AUC and the ranks of ERT, RURF, EasyEnsemble, BalanceCascade, Bagging, and C4.5, respectively. Similar to the results on g-mean, ERT significantly wins on all eight datasets compared with the other methods. The average AUC (ranks) of ERT, RURF,
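The ranking rule used for Figure 1 (rank 1.0 for the best method on a dataset, average ranks in case of ties) can be sketched in Python as follows. The function name is illustrative, and the recall values are hypothetical numbers chosen to show a tie; they are not results from the tables.

```python
def average_ranks(scores):
    """Rank scores so that the highest gets 1.0; tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of the 1-based positions in the group
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks

# Hypothetical recall values of six methods on one dataset.
recalls = [0.91, 0.85, 0.88, 0.88, 0.70, 0.64]
print(average_ranks(recalls))  # [1.0, 4.0, 2.5, 2.5, 5.0, 6.0]
```

Averaging these per-dataset ranks over all datasets gives the average rank of each method.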

Conclusion
In this paper, we propose a novel method called ensemble of rotation trees (ERT), which aims to build accurate and diverse classifiers to handle imbalanced medical data. The main heuristic consists of (1) sampling subsets from the normal class, (2) learning a rotation matrix on each subset, and (3) learning a tree using each subset and the abnormal class set in the new feature space. Experimental results show that ERT performs better than other state-of-the-art classification methods in terms of recall, f-measure, g-mean, and AUC on medical datasets.