Elsevier

Neurocomputing

Volume 276, 7 February 2018, Pages 55-66

Using sub-sampling and ensemble clustering techniques to improve performance of imbalanced classification

https://doi.org/10.1016/j.neucom.2017.06.082

Abstract

The health care system records abundant data about patients. Data mining can extract useful knowledge and hidden patterns from these data, and the discovered knowledge can help physicians and health care managers improve the quality of their services and reduce medical errors. Because it is difficult to diagnose or predict diseases with a single data mining algorithm, this research combines the advantages of several algorithms to achieve better results in terms of efficiency. Most standard learning algorithms are designed for balanced data (data with roughly the same number of samples in each class), where the cost of misclassification is the same for all classes. These algorithms cannot properly represent the data distribution when the dataset is imbalanced. In some cases, misclassifying a sample of a particular class can be very costly, such as labeling cancerous patients as healthy. This article presents a fast and efficient way to learn from imbalanced data, which is especially suitable when the minority class contains very few samples. Experiments show that the proposed method is more effective than traditional machine learning algorithms, as well as several learning algorithms designed specifically for imbalanced data. In addition, it has lower computational complexity and a shorter running time.

Introduction

Data mining methods can help predict diseases automatically with a high accuracy rate, reduce the additional cost of irrelevant clinical trials, reduce wrong predictions caused by human fatigue, and consequently improve the quality of services. Data mining methods that have been successfully applied to medical data include neural networks, decision trees (DT), association rule mining, Bayesian networks, support vector machines (SVM), and clustering. Depending on the application, one of these methods will be more useful than the others, but it is very hard to choose a single data mining algorithm suitable for diagnosing or predicting all diseases. Some algorithms serve certain purposes better than others, and combining the advantages of several algorithms yields better performance. Performance criteria are discussed later in this study; in practice, it is almost impossible to pick one best data mining method for disease prediction under a single criterion such as accuracy, sensitivity, or specificity.

A central obstacle to achieving remarkable diagnostic results is analyzing data and resolving the confusion within them, because the knowledge hidden in the data must be used properly; data mining is a response to this need of health care organizations. The more data there are, and the more complex their relations, the harder it is to access the hidden information. Standard algorithms usually assume that the class distribution is balanced, or nearly so, and that the cost of misclassification is the same for all classes. When the dataset is imbalanced, these algorithms cannot properly capture the data distribution: they tend to assign unknown samples to the more frequent classes, and as a result they provide unacceptable accuracy on the minority classes.

An imbalanced dataset is one whose class distribution is severely skewed. This type of imbalance is called inter-class imbalance (for example, a one-to-one-thousand distribution (1:1000), in which one class completely overwhelms the other). The imbalance is not necessarily between two classes; it may exist among several. In the research community, a dataset in which one class exceeds roughly 65% of the samples may already be considered imbalanced [14], [19], [23], [24].

The distributions of many real-world datasets are imbalanced, so learning algorithms must be modified to extract knowledge from them. One example is data on patients with breast cancer, usually labeled with a positive (cancer) class and a negative (healthy) class. As expected, healthy people far outnumber cancer patients. A classifier is therefore needed that achieves appropriate, balanced prediction accuracy on both the minority and the majority class.

Since diagnosing a cancerous patient as healthy is unacceptable (as, similarly, is diagnosing a healthy person as a patient), decision support systems require modified classifiers. The applied classifier must provide high validity for the minority class without hurting the validity of the majority class. For example, healthy samples may be classified 100% correctly while the accuracy on patients is only 10%, so a patient's sample is very likely to be misdiagnosed. It is therefore clear that single evaluation criteria such as overall accuracy and error rate do not provide enough information about the quality of imbalanced learning. When the imbalance is a direct result of the nature of the data space, it is called inherent imbalance. Imbalance is not always inherent, however; it can also be relative, meaning that the number of minority samples is naturally large but still very small compared to the majority class.
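The misleading nature of overall accuracy described above can be illustrated numerically. The counts below are hypothetical and chosen only for illustration, not taken from the paper's datasets:

```python
# Illustrates why overall accuracy is misleading on imbalanced data.
# Hypothetical counts: 990 healthy (majority) and 10 cancer (minority) samples.
n_majority, n_minority = 990, 10

# A trivial classifier that labels every sample "healthy" is never wrong
# on the majority class and always wrong on the minority class.
correct = n_majority                      # all majority samples correct
total = n_majority + n_minority

accuracy = correct / total                # looks excellent: 99%
minority_recall = 0 / n_minority          # yet no cancer case is detected

print(f"overall accuracy: {accuracy:.1%}")        # 99.0%
print(f"minority recall:  {minority_recall:.1%}")  # 0.0%
```

Despite a 99% overall accuracy, this classifier is clinically useless, which is exactly why per-class criteria are needed for imbalanced learning.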

Data complexity is another important issue; it includes, among other things, class overlapping and missing data. This concept is shown in Fig. 1, where stars and circles represent the minority and majority classes, respectively. Both distributions, shown in parts (A) and (B), are imbalanced, but part (B) also exhibits sample overlapping and multiple concepts. In part (B), sub-concept C may not be learned because of a lack of data.

Another form of imbalance is intra-class imbalance, which concerns the distribution of data representing sub-concepts within a class. In Fig. 1(B), B and C represent the dominant concept and a sub-concept of the minority class, respectively, while A and D are the dominant concept and a sub-concept of the majority class.

For each class, the samples in the dominant cluster greatly outnumber those in the sub-concept. This data space therefore exhibits both inter-class and intra-class imbalance.

In this paper, we present a new method for classifying imbalanced training data and compare it with standard methods such as the nearest neighbor classifier, the decision tree, and the multi-layer perceptron neural network (MLP).

In the following, we review the literature and introduce related work in this area. We then describe the evaluation criteria for these methods and the design of the classification experiments. Finally, we discuss the experimental results and conclude the paper. In summary, the contributions of this article are:

  • A new method for learning from imbalanced data.

  • An efficient method to be used in the decision support system for breast cancer diagnosis.

  • The results of the proposed method on real dataset of breast cancer.

  • A method for the diagnosis of cardiovascular patients.


Related works

In this section, we review the literature and previous work. Throughout the paper, the training set and the number of its samples are denoted by S and m, respectively: S = {(x_i, y_i) | i = 1, …, m}, where x_i ∈ X is a sample in the n-dimensional feature space X = {(f_1, f_2, …, f_n) | f_i ∈ ℝ}, and y_i ∈ Y = {1, …, c} is the class label associated with sample x_i. For example, c = 2 indicates a two-class problem. S_min and S_max are the sample sets of the minority and majority classes, whose union is
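The notation above can be made concrete with a short sketch. The helper below is illustrative only (it is not code from the paper) and assumes a two-class problem, i.e. c = 2:

```python
from collections import Counter

def split_min_max(samples, labels):
    """Partition a two-class dataset S into the minority set S_min
    and the majority set S_max (notation from the text)."""
    counts = Counter(labels)
    # The label with the fewest samples defines the minority class.
    minority_label = min(counts, key=counts.get)
    s_min = [x for x, y in zip(samples, labels) if y == minority_label]
    s_max = [x for x, y in zip(samples, labels) if y != minority_label]
    return s_min, s_max

# Toy data: six majority samples (label 0) vs. two minority samples (label 1).
X = [[i] for i in range(8)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
s_min, s_max = split_min_max(X, y)
print(len(s_min), len(s_max))  # 2 6
```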

Evaluation criteria for imbalanced learning

Given the growing body of research on imbalanced learning, criteria are needed to evaluate the effectiveness of imbalanced learning algorithms. In this part, we examine such evaluation criteria. The conventional criteria are the accuracy rate and the error rate. Although these are simple ways to describe a classifier's performance on a dataset, they are not suitable for imbalanced data. Fig. 3 shows the
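The confusion-matrix criteria commonly used in place of overall accuracy for imbalanced learning (recall/sensitivity, specificity, precision, F-measure, and the geometric mean) can be computed as below. The function and the example counts are illustrative assumptions, not the paper's implementation:

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Confusion-matrix criteria that, unlike overall accuracy,
    reflect performance on the minority (positive) class."""
    recall = tp / (tp + fn)          # sensitivity, true-positive rate
    specificity = tn / (tn + fp)     # true-negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = (recall * specificity) ** 0.5   # geometric mean of the two rates
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "f1": f1, "g_mean": g_mean}

# Hypothetical counts: 8 of 10 patients and 900 of 990 healthy people
# are classified correctly.
m = imbalance_metrics(tp=8, fn=2, fp=90, tn=900)
print(m["recall"], m["g_mean"])
```

A high G-mean requires the classifier to do well on both classes simultaneously, which is exactly the balanced behavior sought here.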

The proposed method

The main structure of the proposed algorithm, named ModifiedBagging, is similar to that of EasyEnsemble. Ensemble clustering has been used many times in medical problems [25], [26], [27], [28], [32], [33]. In this algorithm, we first select a series of sub-samples E_i from S_max with |E_i| = |S_min|. We then define the subsets S_i ⊆ S as S_i = S_min ∪ E_i and train a weak classifier, such as a decision tree, on each S_i. This classifier is denoted DT_i.

In the end, we consider all these DTi as
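Putting these pieces together, the procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a one-feature threshold stump stands in for the decision tree DT_i, and majority voting is assumed as the combination rule, since the excerpt is truncated at that point:

```python
import random

class Stump:
    """A very weak one-feature threshold classifier (a stand-in for
    the decision tree DT_i used in the paper)."""
    def fit(self, X, y):
        # Pick the threshold on feature 0 with the best training accuracy.
        best = (None, -1.0)
        for t in sorted(x[0] for x in X):
            preds = [1 if x[0] >= t else 0 for x in X]
            acc = sum(p == yi for p, yi in zip(preds, y)) / len(y)
            if acc > best[1]:
                best = (t, acc)
        self.t = best[0]
        return self
    def predict(self, x):
        return 1 if x[0] >= self.t else 0

def modified_bagging(X, y, n_rounds=5, seed=0):
    """Sketch of the sub-sampling ensemble: repeatedly draw E_i from
    S_max with |E_i| = |S_min|, train a weak classifier on
    S_i = S_min ∪ E_i, and combine the classifiers by majority vote
    (the voting rule is an assumption here)."""
    rng = random.Random(seed)
    s_min = [(x, yi) for x, yi in zip(X, y) if yi == 1]   # minority class
    s_max = [(x, yi) for x, yi in zip(X, y) if yi == 0]   # majority class
    learners = []
    for _ in range(n_rounds):
        e_i = rng.sample(s_max, len(s_min))   # balanced sub-sample E_i
        s_i = s_min + e_i                     # S_i = S_min ∪ E_i
        xs, ys = zip(*s_i)
        learners.append(Stump().fit(xs, ys))
    def ensemble(x):
        votes = sum(c.predict(x) for c in learners)
        return 1 if votes > len(learners) / 2 else 0
    return ensemble

# Toy example: minority samples (label 1) lie at high feature values.
X = [[v] for v in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 21]]
y = [0] * 10 + [1] * 2
clf = modified_bagging(X, y)
print(clf([25]), clf([0]))  # 1 0
```

Because each base learner sees a balanced S_i, none of the sub-sampled majority data biases it toward the majority class, while the ensemble as a whole still uses most of the majority data across rounds.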

Experiments and results

In this article, we have tried to help physicians by providing a machine learning system that diagnoses cancer in patients.

Conclusion

In this paper, a new method was presented for imbalanced learning. This type of learning targets datasets in which the minority class is much smaller than the majority one. The method was also applied to the breast cancer detection problem.

It was also shown that simple classic learning techniques are unable to learn this type of dataset (imbalanced cancer datasets). In addition, owing to the scarcity of minority-class data, even special-purpose methods underperform when learning from imbalanced data.

Results of

Acknowledgement

We thank the Yasooj Branch, Islamic Azad University, Yasooj, Iran, for supporting this research.


References (33)

  • M. Hamzei, M.R. Kangavari, Learning from Imbalanced Data, Technical Report, Iran University of Science and Technology,...
  • F. Minaei, M. Soleimanian, D. Kheirkhah, Investigation the relationship between risk factors of occurrence of breast...
  • N.V. Chawla et al., SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. (2002)
  • H. He et al., ADASYN: adaptive synthetic sampling approach for imbalanced learning
  • G.E.A.P.A. Batista et al., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newsl. (2004)
  • T. Jo et al., Class imbalances versus small disjuncts, ACM SIGKDD Explor. Newsl. (2004)

S. Nejatian obtained a Bachelor's degree in Electrical Engineering. He received the Master's degree (M.Eng) in Telecommunication Technology and the PhD degree in Data Communication from the University Technology Malaysia in 2008 and 2014, respectively. He holds an Assistant Professor position at the Faculty of Electrical Engineering, Islamic Azad University, Yasooj Branch, Yasooj, Iran. His research interests are in Cognitive Radio Networks, Software Defined Radio, and Wireless Sensor Networks. He is a registered member of professional organizations such as IEEE and IET.

Eshagh Faraji is a PhD student in the Electrical Engineering Department, Islamic Azad University, Yasooj Branch, Yasooj, Iran. His research interests are in the areas of Data Mining, Artificial Intelligence, and Dispatching.

H. Parvin received a B.E. degree from Shahid Chamran University, Ahvaz, Iran, in 2006 and an M.S. degree from Iran University of Science and Technology, Tehran, Iran, in 2008. From 2008 to 2013, he worked in the Data Mining Research Lab, Iran University of Science and Technology, Tehran, Iran, where he then received his Ph.D. degree. His research interests include data mining, machine learning, and ensemble learning.
