Elsevier

Neurocomputing

Volume 276, 7 February 2018, Pages 55-66

Using sub-sampling and ensemble clustering techniques to improve performance of imbalanced classification

https://doi.org/10.1016/j.neucom.2017.06.082

Abstract

The health care system records abundant data about patients. Data mining can extract useful knowledge and hidden patterns from these data, and the discovered knowledge can help physicians and health care managers improve the quality of their services and reduce medical errors. Because it is difficult to diagnose or predict diseases with a single data mining algorithm, this research combines the advantages of several algorithms to achieve better results in terms of efficiency. Most standard learning algorithms are designed for balanced data (data with roughly the same number of samples in each class), where the cost of misclassification is the same for all classes. These algorithms cannot properly represent the data distribution when the dataset is imbalanced. In some cases, misclassifying a sample of a particular class can be very costly, such as labeling cancerous patients as healthy. This article presents a fast and efficient way to learn from imbalanced data, which is especially suitable when the minority class contains very few samples. Experiments show that the proposed method is more effective than traditional machine learning algorithms, as well as several learning algorithms designed specifically for imbalanced data. In addition, it has lower computational complexity and a shorter running time.

Introduction

Data mining methods can help predict diseases automatically with a high accuracy rate, reduce the additional cost of irrelevant clinical trials, reduce wrong predictions caused by human fatigue, and consequently improve the quality of services. Data mining methods that have been successfully applied to medical data include neural networks, decision trees (DT), association rule mining, Bayesian networks, support vector machines (SVM), and clustering. Depending on the application, one of these methods will be more useful than the others, but it is very hard to choose a single data mining algorithm suitable for diagnosing or predicting all diseases. Some algorithms serve certain purposes better than others, and combining the advantages of several algorithms yields better performance. Performance criteria are discussed later in this study; in practice, it is almost impossible to pick one best data mining method for disease prediction under a single criterion such as accuracy, sensitivity, or specificity.

A central obstacle to achieving remarkable diagnostic results is analyzing data and resolving the confusion within them, because the knowledge hidden in the data must be used properly; data mining is a response to this need of health care organizations. The more data there are, and the more complex their relations, the harder it is to access the hidden information. Standard algorithms usually assume that the class distribution is balanced, or nearly so, and that the cost of misclassification is the same for all classes. When the dataset is imbalanced, these algorithms cannot properly capture the data distribution: they tend to assign unknown samples to the more frequent classes, and as a result they provide unacceptable accuracy on the minority classes.

An imbalanced dataset is one whose class distribution is severely skewed. This type of imbalance is called inter-class imbalance (for example, a one-to-one-thousand distribution (1:1000), in which one class completely overwhelms the other). The imbalance is not necessarily between two classes; it may exist among several. In the research community, a dataset in which one class exceeds roughly 65% of the samples may already be considered imbalanced [14], [19], [23], [24].

The distributions of many real-world datasets are imbalanced, so learning algorithms must be modified to extract knowledge from them. One example is data on patients with breast cancer, usually labeled with a positive (cancer) class and a negative (healthy) class. As expected, healthy people far outnumber cancer patients. A classifier is therefore needed that achieves appropriate, balanced prediction accuracy on both the minority and the majority class.

Since diagnosing a cancerous patient as healthy is unacceptable (as, similarly, is diagnosing a healthy person as a patient), decision support systems require modified classifiers. The applied classifier must provide high validity for the minority class without hurting the validity of the majority class. For example, healthy samples may be classified 100% correctly while the accuracy on patients is only 10%, so a patient's sample is very likely to be misdiagnosed. It is therefore clear that single evaluation criteria such as overall accuracy and error rate do not provide enough information about the quality of imbalanced learning. When the imbalance is a direct result of the nature of the data space, it is called inherent imbalance. Imbalance is not always inherent, however; it can also be relative, meaning that the number of minority samples is naturally large but still very small compared to the majority class.
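The misleading nature of overall accuracy described above can be illustrated numerically. The counts below are hypothetical and chosen only for illustration, not taken from the paper's datasets:

```python
# Illustrates why overall accuracy is misleading on imbalanced data.
# Hypothetical counts: 990 healthy (majority) and 10 cancer (minority) samples.
n_majority, n_minority = 990, 10

# A trivial classifier that labels every sample "healthy" is never wrong
# on the majority class and always wrong on the minority class.
correct = n_majority                      # all majority samples correct
total = n_majority + n_minority

accuracy = correct / total                # looks excellent: 99%
minority_recall = 0 / n_minority          # yet no cancer case is detected

print(f"overall accuracy: {accuracy:.1%}")        # 99.0%
print(f"minority recall:  {minority_recall:.1%}")  # 0.0%
```

Despite a 99% overall accuracy, this classifier is clinically useless, which is exactly why per-class criteria are needed for imbalanced learning.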

Data complexity is another important issue; it includes, among other things, class overlapping and missing data. This concept is shown in Fig. 1, where stars and circles represent the minority and majority classes, respectively. Both distributions, shown in parts (A) and (B), are imbalanced, but part (B) also exhibits sample overlapping and multiple concepts. In part (B), sub-concept C may not be learned because of a lack of data.

Another form of imbalance is intra-class imbalance, which concerns the distribution of data representing sub-concepts within a class. In Fig. 1(B), B and C represent the dominant concept and a sub-concept of the minority class, respectively, while A and D are the dominant concept and a sub-concept of the majority class.

For each class, the samples in the dominant cluster greatly outnumber those in the sub-concept. This data space therefore exhibits both inter-class and intra-class imbalance.

In this paper, we present a new method for classifying imbalanced training data and compare it with standard methods such as the nearest neighbor classifier, the decision tree, and the multi-layer perceptron neural network (MLP).

In the following, we review the literature and introduce related work in this area. We then describe the evaluation criteria for these methods and the design of the classification experiments. Finally, we discuss the experimental results and conclude the paper. In summary, the contributions of this article are:

  • A new method for learning from imbalanced data.

  • An efficient method to be used in the decision support system for breast cancer diagnosis.

  • The results of the proposed method on real dataset of breast cancer.

  • A method for the diagnosis of cardiovascular patients.


Related works

In this section, we review the literature and previous work. Throughout the paper, the training set and the number of its samples are denoted by S and m, respectively: S = {(x_i, y_i) | i = 1, …, m}, where x_i ∈ X is a sample in the n-dimensional feature space X = {(f_1, f_2, …, f_n) | f_i ∈ ℝ}, and y_i ∈ Y = {1, …, c} is the class label associated with sample x_i. For example, c = 2 indicates a two-class problem. S_min and S_max are the sample sets of the minority and majority classes, whose union is
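The notation above can be made concrete with a short sketch. The helper below is illustrative only (it is not code from the paper) and assumes a two-class problem, i.e. c = 2:

```python
from collections import Counter

def split_min_max(samples, labels):
    """Partition a two-class dataset S into the minority set S_min
    and the majority set S_max (notation from the text)."""
    counts = Counter(labels)
    # The label with the fewest samples defines the minority class.
    minority_label = min(counts, key=counts.get)
    s_min = [x for x, y in zip(samples, labels) if y == minority_label]
    s_max = [x for x, y in zip(samples, labels) if y != minority_label]
    return s_min, s_max

# Toy data: six majority samples (label 0) vs. two minority samples (label 1).
X = [[i] for i in range(8)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
s_min, s_max = split_min_max(X, y)
print(len(s_min), len(s_max))  # 2 6
```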

Evaluation criteria for imbalanced learning

Given the growing body of research on imbalanced learning, criteria are needed to evaluate the effectiveness of imbalanced learning algorithms. In this part, we examine such evaluation criteria. The conventional criteria are the accuracy rate and the error rate. Although these are simple ways to describe a classifier's performance on a dataset, they are not suitable for imbalanced data. Fig. 3 shows the
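The confusion-matrix criteria commonly used in place of overall accuracy for imbalanced learning (recall/sensitivity, specificity, precision, F-measure, and the geometric mean) can be computed as below. The function and the example counts are illustrative assumptions, not the paper's implementation:

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Confusion-matrix criteria that, unlike overall accuracy,
    reflect performance on the minority (positive) class."""
    recall = tp / (tp + fn)          # sensitivity, true-positive rate
    specificity = tn / (tn + fp)     # true-negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = (recall * specificity) ** 0.5   # geometric mean of the two rates
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "f1": f1, "g_mean": g_mean}

# Hypothetical counts: 8 of 10 patients and 900 of 990 healthy people
# are classified correctly.
m = imbalance_metrics(tp=8, fn=2, fp=90, tn=900)
print(m["recall"], m["g_mean"])
```

A high G-mean requires the classifier to do well on both classes simultaneously, which is exactly the balanced behavior sought here.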

The proposed method

The main structure of the proposed algorithm, named ModifiedBagging, is similar to that of EasyEnsemble. Ensemble clustering has been used many times in medical problems [25], [26], [27], [28], [32], [33]. In this algorithm, we first select a series of sub-samples E_i from S_max with |E_i| = |S_min|. We then define the subsets S_i ⊆ S as S_i = S_min ∪ E_i and train a weak classifier, such as a decision tree, on each S_i. This classifier is denoted DT_i.

In the end, we consider all these DTi as
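Putting these pieces together, the procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a one-feature threshold stump stands in for the decision tree DT_i, and majority voting is assumed as the combination rule, since the excerpt is truncated at that point:

```python
import random

class Stump:
    """A very weak one-feature threshold classifier (a stand-in for
    the decision tree DT_i used in the paper)."""
    def fit(self, X, y):
        # Pick the threshold on feature 0 with the best training accuracy.
        best = (None, -1.0)
        for t in sorted(x[0] for x in X):
            preds = [1 if x[0] >= t else 0 for x in X]
            acc = sum(p == yi for p, yi in zip(preds, y)) / len(y)
            if acc > best[1]:
                best = (t, acc)
        self.t = best[0]
        return self
    def predict(self, x):
        return 1 if x[0] >= self.t else 0

def modified_bagging(X, y, n_rounds=5, seed=0):
    """Sketch of the sub-sampling ensemble: repeatedly draw E_i from
    S_max with |E_i| = |S_min|, train a weak classifier on
    S_i = S_min ∪ E_i, and combine the classifiers by majority vote
    (the voting rule is an assumption here)."""
    rng = random.Random(seed)
    s_min = [(x, yi) for x, yi in zip(X, y) if yi == 1]   # minority class
    s_max = [(x, yi) for x, yi in zip(X, y) if yi == 0]   # majority class
    learners = []
    for _ in range(n_rounds):
        e_i = rng.sample(s_max, len(s_min))   # balanced sub-sample E_i
        s_i = s_min + e_i                     # S_i = S_min ∪ E_i
        xs, ys = zip(*s_i)
        learners.append(Stump().fit(xs, ys))
    def ensemble(x):
        votes = sum(c.predict(x) for c in learners)
        return 1 if votes > len(learners) / 2 else 0
    return ensemble

# Toy example: minority samples (label 1) lie at high feature values.
X = [[v] for v in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 21]]
y = [0] * 10 + [1] * 2
clf = modified_bagging(X, y)
print(clf([25]), clf([0]))  # 1 0
```

Because each base learner sees a balanced S_i, none of the sub-sampled majority data biases it toward the majority class, while the ensemble as a whole still uses most of the majority data across rounds.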

Experiments and results

In this article, we have tried to help physicians by providing a machine learning system that diagnoses cancer in patients.

Conclusion

In this paper, a new method was presented for imbalanced learning. This type of learning targets datasets in which the minority class is much smaller than the majority one. The method was also applied to the breast cancer detection problem.

It was also shown that simple classic learning techniques are unable to learn this type of dataset (imbalanced cancer datasets). In addition, owing to the scarcity of minority-class data, even special-purpose methods underperform when learning from imbalanced data.

Results of

Acknowledgement

We thank the Yasooj Branch, Islamic Azad University, Yasooj, Iran, for supporting this research.


References (33)

  • M. Hamzei, M.R. Kangavari, Learning from Imbalanced Data, Technical Report, Iran University of Science and Technology,...
  • F. Minaei, M. Soleimanian, D. Kheirkhah, Investigation the relationship between risk factors of occurrence of breast...
  • N.V. Chawla et al., SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. (2002)
  • H. He et al., ADASYN: adaptive synthetic sampling approach for imbalanced learning
  • G.E.A.P.A. Batista et al., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newsl. (2004)
  • T. Jo et al., Class imbalances versus small disjuncts, ACM SIGKDD Explor. Newsl. (2004)

S. Nejatian obtained a Bachelor's degree in Electrical Engineering. He received the Master's degree (M.Eng) in Telecommunication Technology and the PhD degree in Data Communication from the University Technology Malaysia in 2008 and 2014, respectively. He holds an Assistant Professor position at the Faculty of Electrical Engineering, Islamic Azad University, Yasooj Branch, Yasooj, Iran. His research interests are in Cognitive Radio Networks, Software Defined Radio, and Wireless Sensor Networks. He is a registered member of professional organizations such as IEEE and IET.

Eshagh Faraji is a PhD student in the Electrical Engineering Department, Islamic Azad University, Yasooj Branch, Yasooj, Iran. His research interests are in the areas of Data Mining, Artificial Intelligence, and Dispatching.

H. Parvin received a B.E. degree from Shahid Chamran University, Ahvaz, Iran, in 2006 and an M.S. degree from Iran University of Science and Technology, Tehran, Iran, in 2008. From 2008 to 2013, he worked in the Data Mining Research Lab, Iran University of Science and Technology, Tehran, Iran, where he then received his Ph.D. degree. His research interests include data mining, machine learning, and ensemble learning.
