1 Introduction

Class imbalance (Ali et al., 2015; Japkowicz & Stephen, 2002) is a primary concern when building classifiers, as the under-representation of one class can lead to sub-optimal predictive accuracy on future samples. In our previous work (Orhobor et al., 2020), we argued that the same can be said for the distribution of the response one is interested in modelling in regression problems. The idea is that, for a given response, certain parts of the distribution might be under-represented, making it difficult for a model built on this data to make accurate predictions for unseen samples that fall in the under-represented regions. To this end, we proposed an ensemble learning approach which takes the underlying response distribution into account.

Assume we have some learning data, a discretizer of choice, and a bin size n. For the base case, where \(n = 1\), we build a single regressor. For every other value of n, where \(n > 1\) and up to some maximum bin size, we split the learning data into n bins using the discretizer. We then build a regressor on each of the n resulting subsets (one per bin), and a classifier which predicts the probability that a new sample belongs to each of the bins. This classifier is used to combine the predictions from the independent regressors for an unseen sample. Given unseen data, the regression models produce n predictions for each bin size n, one per bin, which are combined using the bin probabilities provided by the corresponding classifier. Finally, these combined predictions are aggregated using weights determined on the learning set. The intuition is that this process should leverage the predictive accuracy of both the base and the binned models, greatly reducing the possibility of performing worse than the base case irrespective of the choice of discretizer, which improves upon our previous work (Orhobor et al., 2020). The proposed algorithm is illustrated in Fig. 1 and formalised in Sect. 3. We demonstrate empirically that it is highly unlikely to perform worse than the base case.
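Written out explicitly (in the notation of Fig. 1, where \(R_i^j\) denotes the regressor for the jth bin at bin size i and \(p_i^j(x)\) the probability, estimated by classifier \(C_i\), that sample x falls in that bin), the prediction at bin size i can be taken as the probability-weighted sum \(\hat{y}_i(x) = \sum _{j=1}^{i} p_i^j(x)\, R_i^j(x)\), with \(\hat{y}_1(x) = R_1^1(x)\) as the base case. The final output is then \(\hat{y}(x) = \sum _{i=1}^{s} w_i\, \hat{y}_i(x)\), where s is the maximum bin size and the weights \(w_i\) are learned on the learning set (in our evaluation they are constrained to be non-negative and to sum to 1; see Sect. 4.2).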

Fig. 1

Representation of the proposed approach when the maximum bin size is 3. Note that the weights (\(w_1, w_2, w_3\)) are learned by stacking, using cross-validation on the learning set. \(R_i^j\) represents the regressor for the jth bin when the bin size is i. \(p_i^j\) is the probability predicted by the classifier \(C_i\) that a new sample belongs to the jth bin when the bin size is i. In the diagram, \(R_1^1\) is the base case regressor and does not have a classifier

In the next section, we discuss the techniques that make up the different parts of the proposed extension and the algorithm in general. In Sect. 3 we describe the new algorithm in detail. The evaluation setup is described in Sect. 4 and the results are presented in Sect. 5. We discuss possible improvements to the proposed algorithm and conclude in Sect. 6.

2 Background

Ensemble learning, that is, the aggregation of the predictions made by multiple learning algorithms, plays a crucial role in both the original algorithm and the extension we propose. Specifically, we use ensembles in (i) aggregating the predictions made within a given bin size that is not the base case (bin size = 1), and (ii) learning weights to combine the predictions made for each bin size. In the former, there is one classifier and n regressors for each bin size n greater than 1. For a new sample, the regressors are used to predict the response, while the classifier is used to determine how much we can trust those predictions, weighting them accordingly (see Fig. 1). This differs from a more traditional ensemble learning procedure such as stacking, in which meta-features are generated and a meta-learner is built on them to aggregate the predictions. The scheme is more closely related to the concept of local classifiers in hierarchical classification (Silla & Freitas, 2011). Furthermore, it can be thought of as a form of dynamic weighting, where instead of a separate meta-feature generation and aggregation step (Mendes-Moreira et al., 2012), the dynamic weighting is implicit in the classifier-regressor aggregation paradigm.

We use a traditional stacking procedure to learn the weights that combine the predictions for each bin size (from the base case, where the bin size = 1, to the determined maximum bin size). We generate a meta-feature for each bin size using cross-validation and determine the aggregating weights using a meta-learner. During the testing phase, these weights are applied uniformly to all samples in a test set, making it a static weighting procedure, in contrast to dynamic weighting, where a unique weight is learned for each individual test sample. This approach was chosen to reduce the overall computational complexity of the extended algorithm, although it is possible that a dynamic weighting paradigm might improve predictive accuracy. We do not prune the meta-feature space, which is one of the main processes in a stacking procedure (Mendes-Moreira et al., 2012), because the meta-feature space is not expected to be large enough to require pruning.
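As a minimal illustration (not the exact code used in our experiments), convex combining weights can be fitted to the cross-validated meta-features with a quadratic programming solver such as the one in the quadprog R package; the function and variable names below are our own.

```r
# Minimal sketch: fit convex combining weights w (w >= 0, sum(w) = 1) to
# cross-validated meta-features. Z is an n x s matrix of out-of-fold
# predictions, one column per bin size; y is the length-n response vector.
library(quadprog)

learn_convex_weights <- function(Z, y) {
  s    <- ncol(Z)
  Dmat <- crossprod(Z)            # t(Z) %*% Z; add a small ridge if singular
  dvec <- drop(crossprod(Z, y))   # t(Z) %*% y
  # Constraint columns: sum(w) = 1 (equality), then w_i >= 0 for each i
  Amat <- cbind(rep(1, s), diag(s))
  bvec <- c(1, rep(0, s))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}

# At test time the aggregated prediction is simply preds %*% w, where preds
# holds the per-bin-size predictions for the test samples.
```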

Response discretization is a fundamental part of our algorithm, for which both supervised and unsupervised methods have previously been proposed (Dash et al., 2011; Dougherty et al., 1995). Although both types of approach are compatible with what we propose, we considered only unsupervised techniques in our evaluation to reduce the computational cost of our experiments. However, one can conjecture that if supervised methods produce cleaner delineations between the different bins for a given response, they could improve overall predictive accuracy, as the class data on which the classifiers are built would be less noisy. There are other methods which also perform regression by means of classification in an ensemble setting—the work by Ahmad et al. (2018), Halawani et al. (2012) and Gonzalez et al. (2012), for example, is most closely related to what we propose. Ahmad et al. focus on extreme randomized discretization, and use the minimum or maximum of the training data points and the bin boundary to estimate the prediction for a new sample, in contrast to our multi-bin and classifier-regressor paradigm. The approach by Gonzalez et al. focuses on multi-variate spatio-temporal data and differs from what we propose in two important ways: (a) the authors are interested in classifying bands of attributes prior to performing regression, and (b) the final prediction for an unseen sample is obtained by taking the median of the predictions made by the best models, which are selected using cross-validation.
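For concreteness, the following are minimal sketches of the two unsupervised discretizers used later in our evaluation (Sect. 4.2); the function names are our own and error handling is largely omitted.

```r
# (i) Frequency (equal-frequency) intervals: boundaries at empirical quantiles.
discretize_frequency <- function(y, n) {
  breaks <- quantile(y, probs = seq(0, 1, length.out = n + 1))
  # Duplicated breaks collapse bins, so fewer than n bins may be returned
  # (the feasibility issue handled in Sect. 3.1).
  cut(y, breaks = unique(breaks), include.lowest = TRUE, labels = FALSE)
}

# (ii) Clustering: one-dimensional k-means on the response values.
discretize_clustering <- function(y, n) {
  kmeans(y, centers = n)$cluster   # cluster labels serve as (unordered) bin ids
}

# Example: bins <- discretize_frequency(y, 3) assigns each sample to a bin in 1..3
```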

An alternative approach to treating imbalanced data is to redress the ratio of “rare” and “common” values in a dataset by strategic resampling. Inspired by the analogous problem of imbalanced classes in classification, Torgo et al. (2015) proposed two resampling strategies for continuous outcomes in regression analysis. Both methods rely on the notion of relevance (Torgo & Ribeiro, 2007), a function mapping the output variable domain to [0, 1]. Relevance is inversely proportional to the probability density of the output variable, with rarer observations having higher relevance and more common observations having lower relevance. Choosing a relevance threshold allows one to divide the dataset into two subsets: “common” observations and “rare” observations. Torgo et al. propose two schemes: (i) undersampling of common observations (undersampler) and (ii) undersampling of common observations combined with oversampling of rare observations (SmoteR). Both schemes attempt to create a more balanced dataset, allowing for a better representation of values in low probability density areas. Undersampling consists of randomly selecting a subset of common observations of a pre-specified size. In oversampling, additional “synthetic” rare observations are created by interpolation between an existing rare case and one of its k nearest neighbours. Any regressor can then be applied to the new, re-balanced dataset.
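The following is a minimal, illustrative sketch of the SmoteR oversampling idea, not the uba package implementation used in our comparison; here the response of a synthetic case is a simple convex combination of the two parent responses, whereas the original method uses a distance-weighted average.

```r
# rare_x is a numeric matrix of observations flagged as rare by the relevance
# function, rare_y the corresponding responses; n_new synthetic cases are made.
smoter_oversample <- function(rare_x, rare_y, k = 5, n_new = 100) {
  synth_x <- matrix(NA_real_, n_new, ncol(rare_x))
  synth_y <- numeric(n_new)
  seeds   <- sample.int(nrow(rare_x), n_new, replace = TRUE)
  for (i in seq_len(n_new)) {
    a  <- seeds[i]
    d  <- sqrt(colSums((t(rare_x) - rare_x[a, ])^2))  # distances to all rare cases
    nn <- order(d)[2:(k + 1)]                         # k nearest, excluding itself
    b  <- nn[sample.int(k, 1)]
    lambda <- runif(1)
    synth_x[i, ] <- rare_x[a, ] + lambda * (rare_x[b, ] - rare_x[a, ])
    # Simple convex combination for the response (the original method uses a
    # distance-weighted average of the two parent responses).
    synth_y[i] <- (1 - lambda) * rare_y[a] + lambda * rare_y[b]
  }
  list(x = synth_x, y = synth_y)
}
```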

Both undersampling and SmoteR are geared towards datasets with rare values concentrated in the tail(s) of the distribution. For such distributions, the relevance function is easily calculated and has one or two “relevance bumps”—areas of high relevance (local maxima). However, for more sophisticated distributions (e.g. bimodal), the user has to provide bespoke parameters for the construction of the relevance function. This is in contrast to our method, where the binning of the data is automated via a chosen discretization method. We note that these resampling methods can also be used as part of our proposed framework (see Sect. 6).

3 Methodology

3.1 Algorithm

The original algorithm is described formally in Algorithm 1 and the extension in Algorithm 2. We also provide an informal description of the extended algorithm below and a descriptive figure (see Fig. 2). Like the original algorithm, the proposed extension is split into a build and a prediction phase. In the build phase, the required inputs are a learning set with a continuous response one is interested in modelling, a specified maximum bin size, an unsupervised discretizer, and classification and regression algorithms of one’s choosing. The following steps are performed:

  1.

    Discretize the learning set into n bins, where n runs from 1 to the maximum bin size. Note that \(n=1\) is the base case and so no discretization is performed. For example, for a maximum bin size of 3, we end up with three response sets: set 1—base case, set 2—learning set discretized into 2 bins, and set 3—learning set discretized into 3 bins (see Fig. 1).

  2.

    Build a regressor for the base case. For response sets 2 up to the specified maximum bin size, build a single classifier which predicts which bin a new sample belongs to, and build a regressor for each bin. Therefore, a single classifier and two regressors are built for response set 2, a single classifier and three regressors are built for response set 3, and so on.

  3.

    Learn the weights that will be used to combine the predictions from the different response sets using the standard stacking procedure. That is, for each response set, generate a meta-feature using the previous two steps and learn combining weights that will be applied to the predictions made for unseen samples. Using these meta-features, also identify the single best performing bin size, which may be the base case. A condensed code sketch of the build phase is given after this list.
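The following condensed sketch (in R) illustrates steps 1 and 2 of the build phase. Here fit_regressor() and fit_classifier() are placeholders for whichever learning algorithms are chosen, discretize() is one of the discretizers sketched in Sect. 2, and upsampling of the bin classes and hyperparameter tuning are omitted for brevity.

```r
build_models <- function(x, y, max_bins, discretize) {
  models <- vector("list", max_bins)
  # Response set 1: the base case, a single regressor and no classifier.
  models[[1]] <- list(regressors = list(fit_regressor(x, y)), classifier = NULL)
  for (n in 2:max_bins) {
    bins       <- discretize(y, n)                 # bin label for every sample
    classifier <- fit_classifier(x, factor(bins))  # predicts bin membership
    regressors <- lapply(seq_len(n), function(j)
      fit_regressor(x[bins == j, , drop = FALSE], y[bins == j]))
    models[[n]] <- list(regressors = regressors, classifier = classifier)
  }
  models
}
```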

It is worth noting that, depending on the choice of discretizer and response, one can specify a maximum bin size that is too large, such that the discretizer fails to produce the specified number of bins. In this case, the algorithm can be implemented so that it automatically detects the maximum achievable bin size before any models are built. This can be achieved by splitting the learning set response into random cross-validation folds, then holding out each fold and attempting to discretize the remaining values at the specified maximum bin size. The minimum discretization size achieved during this cross-validation procedure is chosen as the new maximum. This is how we implemented the algorithm in our evaluation. Given a new sample, the following is performed in the prediction phase:

  1.

    Make predictions for the base case using the base case regressor. Then, for response sets 2 up to the maximum bin size, make predictions using the regressors and aggregate them using the bin probabilities from the paired classifier.

  2.

    After the previous step, one is left with a number of predictions equal to the maximum bin size. These predictions are then aggregated using the weights from step 3 of the build phase. The single best performing bin size can also be identified using the results from that step. A code sketch of this prediction phase, matching the build-phase sketch above, follows this list.
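In the sketch below, predict_reg() and predict_probs() are placeholders for the predict methods of the chosen learners, w is the vector of convex weights learned in step 3 of the build phase, and the per-bin-size combination is a probability-weighted sum of the per-bin regressor predictions, as described in Sect. 2.

```r
predict_extended <- function(models, w, x_new) {
  s <- length(models)
  per_bin_size <- sapply(seq_len(s), function(n) {
    if (n == 1)
      return(predict_reg(models[[1]]$regressors[[1]], x_new))    # base case
    probs <- predict_probs(models[[n]]$classifier, x_new)        # n columns
    preds <- sapply(models[[n]]$regressors, predict_reg, x_new)  # n columns
    rowSums(probs * preds)          # probability-weighted combination per bin size
  })
  drop(per_bin_size %*% w)          # weighted aggregation across bin sizes
}
```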

Fig. 2

Data-centric representation of the proposed approach where the max bin size equals 2

3.2 Considerations

A major consideration is practicality in terms of the computational expense of building the models. Suppose one has a response of interest and let s be the determined maximum bin size. This means that \(s-1\) classifiers and \(\sum _{l=1}^{s}l\) regressors will be built, making a total of \((s-1) + \sum _{l=1}^{s}l\sim s^2\) models, excluding any additional models built to determine the best hyperparameters for the chosen learning algorithms. Given the k cross-validation folds required to learn the combining weights, the total number of models built during the learning phase is \(k((s-1) + \sum _{l=1}^{s}l) + (s-1) + \sum _{l=1}^{s}l \sim ks^2\), or \(\mathcal {O}(s^2)\), which is more expensive than the 2s, or \(\mathcal {O}(s)\), models built in the original algorithm. The obvious benefits of this additional computational cost are that (i) one need not know the best performing bin size a priori, and can specify an arbitrarily large value for the maximum bin size, which the algorithm will automatically reduce to the largest viable value, and (ii) between the weighted prediction and the identified best bin, one is guaranteed to never do significantly worse than the base case.
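For illustration, with the settings used in our evaluation (\(s = 10\) and \(k = 5\); see Sect. 4.2), this amounts to \((s-1) + \sum _{l=1}^{s}l = 9 + 55 = 64\) models per pass over the learning data, and \(5 \times 64 + 64 = 384\) models in total, compared with \(2s = 20\) for the original algorithm (in both cases excluding any hyperparameter tuning).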

Implicit in this implementation is the assumption that one will use upsampling to resolve class imbalance across the bins. Empirical evidence from our previous work supports this choice: we found that upsampling outperformed all other methods for handling class imbalance in this setting (Orhobor et al., 2020). However, replacing it with one’s preferred method is trivial. The only other major decision left to the discretion of the user is the choice of discretizer, which can be either supervised or unsupervised.
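As a minimal illustration (a sketch rather than the exact code used in our experiments), simple upsampling can be implemented by sampling each bin's rows with replacement up to the size of the largest bin before fitting the classifier; fit_classifier is the same placeholder as in the build-phase sketch above.

```r
upsample_bins <- function(bins) {
  target <- max(table(bins))
  unlist(lapply(unique(bins), function(b) {
    rows  <- which(bins == b)
    # Draw extra rows with replacement until this bin matches the largest bin
    extra <- sample.int(length(rows), target - length(rows), replace = TRUE)
    c(rows, rows[extra])
  }))
}

# Usage: idx <- upsample_bins(bins); fit_classifier(x[idx, ], factor(bins[idx]))
```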

Algorithm 1: the original algorithm.
Algorithm 2: the proposed extension.

4 Evaluation setup

4.1 Datasets

We evaluated the proposed algorithm and the two resampling methods using 119 responses from a variety of datasets. Specifically, we considered the 20 genes from the LINCS Phase II dataset (Koleti & Terryn, 2017) which we used in the evaluation of the original algorithm (Algorithm 1), 46 traits from the yeast dataset used by Grinberg et al. (2020), 24 drug targets from quantitative structure-activity relationship problems (Olier et al., 2018), 14 responses from the OpenML AutoML Regression Benchmark (Bischl et al., 2017), and 15 responses from work by Branco et al. (2019). We refer to these dataset collections as gene expression, yeast, QSAR, OpenML, and Branco, respectively, during the discussion of our evaluation in Sect. 5. The exact versions of these datasets used in our experiments are available for download from http://dx.doi.org/10.17632/mpvwnhv4vb.2. We used a 70–30% learning and test set split in our evaluation. See Tables in "Appendix A" for the distribution of imbalance in the considered responses as described by Branco et al. (2019).

4.2 Learning setup

We considered the unsupervised frequency interval and clustering (k-means) discretizers (Dougherty et al., 1995). The specified maximum bin size was 10, and 5-fold cross-validation was used in the internal stacking procedure to learn the weights and identify the best performing individual bin size on the learning set. We used convex linear regression as our weight learning algorithm, which is multiple linear regression with the constraint that its coefficients are non-negative and sum to 1 (Breiman, 1996). For the classifiers, we used random forests (RF) (Breiman, 2001), naive Bayes (NB) (Rish, 2001), decision trees (DT) (Quinlan, 1986), and k-nearest neighbours (KNN) (Altman, 1992). For the regressors, we used the least absolute shrinkage and selection operator (LS) (Tibshirani, 1996), ridge regression (RR) (Hoerl & Kennard, 1970), RF, and eXtreme gradient boosting (XGB) (Chen & Guestrin, 2016). The RF models were built with 1000 trees, a third of the number of attributes was used as the number of splitting variables considered at each node, and the defaults were used for everything else. The KNN models were tuned for the k parameter (the number of neighbours) over the values (1, 3, 5, 7, 11, 13, 15). The regularization parameter for both LS and RR was tuned using 10-fold internal cross-validation. Lastly, the learning rate of the XGB models was tuned over (0.001, 0.01, 0.1, 0.2, 0.3) and the number of rounds was set to 1500. Default values were used for all other parameters. Note that, of the four regression models, only RF is capable of handling non-binary categorical input variables; hence, LS, RR and XGB were not applied to datasets which included such inputs. Predictive performance was measured as the coefficient of determination (\(R^2\)). All experiments were performed in R (Ihaka & Gentleman, 1996), and the code used in our experiments is available at https://github.com/oghenejokpeme/EFERUC. All our results are available for independent analysis at http://dx.doi.org/10.17632/mpvwnhv4vb.2.

4.3 SmoteR and undersampler

We compared our proposed method with two resampling approaches from Torgo et al. (2015), SmoteR and undersampler. For SmoteR, we used a relevance threshold of 0.7, \(k=5\), and all combinations of undersampling rates of \((50, 100, 200, 300)\%\) and oversampling rates of \((200, 500)\%\). For undersampler, we used a relevance threshold of 0.7 and resampling rates of \((50, 100, 200, 300)\%\). These parameter combinations were used in the original paper (Torgo et al., 2015). We applied the four regression methods (LS, RR, RF and XGB), tuned as detailed above, to each resampled dataset and report the \(R^2\) on the test set for the best performing combination of resampling parameters. For the majority of outputs, the relevance function was generated automatically (the phi function in the uba package in R) with either one or two relevance bumps corresponding to the tail(s) of the distribution. However, some outputs required bespoke relevance functions, with areas of importance identified through inspection of the output histograms. We note that the undersampling algorithm and SmoteR failed in instances where no extreme values were identified by the relevance function, and in some rare cases where a dataset contained discrete or categorical input variables with very low-frequency values.

5 Results

5.1 Base regressor performance

We found that the overall best performing regressor in the base case, for both the frequency and clustering based discretization approaches, was XGB, with RF a close second (see Tables 1 and 2). One might ask why we have separated the base case results by discretizer, given that the results should be the same either way. This is because we only included base case results where the proposed approach could be successfully executed for at least one regressor-classifier pair given a particular discretizer. This is important because the learning process can fail depending on how a response is discretized. Failure in this sense can be caused either by the inability to discretize a response beyond a bin size of 1, or by discretization succeeding but producing a bin in which all values are identical, which causes some learners to fail. It is also worth noting that we did not perform experiments for base regressors which only support numeric input variables when the dataset under consideration has nominal variables with more than two classes. For example, we did not build a LS regressor for the fuelCons dataset in the Branco collection (see Table 11 in “Appendix A”). Paired t-tests showed that there is a significant difference in performance between the regressor pairs in the base case for some dataset collections but not for others (see Tables 3 and 4).

Table 1 Base case mean predictive performance (\(R^2\)) of the regressors considered in our evaluation, for each data collection, where the discretization approach is frequency
Table 2 Base case mean predictive performance (\(R^2\)) of the regressors considered in our evaluation, for each data collection, where the discretization approach is clustering
Table 3 P-values from paired t-tests for each unique regressor pair for the data subsets (including the number of experiments) we considered when the discretization approach is frequency
Table 4 P-values from paired t-tests for each unique regressor pair for the data subsets (including the number of experiments) we considered when the discretization approach is clustering

5.2 Classifier performance

We observed that the best performing classifier, when paired with a particular regressor, is largely dependent on the discretization approach and varies by dataset collection. For example, RF is the best performing classifier on the Branco dataset collection for all regressors when frequency is the discretization approach. This is in contrast to when clustering is used for discretization, in which case KNN is the best classifier for LS and RR, RF is the best classifier for the RF regressor, and NB and DT perform equally well when paired with XGB (see Tables 5 and 6). Given these results, we argue that for a new, unseen problem one would also need to select the best performing classifier to pair with a given regressor. This can be done using any standard model selection approach.

Table 5 Mean predictive performance (\(R^2\)) of the better of the best individual bin size and the weighted case, for each regressor-classifier pair considered, when the discretization approach is frequency
Table 6 Mean predictive performance (\(R^2\)) of the better of the best individual bin size and the weighted case, for each regressor-classifier pair considered, when the discretization approach is clustering

5.3 Discretizer performance

Our results suggest that the effect of the chosen discretizer is dependent on the response and the regressor-classifier pair. We only see a statistically significant difference in performance between the frequency and clustering discretizers in 13 of the 80 regressor-classifier pairs across the dataset collections we considered. However, the choice of discretizer might be worth optimising for a new problem, as the analysis for the gene expression dataset collection indicates significant effects on performance (see Table 7).

Table 7 The percentage of regressor-classifier pairs for which the considered discretization approaches (frequency/clustering) strictly outperforms the other

5.4 Algorithm performance

We compared the base regressor performance to that of the best performing regressor-classifier pair for the frequency and clustering discretizers. Note that the value for each regressor-classifier pair here is the better of the best performing single bin size and the weighted case. For example, assume a predicted response and LS as the base regressor. We compared the base LS performance to the best performing pair out of LS-NB, LS-RF, LS-DT, and LS-KNN, where the value for each of these pairs is the better of the best performing single bin size and the weighted case. There were a total of 853 performance data points. Our approach performed as well as the base case on 134 and strictly outperformed it on 709. The base case never outperforms our proposed approach. For the cases in which our proposed approach outperformed the base case, the average increase in predictive performance (\(R^2\)) is \(0.04 \pm 0.06\) (an average performance increase of \(\sim\)35%). We performed significance testing using paired t-tests for each individual regressor given the two discretizers we considered. This analysis showed that the difference in performance is statistically significant for all four regressors at a significance level of 0.01 (see Table 8 and Fig. 3). The most marked improvement in performance is observed for the two linear methods, LS and RR, for both discretization approaches, whilst the performance of RF and XGB is very similar for the base case and the best regressor-classifier pair. These results suggest that the proposed approach (a) is capable of dealing with the problem of imbalanced distributions in a target response, and (b) is highly unlikely to perform worse than the base case.

Table 8 P-values from paired t-test significance testing of the difference in performance between the regressor base case and the best performing regressor-classifier pair (best of best bin and weighted). n is the number of samples used in the significance tests
Fig. 3

Comparison of performance (\(R^2\)) of the base case and the best performing regressor-classifier pair for each regressor and discretization approach, across all datasets. Dashed red line is the \(x=y\) line

5.5 Comparison to resampling approaches

We compared the best performing resamplers (SmoteR and undersampler) to the best performing regressor-classifier pair for the frequency and clustering discretizers. As in the previous section, the value for each regressor-classifier pair is the better of the best performing single bin size and the weighted case. We observed that for all combinations of the resamplers, discretizers, and dataset collections, our approach outperforms the resampling methods (see Table 9 and Fig. 4). Paired t-tests show that these differences in performance are statistically significant at a significance level of 0.01 (see Table 10).

Table 9 Comparison of the resampling approaches (SmoteR and undersampler) to the discretizers (frequency and clustering) used in the evaluation of our proposed methods across the considered collections of datasets (B—Branco, G—Gene expression, O—OpenML, Q—QSAR, Y—Yeast)
Fig. 4

Comparison of performance (\(R^2\)) of the two resampling methods and the best performing regressor-classifier pair for each regressor and discretization approach, across all datasets. Dashed black line is the \(x=y\) line

Table 10 P-values from paired t-test significance testing of the difference in performance between the resampling methods and the best performing regressor-classifier pair (best of best bin and weighted) for each discretizer across the considered datasets on a per regressor basis

6 Discussion and conclusion

Although we have demonstrated the utility of the proposed algorithm, we conjecture that modifications to three main components could further improve the predictive accuracy of its output at some additional computational cost. These components are the method by which class imbalance is handled, the type of discretizer, and how weighting is performed in the internal stacking procedure. In our evaluation, we used simple upsampling to deal with class imbalance; however, we expect that a more sophisticated but more computationally expensive approach such as SMOTE (Chawla et al., 2002) would perform better. The reasoning is that building the classifiers with higher quality samples should improve their classification accuracy, which should in turn improve the predictive accuracy of the overall system. For the discretizers, we only considered unsupervised discretizers because of their negligible computational cost; however, we also expect an overall performance improvement of the proposed algorithm with supervised discretizers. We think that supervised discretizers would produce cleaner delineations between the different bins, thus reducing the amount of noise between bin regions and improving the predictive accuracy of the classifiers. One might also choose to use a bespoke discretization approach tailored to a given problem using domain knowledge, which may outperform the mainstream supervised and unsupervised discretization techniques. Lastly, rather than the static weighting scheme we used for aggregating the predictions from the different bin sizes, it is possible that a dynamic weighting approach might perform better. One might argue that, depending on the choice of internal components (discretizer, class balancing method, regressor, classifier, etc.), the computational cost (in money and time) of the proposed approach might be limiting. However, we argue that this is not the case, given that compute costs are low and steadily declining, and that certain aspects of the algorithm can be parallelised, which should dramatically reduce computation time.

In summary, we have presented an extension to the federated ensemble regression using classification algorithm. Our evaluation shows that it is highly unlikely to perform worse than the base case, and it outperforms state-of-the-art resampling approaches intended to address distribution imbalance in target responses for regression problems.