1 Introduction

Class imbalance (Ali et al., 2015; Japkowicz & Stephen, 2002) is a primary concern when building classifiers, as the under-representation of one class can lead to sub-optimal predictive accuracy on future samples. In our previous work (Orhobor et al., 2020), we argued that the same can be said for the distribution of the response one is interested in modelling in regression problems. The idea is that, for a given response, certain parts of the distribution might be under-represented, making it difficult for a model built on this data to make accurate predictions for unseen samples that fall in the under-represented regions. To this end, we proposed an ensemble learning approach which takes the underlying response distribution into account.

Assume we have some learning data, a discretizer of choice, and a bin size n. For the base case, where \(n = 1\), we build a single regressor. For every other value of n, where \(n > 1\) and up to some maximum bin size, we split the learning data into n bins using the discretizer. We then build a regressor on each of the n resulting subsets (one per bin), and a classifier which predicts the probability that a new sample belongs to each of the bins. This classifier is used to combine the predictions from the independent regressors for an unseen sample. Given unseen data, the regression models produce n predictions for each bin size n, one per bin, which are combined using the bin probabilities provided by the corresponding classifier. Finally, these combined predictions are aggregated using weights determined on the learning set. The intuition is that this process should leverage the predictive accuracy of both the base and the binned models, greatly reducing the possibility of performing worse than the base case irrespective of the choice of discretizer, which improves upon our previous work (Orhobor et al., 2020). The proposed algorithm is illustrated in Fig. 1 and formalised in Sect. 3. We demonstrate empirically that it is highly unlikely to perform worse than the base case.
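Written out explicitly (in the notation of Fig. 1, where \(R_i^j\) denotes the regressor for the jth bin at bin size i and \(p_i^j(x)\) the probability, estimated by classifier \(C_i\), that sample x falls in that bin), the prediction at bin size i can be taken as the probability-weighted sum \(\hat{y}_i(x) = \sum _{j=1}^{i} p_i^j(x)\, R_i^j(x)\), with \(\hat{y}_1(x) = R_1^1(x)\) as the base case. The final output is then \(\hat{y}(x) = \sum _{i=1}^{s} w_i\, \hat{y}_i(x)\), where s is the maximum bin size and the weights \(w_i\) are learned on the learning set (in our evaluation they are constrained to be non-negative and to sum to 1; see Sect. 4.2).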

Fig. 1

Representation of the proposed approach when the maximum bin size is 3. Note that the weights (\(w_1, w_2, w_3\)) are learned by stacking, using cross-validation on the learning set. \(R_i^j\) represents the regressor for the jth bin when the bin size is i. \(p_i^j\) is the probability predicted by the classifier \(C_i\) that a new sample belongs to the jth bin when the bin size is i. In the diagram, \(R_1^1\) is the base case regressor and does not have a classifier

In the next section, we discuss the techniques that make up the different parts of the proposed extension and the algorithm in general. In Sect. 3 we describe the new algorithm in detail. The evaluation setup is described in Sect. 4 and the results are presented in Sect. 5. We discuss possible improvements to the proposed algorithm and conclude in Sect. 6.

2 Background

Ensemble learning, that is, the aggregation of the predictions made by multiple learning algorithms, plays a crucial role in both the original algorithm and the extension we propose. Specifically, we use ensembles in (i) aggregating the predictions made within a given bin size that is not the base case (bin size = 1), and (ii) learning weights to combine the predictions made for each bin size. In the former, there is one classifier and n regressors for each bin size n greater than 1. For a new sample, the regressors are used to predict the response, while the classifier is used to determine how much we can trust those predictions, weighting them accordingly (see Fig. 1). This differs from a more traditional ensemble learning procedure such as stacking, in which meta-features are generated and a meta-learner is built on them to aggregate the predictions. The scheme is more closely related to the concept of local classifiers in hierarchical classification (Silla & Freitas, 2011). Furthermore, it can be thought of as a form of dynamic weighting, where instead of a separate meta-feature generation and aggregation step (Mendes-Moreira et al., 2012), the dynamic weighting is implicit in the classifier-regressor aggregation paradigm.

We use a traditional stacking procedure to learn the weights that combine the predictions for each bin size (from the base case, where the bin size = 1, to the determined maximum bin size). We generate a meta-feature for each bin size using cross-validation and determine the aggregating weights using a meta-learner. During the testing phase, these weights are applied uniformly to all samples in a test set, making it a static weighting procedure, in contrast to dynamic weighting, where a unique weight is learned for each individual test sample. This approach was chosen to reduce the overall computational complexity of the extended algorithm, although it is possible that a dynamic weighting paradigm might improve predictive accuracy. We do not prune the meta-feature space, which is one of the main processes in a stacking procedure (Mendes-Moreira et al., 2012), because the meta-feature space is not expected to be large enough to require pruning.
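As a minimal illustration (not the exact code used in our experiments), convex combining weights can be fitted to the cross-validated meta-features with a quadratic programming solver such as the one in the quadprog R package; the function and variable names below are our own.

```r
# Minimal sketch: fit convex combining weights w (w >= 0, sum(w) = 1) to
# cross-validated meta-features. Z is an n x s matrix of out-of-fold
# predictions, one column per bin size; y is the length-n response vector.
library(quadprog)

learn_convex_weights <- function(Z, y) {
  s    <- ncol(Z)
  Dmat <- crossprod(Z)            # t(Z) %*% Z; add a small ridge if singular
  dvec <- drop(crossprod(Z, y))   # t(Z) %*% y
  # Constraint columns: sum(w) = 1 (equality), then w_i >= 0 for each i
  Amat <- cbind(rep(1, s), diag(s))
  bvec <- c(1, rep(0, s))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}

# At test time the aggregated prediction is simply preds %*% w, where preds
# holds the per-bin-size predictions for the test samples.
```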

Response discretization is a fundamental part of our algorithm, for which both supervised and unsupervised methods have previously been proposed (Dash et al., 2011; Dougherty et al., 1995). Although both types of approach are compatible with what we propose, we considered only unsupervised techniques in our evaluation to reduce the computational cost of our experiments. However, one can conjecture that if supervised methods produce cleaner delineations between the different bins for a given response, they could improve overall predictive accuracy, as the class data on which the classifiers are built would be less noisy. There are other methods which also perform regression by means of classification in an ensemble setting—the work by Ahmad et al. (2018), Halawani et al. (2012) and Gonzalez et al. (2012), for example, is most closely related to what we propose. Ahmad et al. focus on extreme randomized discretization, and use the minimum or maximum of the training data points and the bin boundary to estimate the prediction for a new sample, in contrast to our multi-bin and classifier-regressor paradigm. The approach by Gonzalez et al. focuses on multi-variate spatio-temporal data and differs from what we propose in two important ways: (a) the authors are interested in classifying bands of attributes prior to performing regression, and (b) the final prediction for an unseen sample is obtained by taking the median of the predictions made by the best models, which are selected using cross-validation.
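For concreteness, the following are minimal sketches of the two unsupervised discretizers used later in our evaluation (Sect. 4.2); the function names are our own and error handling is largely omitted.

```r
# (i) Frequency (equal-frequency) intervals: boundaries at empirical quantiles.
discretize_frequency <- function(y, n) {
  breaks <- quantile(y, probs = seq(0, 1, length.out = n + 1))
  # Duplicated breaks collapse bins, so fewer than n bins may be returned
  # (the feasibility issue handled in Sect. 3.1).
  cut(y, breaks = unique(breaks), include.lowest = TRUE, labels = FALSE)
}

# (ii) Clustering: one-dimensional k-means on the response values.
discretize_clustering <- function(y, n) {
  kmeans(y, centers = n)$cluster   # cluster labels serve as (unordered) bin ids
}

# Example: bins <- discretize_frequency(y, 3) assigns each sample to a bin in 1..3
```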

An alternative approach to treating imbalanced data is to redress the ratio of “rare” and “common” values in a dataset by strategic resampling. Inspired by the analogous problem of imbalanced classes in classification, Torgo et al. (2015) proposed two resampling strategies for continuous outcomes in regression analysis. Both methods rely on the notion of relevance (Torgo & Ribeiro, 2007), a function mapping the output variable domain to [0, 1]. Relevance is inversely proportional to the probability density of the output variable, with rarer observations having higher relevance and more common observations having lower relevance. Choosing a relevance threshold allows one to divide the dataset into two subsets: “common” observations and “rare” observations. Torgo et al. propose two schemes: (i) undersampling of common observations (undersampler) and (ii) undersampling of common observations combined with oversampling of rare observations (SmoteR). Both schemes attempt to create a more balanced dataset, allowing for a better representation of values in low probability density areas. Undersampling consists of randomly selecting a subset of common observations of a pre-specified size. In oversampling, additional “synthetic” rare observations are created by interpolation between an existing rare case and one of its k nearest neighbours. Any regressor can then be applied to the new, re-balanced dataset.
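The following is a minimal, illustrative sketch of the SmoteR oversampling idea, not the uba package implementation used in our comparison; here the response of a synthetic case is a simple convex combination of the two parent responses, whereas the original method uses a distance-weighted average.

```r
# rare_x is a numeric matrix of observations flagged as rare by the relevance
# function, rare_y the corresponding responses; n_new synthetic cases are made.
smoter_oversample <- function(rare_x, rare_y, k = 5, n_new = 100) {
  synth_x <- matrix(NA_real_, n_new, ncol(rare_x))
  synth_y <- numeric(n_new)
  seeds   <- sample.int(nrow(rare_x), n_new, replace = TRUE)
  for (i in seq_len(n_new)) {
    a  <- seeds[i]
    d  <- sqrt(colSums((t(rare_x) - rare_x[a, ])^2))  # distances to all rare cases
    nn <- order(d)[2:(k + 1)]                         # k nearest, excluding itself
    b  <- nn[sample.int(k, 1)]
    lambda <- runif(1)
    synth_x[i, ] <- rare_x[a, ] + lambda * (rare_x[b, ] - rare_x[a, ])
    # Simple convex combination for the response (the original method uses a
    # distance-weighted average of the two parent responses).
    synth_y[i] <- (1 - lambda) * rare_y[a] + lambda * rare_y[b]
  }
  list(x = synth_x, y = synth_y)
}
```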

Both undersampling and SmoteR are geared towards datasets with rare values concentrated in the tail(s) of the distribution. For such distributions, the relevance function is easily calculated and has one or two “relevance bumps”—areas of high relevance (local maxima). However, for more sophisticated distributions (e.g. bimodal), the user has to provide bespoke parameters for the construction of the relevance function. This is in contrast to our method, where the binning of the data is automated via a chosen discretization method. We note that these resampling methods can also be used as part of our proposed framework (see Sect. 6).

3 Methodology

3.1 Algorithm

The original algorithm is described formally in Algorithm 1 and the extension in Algorithm 2. We also provide an informal description of the extended algorithm below and a descriptive figure (see Fig. 2). Like the original algorithm, the proposed extension is split into a build and a prediction phase. In the build phase, the required inputs are a learning set with a continuous response one is interested in modelling, a specified maximum bin size, an unsupervised discretizer, and classification and regression algorithms of one’s choosing. The following steps are performed:

  1.

    Discretize the learning set into n bins, where n runs from 1 to the maximum bin size. Note that \(n=1\) is the base case and so no discretization is performed. For example, for a maximum bin size of 3, we end up with three response sets: set 1—base case, set 2—learning set discretized into 2 bins, and set 3—learning set discretized into 3 bins (see Fig. 1).

  2.

    Build a regressor for the base case. For response sets 2 up to the specified maximum bin size, build a single classifier which predicts which bin a new sample belongs to, and build a regressor for each bin. Therefore, a single classifier and two regressors are built for response set 2, a single classifier and three regressors are built for response set 3, and so on.

  3.

    Learn the weights that will be used to combine the predictions from the different response sets using the standard stacking procedure. That is, for each response set, generate a meta-feature using the previous two steps and learn combining weights that will be applied to the predictions made for unseen samples. Using these meta-features, also identify the single best performing bin size, which may be the base case. A condensed code sketch of the build phase is given after this list.
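The following condensed sketch (in R) illustrates steps 1 and 2 of the build phase. Here fit_regressor() and fit_classifier() are placeholders for whichever learning algorithms are chosen, discretize() is one of the discretizers sketched in Sect. 2, and upsampling of the bin classes and hyperparameter tuning are omitted for brevity.

```r
build_models <- function(x, y, max_bins, discretize) {
  models <- vector("list", max_bins)
  # Response set 1: the base case, a single regressor and no classifier.
  models[[1]] <- list(regressors = list(fit_regressor(x, y)), classifier = NULL)
  for (n in 2:max_bins) {
    bins       <- discretize(y, n)                 # bin label for every sample
    classifier <- fit_classifier(x, factor(bins))  # predicts bin membership
    regressors <- lapply(seq_len(n), function(j)
      fit_regressor(x[bins == j, , drop = FALSE], y[bins == j]))
    models[[n]] <- list(regressors = regressors, classifier = classifier)
  }
  models
}
```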

It is worth noting that, depending on the choice of discretizer and response, one can specify a maximum bin size that is too large, such that the discretizer fails to produce the specified number of bins. In this case, the algorithm can be implemented so that it automatically detects the maximum achievable bin size before any models are built. This can be achieved by splitting the learning set response into random cross-validation folds, then holding out each fold and attempting to discretize the remaining values at the specified maximum bin size. The minimum discretization size achieved during this cross-validation procedure is chosen as the new maximum. This is how we implemented the algorithm in our evaluation. Given a new sample, the following is performed in the prediction phase:

  1.

    Make predictions for the base case using the base case regressor. Then, for response sets 2 up to the maximum bin size, make predictions using the regressors and aggregate them using the bin probabilities from the paired classifier.

  2.

    After the previous step, one is left with a number of predictions equal to the maximum bin size. These predictions are then aggregated using the weights from step 3 of the build phase. The single best performing bin size can also be identified using the results from that step. A code sketch of this prediction phase, matching the build-phase sketch above, follows this list.
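In the sketch below, predict_reg() and predict_probs() are placeholders for the predict methods of the chosen learners, w is the vector of convex weights learned in step 3 of the build phase, and the per-bin-size combination is a probability-weighted sum of the per-bin regressor predictions, as described in Sect. 2.

```r
predict_extended <- function(models, w, x_new) {
  s <- length(models)
  per_bin_size <- sapply(seq_len(s), function(n) {
    if (n == 1)
      return(predict_reg(models[[1]]$regressors[[1]], x_new))    # base case
    probs <- predict_probs(models[[n]]$classifier, x_new)        # n columns
    preds <- sapply(models[[n]]$regressors, predict_reg, x_new)  # n columns
    rowSums(probs * preds)          # probability-weighted combination per bin size
  })
  drop(per_bin_size %*% w)          # weighted aggregation across bin sizes
}
```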

Fig. 2

Data-centric representation of the proposed approach where the max bin size equals 2

3.2 Considerations

A major consideration is practicality in terms of the computational expense of building the models. Suppose one has a response of interest and let s be the determined maximum bin size. This means that \(s-1\) classifiers and \(\sum _{l=1}^{s}l\) regressors will be built, making a total of \((s-1) + \sum _{l=1}^{s}l\sim s^2\) models, excluding any additional models built to determine the best hyperparameters for the chosen learning algorithms. Given the k cross-validation folds required to learn the combining weights, the total number of models built during the learning phase is \(k((s-1) + \sum _{l=1}^{s}l) + (s-1) + \sum _{l=1}^{s}l \sim ks^2\), or \(\mathcal {O}(s^2)\), which is more expensive than the 2s, or \(\mathcal {O}(s)\), models built in the original algorithm. The obvious benefits of this additional computational cost are that (i) one need not know the best performing bin size a priori, and can specify an arbitrarily large value for the maximum bin size, which the algorithm will automatically reduce to the largest viable value, and (ii) between the weighted prediction and the identified best bin, one is guaranteed to never do significantly worse than the base case.
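For illustration, with the settings used in our evaluation (\(s = 10\) and \(k = 5\); see Sect. 4.2), this amounts to \((s-1) + \sum _{l=1}^{s}l = 9 + 55 = 64\) models per pass over the learning data, and \(5 \times 64 + 64 = 384\) models in total, compared with \(2s = 20\) for the original algorithm (in both cases excluding any hyperparameter tuning).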

Implicit in this implementation is the assumption that one will use upsampling to resolve class imbalance across the bins. Empirical evidence from our previous work supports this choice: we found that upsampling outperformed all other methods for handling class imbalance in this setting (Orhobor et al., 2020). However, replacing it with one’s preferred method is trivial. The only other major decision left to the discretion of the user is the choice of discretizer, which can be either supervised or unsupervised.
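As a minimal illustration (a sketch rather than the exact code used in our experiments), simple upsampling can be implemented by sampling each bin's rows with replacement up to the size of the largest bin before fitting the classifier; fit_classifier is the same placeholder as in the build-phase sketch above.

```r
upsample_bins <- function(bins) {
  target <- max(table(bins))
  unlist(lapply(unique(bins), function(b) {
    rows  <- which(bins == b)
    # Draw extra rows with replacement until this bin matches the largest bin
    extra <- sample.int(length(rows), target - length(rows), replace = TRUE)
    c(rows, rows[extra])
  }))
}

# Usage: idx <- upsample_bins(bins); fit_classifier(x[idx, ], factor(bins[idx]))
```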

Algorithm 1: the original algorithm.
Algorithm 2: the proposed extension.

4 Evaluation setup

4.1 Datasets

We evaluated the proposed algorithm and the two resampling methods using 119 responses from a variety of datasets. Specifically, we considered the 20 genes from the LINCS Phase II dataset (Koleti & Terryn, 2017) which we used in the evaluation of the original algorithm (Algorithm 1), 46 traits from the yeast dataset used by Grinberg et al. (2020), 24 drug targets from quantitative structure-activity relationship problems (Olier et al., 2018), 14 responses from the OpenML AutoML Regression Benchmark (Bischl et al., 2017), and 15 responses from work by Branco et al. (2019). We refer to these dataset collections as gene expression, yeast, QSAR, OpenML, and Branco, respectively, during the discussion of our evaluation in Sect. 5. The exact versions of these datasets used in our experiments are available for download from http://dx.doi.org/10.17632/mpvwnhv4vb.2. We used a 70–30% learning and test set split in our evaluation. See Tables in "Appendix A" for the distribution of imbalance in the considered responses as described by Branco et al. (2019).

4.2 Learning setup

We considered the unsupervised frequency interval and clustering (k-means) discretizers (Dougherty et al., 1995). The specified maximum bin size was 10, and 5-fold cross-validation was used in the internal stacking procedure to learn the weights and identify the best performing individual bin size on the learning set. We used convex linear regression as our weight learning algorithm, which is multiple linear regression with the constraint that its coefficients are non-negative and sum to 1 (Breiman, 1996). For the classifiers, we used random forests (RF) (Breiman, 2001), naive Bayes (NB) (Rish, 2001), decision trees (DT) (Quinlan, 1986), and k-nearest neighbours (KNN) (Altman, 1992). For the regressors, we used the least absolute shrinkage and selection operator (LS) (Tibshirani, 1996), ridge regression (RR) (Hoerl & Kennard, 1970), RF, and eXtreme gradient boosting (XGB) (Chen & Guestrin, 2016). The RF models were built with 1000 trees, a third of the number of attributes was used as the number of splitting variables considered at each node, and the defaults were used for everything else. The KNN models were tuned for the k parameter (the number of neighbours) over the values (1, 3, 5, 7, 11, 13, 15). The regularization parameter for both LS and RR was tuned using 10-fold internal cross-validation. Lastly, the learning rate of the XGB models was tuned over (0.001, 0.01, 0.1, 0.2, 0.3) and the number of rounds was set to 1500. Default values were used for all other parameters. Note that, of the four regression models, only RF is capable of handling non-binary categorical input variables; hence, LS, RR and XGB were not applied to datasets which included such inputs. Predictive performance was measured as the coefficient of determination (\(R^2\)). All experiments were performed in R (Ihaka & Gentleman, 1996), and the code used in our experiments is available at https://github.com/oghenejokpeme/EFERUC. All our results are available for independent analysis at http://dx.doi.org/10.17632/mpvwnhv4vb.2.

4.3 SmoteR and undersampler

We compared our proposed method with two resampling approaches from Torgo et al. (2015), SmoteR and undersampler. For SmoteR, we used a relevance threshold of 0.7, \(k=5\), and all combinations of undersampling rates of \((50, 100, 200, 300)\%\) and oversampling rates of \((200, 500)\%\). For undersampler, we used a relevance threshold of 0.7 and resampling rates of \((50, 100, 200, 300)\%\). These parameter combinations were used in the original paper (Torgo et al., 2015). We applied the four regression methods (LS, RR, RF and XGB), tuned as detailed above, to each resampled dataset and report the \(R^2\) on the test set for the best performing combination of resampling parameters. For the majority of outputs, the relevance function was generated automatically (the phi function in the uba package in R) with either one or two relevance bumps corresponding to the tail(s) of the distribution. However, some outputs required bespoke relevance functions, with areas of importance identified through inspection of the output histograms. We note that the undersampling algorithm and SmoteR failed in instances where no extreme values were identified by the relevance function, and in some rare cases where a dataset contained discrete or categorical input variables with very low-frequency values.

5 Results

5.1 Base regressor performance

We found that the overall best performing regressor in the base case, for both the frequency and clustering based discretization approaches, was XGB, with RF a close second (see Tables 1 and 2). One might ask why we have separated the base case results by discretizer, given that the results should be the same either way. This is because we only included base case results where the proposed approach could be successfully executed for at least one regressor-classifier pair given a particular discretizer. This is important because the learning process can fail depending on how a response is discretized. Failure in this sense can be caused either by the inability to discretize a response beyond a bin size of 1, or by discretization succeeding but producing a bin in which all values are identical, which causes some learners to fail. It is also worth noting that we did not perform experiments for base regressors which only support numeric input variables when the dataset under consideration has nominal variables with more than two classes. For example, we did not build a LS regressor for the fuelCons dataset in the Branco collection (see Table 11 in “Appendix A”). Paired t-tests showed that there is a significant difference in performance between the regressor pairs in the base case for some dataset collections but not for others (see Tables 3 and 4).

Table 1 Base case mean predictive performance (\(R^2\)) of the regressors considered in our evaluation, for each data collection, where the discretization approach is frequency
Table 2 Base case mean predictive performance (\(R^2\)) of the regressors considered in our evaluation, for each data collection, where the discretization approach is clustering
Table 3 P-values from paired t-tests for each unique regressor pair for the data subsets (including the number of experiments) we considered when the discretization approach is frequency
Table 4 P-values from paired t-tests for each unique regressor pair for the data subsets (including the number of experiments) we considered when the discretization approach is clustering

5.2 Classifier performance

We observed that the best performing classifier, when paired with a particular regressor, is largely dependent on the discretization approach and varies by dataset collection. For example, RF is the best performing classifier on the Branco dataset collection for all regressors when frequency is the discretization approach. This is in contrast to when clustering is used for discretization, in which case KNN is the best classifier for LS and RR, RF is the best classifier for the RF regressor, and NB and DT perform equally well when paired with XGB (see Tables 5 and 6). Given these results, we argue that for a new, unseen problem one would also need to select the best performing classifier to pair with a given regressor. This can be done using any standard model selection approach.

Table 5 Mean predictive performance (\(R^2\)) of the better of the best individual bin size and the weighted case, for each regressor-classifier pair considered, when the discretization approach is frequency
Table 6 Mean predictive performance (\(R^2\)) of the better of the best individual bin size and the weighted case, for each regressor-classifier pair considered, when the discretization approach is clustering

5.3 Discretizer performance

Our results suggest that the effect of the chosen discretizer is dependent on the response and the regressor-classifier pair. We only see a statistically significant difference in performance between the frequency and clustering discretizers in 13 of the 80 regressor-classifier pairs across the dataset collections we considered. However, the choice of discretizer might be worth optimising for a new problem, as the analysis for the gene expression dataset collection indicates significant effects on performance (see Table 7).

Table 7 The percentage of regressor-classifier pairs for which the considered discretization approaches (frequency/clustering) strictly outperforms the other

5.4 Algorithm performance

We compared the base regressor performance to that of the best performing regressor-classifier pair for the frequency and clustering discretizers. Note that the value for each regressor-classifier pair here is the better of the best performing single bin size and the weighted case. For example, assume a predicted response and LS as the base regressor. We compared the base LS performance to the best performing pair out of LS-NB, LS-RF, LS-DT, and LS-KNN, where the value for each of these pairs is the better of the best performing single bin size and the weighted case. There were a total of 853 performance data points. Our approach performed as well as the base case on 134 and strictly outperformed it on 709. The base case never outperforms our proposed approach. For the cases in which our proposed approach outperformed the base case, the average increase in predictive performance (\(R^2\)) is \(0.04 \pm 0.06\) (an average performance increase of \(\sim\)35%). We performed significance testing using paired t-tests for each individual regressor given the two discretizers we considered. This analysis showed that the difference in performance is statistically significant for all four regressors at a significance level of 0.01 (see Table 8 and Fig. 3). The most marked improvement in performance is observed for the two linear methods, LS and RR, for both discretization approaches, whilst the performance of RF and XGB is very similar for the base case and the best regressor-classifier pair. These results suggest that the proposed approach (a) is capable of dealing with the problem of imbalanced distributions in a target response, and (b) is highly unlikely to perform worse than the base case.

Table 8 P-values from paired t-test significance testing of the difference in performance between the regressor base case and the best performing regressor-classifier pair (best of best bin and weighted). n is the number of samples used in the significance tests
Fig. 3

Comparison of performance (\(R^2\)) of the base case and the best performing regressor-classifier pair for each regressor and discretization approach, across all datasets. Dashed red line is the \(x=y\) line

5.5 Comparison to resampling approaches

We compared the best performing resamplers (SmoteR and undersampler) to the best performing regressor-classifier pair for the frequency and clustering discretizers. As in the previous section, the value for each regressor-classifier pair is the better of the best performing single bin size and the weighted case. We observed that for all combinations of the resamplers, discretizers, and dataset collections, our approach outperforms the resampling methods (see Table 9 and Fig. 4). Paired t-tests show that these differences in performance are statistically significant at a significance level of 0.01 (see Table 10).

Table 9 Comparison of the resampling approaches (SmoteR and undersampler) to the discretizers (frequency and clustering) used in the evaluation of our proposed methods across the considered collections of datasets (B—Branco, G—Gene expression, O—OpenML, Q—QSAR, Y—Yeast)
Fig. 4

Comparison of performance (\(R^2\)) of the two resampling methods and the best performing regressor-classifier pair for each regressor and discretization approach, across all datasets. Dashed black line is the \(x=y\) line

Table 10 P-values from paired t-test significance testing of the difference in performance between the resampling methods and the best performing regressor-classifier pair (best of best bin and weighted) for each discretizer across the considered datasets on a per regressor basis

6 Discussion and conclusion

Although we have demonstrated the utility of the proposed algorithm, we conjecture that modifications to three main components could further improve the predictive accuracy of its output at some additional computational cost. These components are the method by which class imbalance is handled, the type of discretizer, and how weighting is performed in the internal stacking procedure. In our evaluation, we used simple upsampling to deal with class imbalance; however, we expect that a more sophisticated but more computationally expensive approach such as SMOTE (Chawla et al., 2002) would perform better. The reasoning is that building the classifiers with higher quality samples should improve their classification accuracy, which should in turn improve the predictive accuracy of the overall system. For the discretizers, we only considered unsupervised discretizers because of their negligible computational cost; however, we also expect an overall performance improvement of the proposed algorithm with supervised discretizers. We think that supervised discretizers would produce cleaner delineations between the different bins, thus reducing the amount of noise between bin regions and improving the predictive accuracy of the classifiers. One might also choose to use a bespoke discretization approach tailored to a given problem using domain knowledge, which may outperform the mainstream supervised and unsupervised discretization techniques. Lastly, rather than the static weighting scheme we used for aggregating the predictions from the different bin sizes, it is possible that a dynamic weighting approach might perform better. One might argue that, depending on the choice of internal components (discretizer, class balancing method, regressor, classifier, etc.), the computational cost (in money and time) of the proposed approach might be limiting. However, we argue that this is not the case, given that compute costs are low and steadily declining, and that certain aspects of the algorithm can be parallelised, which should dramatically reduce computation time.

In summary, we have presented an extension to the federated ensemble regression using classification algorithm. Our evaluation shows that it is highly unlikely to perform worse than the base case, and it outperforms state-of-the-art resampling approaches intended to address distribution imbalance in target responses for regression problems.