1 Introduction

The task of supervised machine learning is, given a set of recorded observations and their outcomes, to predict the outcome of new observations. Standard classification techniques aim for the highest overall accuracy or, equivalently, for the smallest total error; they include, among others, support vector machines, Bayesian classifiers, logistic regression, decision tree classifiers such as CART (Breiman et al. 1984) and C4.5 (Quinlan 1993), and ensemble methods which build several classifiers and aggregate their predictions, such as Bagging (Breiman 1996), AdaBoost (Freund and Schapire 1997) and Random Forests (Breiman 2001).

Of particular interest in certain domains are binary classifiers, which deal with cases where only two classes of outcomes are considered, such as fraudulent and legitimate credit card transactions, respondents and non-respondents to a marketing campaign, patients with and without cancer, intrusive and authorised network access, and defaulting and repaying debtors, to name a few. In most of these cases, one of the classes is a small minority, and consequently traditional classifiers might classify all of its members as belonging to the majority class without any significant loss in overall accuracy. The severity of this class imbalance becomes more noticeable when failing to correctly predict a minority class member is more costly than doing so with a member of the majority class, as is often the case.

A remedy to the undesirable situation just described is offered by classifiers which take misclassification costs into account instead of accuracy and are thus termed cost-sensitive. We illustrate this idea in the credit card fraud detection setting: accepting a fraudulent transaction as legitimate incurs a cost equal to its amount. Conversely, requiring an additional security check for a transaction (such as contacting the card owner) incurs an overhead cost. The job of a cost-sensitive classifier is to find a cost-minimising balance between overhead costs and fraud costs.

1.1 Related work

Research interest in cost-sensitive learning surged in the mid-nineties and spanned more than a decade, including among others (Knoll et al. 1994; Pazzani et al. 1994; Bradford et al. 1998; Ting 1998; Ting and Zheng 1998b; Domingos 1999; Fan et al. 1999; Ting 2000b; Turney 2000; Elkan 2001; Zadrozny and Elkan 2001a; Viola and Jones 2001; Ting 2002; Zadrozny et al. 2003; Ling et al. 2004; Sheng and Ling 2006; Sun et al. 2007). In recent years, a renewed interest has been observed (Coussement 2014; Bahnsen et al. 2015; Nikolaou et al. 2016; Xia et al. 2017; Choi et al. 2017; Petrides et al. 2020; Lawrance et al. 2021), partly attributed to practitioners starting to realise the potential of using cost-sensitive models for their businesses by considering real-world monetary costs.

The most recent articles providing an overview of these methods differ in the fraction of the classifier spectrum they cover. Often, cost-sensitive learning is reviewed within surveys on learning on imbalanced datasets as an approach towards treating class imbalance in any domain by artificially introducing costs (He and Garcia 2009; Prati et al. 2009; Sun et al. 2009; Galar et al. 2011). Employing a cost minimisation point of view, Lomax and Vadera (2013) review cost-sensitive classifiers based on decision trees. Overall, cost-sensitive boosting methods receive more attention than other methods such as weighting, altered decisions and cost-sensitive node splitting.

1.2 Our contribution

Our primary contribution in this article is a unifying framework of binary ensemble classifiers that, by design or after slight modification, are cost-sensitive with respect to misclassification costs. It is presented in terms of combinable components that are either directly extracted from the existing literature or obtained indirectly via natural extensions and generalisations we have identified. A notable example of such an extension concerns the ways in which costs can influence the aggregation of the outputs of the individual models in any ensemble, as done in AdaBoost (Sect. 3.4.3). As such, our work goes one step further than being a mere survey. The advantages of our approach include that

  (a) by abstracting the core ideas behind each classifier, we are able to provide generic descriptions that allow for a fine-grained categorisation with respect to the way costs influence the final decision,
  (b) it makes the similarities and differences between methods easier to recognise (see for example Table 1 and the equivalence proven in Theorem 1),
  (c) it clearly indicates the types of costs (constant or record-dependent) that are applicable for each method,
  (d) combining the framework components in all possible ways not only yields all methods known to date, but also some not previously considered (see for example Tables 2 and 3),
  (e) the framework components are generic enough to be instantiated with different classifiers, including Random Forests (see for example Table 2), and
  (f) it highlights research directions that can lead to new cost-sensitive methods (see Sect. 5).

1.3 Outline

We give a brief introduction to decision tree classifiers, ensemble methods and cost-sensitive learning in Sect. 2, before presenting our framework of cost-sensitive components in Sect. 3. In Sect. 4 we discuss the road towards the state of the art, and we end the paper with our conclusions and directions for future work in Sect. 5.

2 Preliminaries

Most of this section’s material is provided with the intention of making the article as self-contained as possible. We begin with the basics of decision tree classifiers, which play a central role in this work. Readers familiar with these and ensemble methods can proceed to Sect. 2.3 for an introduction to cost-sensitive learning.

A dataset is a collection of records which consist of a number of characteristics, often referred to as attributes or features. A record’s outcome or class is what is of importance and needs to be predicted. Classifiers are trained using a set of records together with their known class in order to be able to predict the class of other records for which it is unknown.

In this work our interest lies in the binary class case where there are only two possibilities for the class. In binary imbalanced datasets, it is customary to call records within the minority class positive and within the majority class negative. Throughout this article, the class of positive records will be denoted by 1 and that of negative ones by 0.

Distinction is made between the different predictions a classifier makes. True Positive (TP) and False Negative (FN) denote a positive record correctly classified and misclassified respectively, and True Negative (TN) and False Positive (FP) are the equivalents for a negative record.

2.1 Decision tree classifiers

Decision tree classifiers are greedy algorithms that try to partition datasets according to their records’ outcomes through a series of successive splits. For each split, the attribute that partitions the records best according to some metric is chosen, and splitting stops when each partition contains records of only one class, or when further splits no longer improve the chosen metric.

Starting from the initial set of all records, or the tree’s root node, splits create branches in the tree which are labelled by the attribute used for splitting. Split sets are known as parent nodes, sets obtained after a split are called children nodes, and sets that are no longer split are called leaf nodes.

Usually, the next step after tree growing is pruning, done by removing nodes which do not improve accuracy. Pruning proceeds from the bottom of the tree upwards and serves to reduce over-fitting, that is, the effect of the tree’s quality of predictions not generalising beyond the dataset used for training.

In the final tree, each leaf node is assigned the class with the highest frequency among its records, and every record reaching the node will be predicted as having that class.

For certain parts of a decision tree algorithm, such as the node splitting step, it is necessary to know the probabilities that a record reaching a node t is positive (\(P_{t_+}\)) and negative (\(P_{t_-}\)). Since \(P_{t_-}=1-P_{t_+}\), it in fact suffices to know one of the two. Let \(N_t\), \(N_t^+\) and \(N_t^-\) respectively denote the sets of all, the positive, and the negative records at node t (when no subscript is specified we will be referring to the root node of the tree, or the set of all records). Then,

$$\begin{aligned} P_{t_+}=|N_t^+| / |N_t|. \end{aligned}$$

As probabilities based on frequency counts can be unreliable due to high bias and variance, they can be calibrated using methods like Laplace Smoothing as suggested by Pazzani et al. (1994) (\(P_{t_+}=\left( |N_t^+|+1 \right) / \left( |N_t|+2 \right) \)), m-estimation (Cestnik 1990; Zadrozny and Elkan 2001a, b) (\(P_{t_+}=\left( |N_t^+|+b\cdot m \right) /\left( |N_t|+m \right) \), where \(b=|N^+| / |N|\) and \(b\cdot m \approx 10\)), and Curtailment (Zadrozny and Elkan 2001a, b) (each node with less than m records, m as before, gets assigned the probability of its closest ancestor with at least m records) and combinations of the latter with any of the former two.
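For illustration, the following minimal sketch (ours, not taken from the cited works) computes Laplace-smoothed and m-estimated leaf probabilities from raw counts; the function names and the chosen value of m are illustrative.

```python
# Sketch: calibrating leaf-node probability estimates from frequency counts.
# n_pos and n_tot are the positive and total record counts at a node; the
# function names and the value of m below are illustrative assumptions.

def laplace(n_pos, n_tot):
    """Laplace smoothing: (|N_t^+| + 1) / (|N_t| + 2)."""
    return (n_pos + 1) / (n_tot + 2)

def m_estimate(n_pos, n_tot, base_rate, m):
    """m-estimation: (|N_t^+| + b*m) / (|N_t| + m), with b the root-node positive rate."""
    return (n_pos + base_rate * m) / (n_tot + m)

# Example: a leaf with 3 positives out of 4 records, in a dataset where 10% are positive.
b = 0.10
m = 10 / b                      # chosen so that b*m is roughly 10, as suggested above
print(laplace(3, 4))            # 0.667 instead of the raw 0.75
print(m_estimate(3, 4, b, m))   # pulled strongly towards the base rate for small leaves
```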

Examples of decision tree classifiers include the widely used CART (Breiman et al. 1984) and C4.5 (Quinlan 1993), briefly described below.

2.1.1 CART

CART (Classification and Regression Trees, Breiman et al. (1984)) uses the Gini index as a splitting measure during tree growing. More specifically, the attribute chosen for splitting is the one that maximises the following value, known as gain:

$$\begin{aligned} 1-P_{t_+}^2-P_{t_-}^2 - \displaystyle \sum _{i=1}^k\frac{|N_{t_i}|}{|N_t|} \left( 1-P_{t_{i+}}^2-P_{t_{i-}}^2 \right) , \end{aligned}$$
(1)

where \(t_1\) to \(t_k\) are the children nodes of tree node t.
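As an illustration, the sketch below computes the gain (1) for a candidate split of a binary-class node; the helper names are ours and the split is a toy example.

```python
import numpy as np

# Sketch of the gain value (1): Gini impurity of the parent node minus the
# weighted Gini impurity of the children. y_parent holds 0/1 classes and
# children is a list of index arrays, one per child node (illustrative names).

def gini(y):
    p_pos = np.mean(y) if len(y) else 0.0
    return 1.0 - p_pos**2 - (1.0 - p_pos)**2

def gini_gain(y_parent, children):
    weighted = sum(len(idx) / len(y_parent) * gini(y_parent[idx]) for idx in children)
    return gini(y_parent) - weighted

y = np.array([1, 1, 0, 0, 0, 1])
split = [np.array([0, 1, 5]), np.array([2, 3, 4])]   # a perfect split
print(gini_gain(y, split))                           # 0.5: parent impurity fully removed
```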

The pruning method used by CART is cost complexity pruning. Given a tree T and \(\alpha \in {\mathbb {R}}\), the aim is to find the subtree of T with the smallest error approximation, which is its error on the training data (the number of wrong predictions over the total number of records) plus \(\alpha \) times the number of its leaf nodes. Starting from the tree to be pruned and \(\alpha =0\), a finite sequence of subtrees and increasing \(\alpha \)s is obtained in this way. The tree chosen as the pruned tree is the one with the smallest error approximation on a separate validation set (a set not used for anything else).

By design, CART can take record weights w as input during training and use them to modify the probabilities used in calculating the gain (1) as

$$\begin{aligned} P_{w_{t_+}}=W_t^+ / W_t, \end{aligned}$$

where \(W_t^+\) and \(W_t\) respectively denote the sum of weights of all positive and all records at node t. Moreover, minimum weight is considered instead of minimum error for pruning, and each leaf node is assigned the class with the largest total weight among its records.

Remark 1

It is not clear if and how weighted probabilities can be calibrated.

2.1.2 C4.5

The splitting measure of C4.5 (Quinlan 1993) extends the entropy-based gain with a normalisation, yielding what is known as the gain ratio:

$$\begin{aligned} \displaystyle \frac{P_{t_+}\log _2{P_{t_+}} + P_{t_-} \log _2{P_{t_-}} - \sum _{i=1}^k\frac{|N_{t_i}|}{|N_t|} \left( P_{t_{i+}}\log _2{P_{t_{i+}}} + P_{t_{i-}} \log _2{P_{t_{i-}}} \right) }{\sum _{i=1}^k\frac{|N_{t_i}|}{|N_t|}\log _2{\frac{|N_{t_i}|}{|N_t|}}}, \end{aligned}$$

where \(t_1\) to \(t_k\) are the children nodes of tree node t. C4.5 employs reduced error pruning.

Ting (1998, 2002) showed how to implement the weighted CART design in C4.5.

2.2 Ensemble methods

In ensemble methods, several models are trained and their outcomes combined to give the final outcome, usually as a simple or weighted majority vote, the difference lying in whether each model’s vote weighs the same (as in Bagging and Random Forests) or may weigh differently (as in AdaBoost). Probabilities are usually combined by taking their average.

Some of the most important ensemble methods are briefly described below.

2.2.1 Bagging

The idea of Bagging (Breiman 1996) is to build several models (originally CART models, though in principle there is no restriction) on samples of the data. If the sampled sets are of the same size as the original data, they are called bootstraps.

  1. Sample with replacement a number of uniformly random and equally sized sets from the training set.
  2. For each sampled set, build a model producing outcomes or probabilities.
  3. A record’s final outcome (respectively probability \(P_+\)) is the majority vote on its outcome (respectively the average of its probabilities) from all models.
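The following is a minimal, illustrative sketch of these three steps using CART-style trees as base models (in practice a library implementation such as scikit-learn's BaggingClassifier would be used); all names are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Sketch of Bagging with simple majority voting over CART-style trees;
# this mirrors the three steps above rather than a production implementation.

def bagging_predict(X_train, y_train, X_test, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_test))
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))   # bootstrap sample
        model = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes += model.predict(X_test)
    return (votes / n_models > 0.5).astype(int)                  # simple majority vote
```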

2.2.2 Random Decision Forests

As originally defined by Ho (1995, 1998), Random Decision Forests differs from Bagging in that it samples subsets of the attribute set instead of the data to build decision trees. Here we will abuse terminology slightly and consider Random Decision Forests as a special case of Bagging that builds Random Feature Trees, a name we give to decision trees that are grown on a random subset of the attribute set.

2.2.3 Boosting

Boosting refers to enhancing the predictive power of a “weak” classifier by rerunning it several times, each time focusing more on misclassified records.

AdaBoost (Freund and Schapire 1997) is the most notable example of Boosting in which the focus on each record is in terms of weights: misclassified records after a round get increased weights and correctly classified ones get decreased weights.

  1. Assign weight \(w=1\) to each record in the training set.
  2. Normalise each record’s weight by dividing it by the sum of the weights of all records.
  3. Build a model using the weighted records and obtain each record’s outcome \(h \in \{0,1\}\) and the model’s total error \(\varepsilon \) as the sum of weights of all misclassified records.
  4. Update each record’s weight as \(w'=w\cdot e^{-\alpha y_* h_*}\), where \(\alpha =\frac{1}{2}\ln \left( \frac{1-\varepsilon }{\varepsilon }\right) \), and \(h_*\) and \(y_*\) are h and y, the record’s true class, mapped from \(\{0,1\}\) to \(\{-1,1\}\).
  5. Repeat steps 2 to 4 as required.
  6. A record’s final outcome is the weighted majority vote on its outcome from all models, the weights being the \(\alpha \)s.
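A compact sketch of these steps with decision stumps as the weak models is given below; it is illustrative only and omits the usual safeguards (e.g. stopping when \(\varepsilon =0\) or \(\varepsilon \ge 0.5\)).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Sketch of the AdaBoost steps above with decision stumps as weak models.

def adaboost_fit(X, y, n_rounds=10):
    y_pm = 2 * y - 1                          # map classes {0,1} to {-1,1}
    w = np.ones(len(y))                       # step 1: unit weights
    models, alphas = [], []
    for _ in range(n_rounds):
        w = w / w.sum()                       # step 2: normalise
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        h = stump.predict(X)                  # step 3: outcomes and total error
        eps = w[h != y].sum()
        alpha = 0.5 * np.log((1 - eps) / eps)
        h_pm = 2 * h - 1
        w = w * np.exp(-alpha * y_pm * h_pm)  # step 4: weight update
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):      # step 6: weighted majority vote
    score = sum(a * (2 * m.predict(X) - 1) for m, a in zip(models, alphas))
    return (score > 0).astype(int)
```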

A generalised version of AdaBoost with different \(\alpha \) was proposed by Schapire and Singer (1999). Niculescu-Mizil and Caruana (2005) investigated methods for obtaining reliable probabilities from AdaBoost, something that by default it is incapable of doing, through calibrating S, the normalised sum of weighted model votes. These are Logistic Correction (Friedman et al. 2000) (\(P_{lc}=1/(e^{-2\left( 2\cdot S-1\right) }+1)\), where the name was coined by Niculescu-Mizil and Caruana (2005)), Platt Scaling (Platt 1999; Zadrozny and Elkan 2002) (\(P_{ps}=1/(e^{A\cdot S+B}+1)\), where A and B are fitted by maximum likelihood on a validation set with classes mapped from \(\{0,1\}\) to \(\{1/(|N^-|+2),(|N^+|+1)/(|N^+|+2)\}\)) and Isotonic Regression (Robertson et al. 1988; Zadrozny and Elkan 2002) (essentially an application of the PAV algorithm (Ayer et al. 1955): (1) sort training records according to their sum S, (2) initialise each record’s probability as 0 if negative and 1 if positive, (3) whenever a record has higher probability than its successor, replace the probability of both by their average and consider them as one record thus forming intervals, and (4) a record’s probability is the one of the interval its sum S falls in).
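To make the isotonic-regression option concrete, the sketch below implements the pool-adjacent-violators pooling described in steps (2)–(3), assuming the records are already sorted by their score S; it is our own illustrative rendering, not code from the cited works.

```python
# Sketch of the pool-adjacent-violators procedure: probabilities start as the
# 0/1 labels of the sorted records, and adjacent violations are repeatedly
# pooled (weighted-averaged) into intervals until the sequence is monotone.

def pav_calibrate(sorted_labels):
    blocks = [[float(y), 1] for y in sorted_labels]    # each block: [probability, count]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:            # violation: pool the two blocks
            p = (blocks[i][0] * blocks[i][1] + blocks[i + 1][0] * blocks[i + 1][1]) / (
                blocks[i][1] + blocks[i + 1][1])
            blocks[i] = [p, blocks[i][1] + blocks[i + 1][1]]
            del blocks[i + 1]
            i = max(i - 1, 0)                          # re-check against the previous block
        else:
            i += 1
    # expand back to one calibrated probability per record
    return [b[0] for b in blocks for _ in range(b[1])]

# Records already sorted by their normalised score S; labels 0/1.
print(pav_calibrate([0, 1, 0, 0, 1, 1]))   # non-decreasing calibrated probabilities
```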

2.2.4 Random Forests

Random Forests (Breiman 2001) is in fact Bagging confined to tree classifiers with Random Input Selection, which at each splitting step choose the best attribute out of a small randomly chosen subset of all attributes, and are not pruned.

2.3 Cost-sensitive learning

Cost-sensitive (CS) learning refers to aiming at minimising costs related to the dataset instead of error, typically via these costs influencing the classification process in some way. In this work we only consider misclassification costs, though other types exist, such as the cost of attribute acquisition and obtaining attribute values that are missing.

Traditionally, different costs (or benefits) assigned to each type of classification are given in the form of a Cost Matrix:

$$\begin{aligned} CM=\left[ \begin{array}{cc} C_{TP} &\quad C_{FN} \\ C_{FP} &\quad C_{TN} \end{array} \right] \end{aligned}$$

In the sequel, we only consider misclassification costs that are higher than costs of correct classification, and by letting \(C_{TN}'=C_{TP}'=0\), \(C_{FP}'=C_{FP}-C_{TN}\) and \(C_{FN}'=C_{FN}-C_{TP}\), we can reduce our attention to only misclassification costs, even when the other costs are non-zero (Elkan 2001).

Costs can be either constant for all records of a class, often called class-dependent, or vary per record which we will call record-dependent. For instance, in credit card fraud detection, false positive costs \(C_{FP}\) are equal to overhead costs and can be the same for all transactions, whereas false negative costs \(C_{FN}^i\) depend on the individual transactions i and are equal to the corresponding amount.

2.3.1 CS decisions

A cost-insensitive classifier would label a record as positive if \(P_+ > P_-\) or equivalently if \(P_+ > 0.5\). As explained by Elkan (2001), this decision can be made cost-sensitive by using the minimum expected cost (MEC) criterion, that is by labelling a record as positive if \( C_{FP}\cdot P_- < C_{FN} \cdot P_+\), or equivalently if \(P_+ > T_{cs}\), where \(T_{cs}\) is the cost-sensitive threshold

$$\begin{aligned} T_{cs}=\frac{C_{FP}}{C_{FP}+C_{FN}}. \end{aligned}$$
(2)

Note that \(T_{cs}=0.5\) corresponds to the case of equal misclassification costs, \(C_{FN}>C_{FP}\) implies \(T_{cs} <0.5\) and \(C_{FN}<C_{FP}\) implies \(T_{cs} > 0.5\).
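A minimal sketch of the MEC decision rule, with illustrative costs:

```python
# Sketch of the minimum expected cost decision rule: label positive when
# P_+ exceeds T_cs = C_FP / (C_FP + C_FN). Costs below are arbitrary.

def mec_label(p_pos, c_fp, c_fn):
    t_cs = c_fp / (c_fp + c_fn)
    return int(p_pos > t_cs)

# A record with P_+ = 0.2 is still flagged positive when missing a positive
# is ten times as costly as a false alarm (T_cs is roughly 0.09).
print(mec_label(0.2, c_fp=1.0, c_fn=10.0))   # 1
```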

Remark 2

The case of record-dependent costs can be treated by considering a distinct threshold \(T^i_{cs}\) per record i, as first observed by Zadrozny and Elkan (2001b).

Thresholding (Sheng and Ling 2006), instead of using the theoretical threshold (2), looks for the best threshold \(T_{thr}\) among all probabilities obtained from the training set by computing the total costs for each on a validation set and choosing the one with the lowest.
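A sketch of Thresholding under the simplifying assumption of constant costs; the candidate thresholds are the probabilities observed on the training set and the selection is made on a validation set (names are ours):

```python
import numpy as np

# Sketch of Thresholding: try each training-set probability as a candidate
# threshold and keep the one with the lowest total cost on a validation set.

def best_threshold(p_train, p_val, y_val, c_fp, c_fn):
    candidates = np.unique(p_train)
    costs = [((p_val > t) & (y_val == 0)).sum() * c_fp +    # false positives
             ((p_val <= t) & (y_val == 1)).sum() * c_fn     # false negatives
             for t in candidates]
    return candidates[int(np.argmin(costs))]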

2.3.2 CS data sampling

To induce decision making using the threshold \(T_{cs}\) of (2) instead of 0.5 when the cost ratio is constant, we can under-sample the negative training records by only sampling \(|N^-|\cdot \frac{C_{FP}}{C_{FN}}\) out of \(|N^-|\) (Elkan 2001). Equivalently, we can over-sample the positive training records by duplicating existing ones or by synthesising new records to reach a total of \(|N^+|\cdot \frac{C_{FN}}{C_{FP}}\) instead of \(|N^+|\). Sampling and duplicating can either be random or targeted according to some rule. Naturally, any combination of these techniques that yields a positive-negative ratio equal to

$$\begin{aligned} r_{cs}=\frac{|N^+|}{|N^-|} \cdot \frac{C_{FN}}{C_{FP}} \end{aligned}$$
(3)

is possible, and we shall call it hybrid-sampling.

Remark 3

If \(\frac{C_{FN}}{C_{FP}} > \frac{|N^-|}{|N^+|}\) then sampling turns the positive class into the majority. Under-sampling reduces the size of training data and consequently model training time at the cost of losing potentially useful data. On the other hand, over-sampling makes use of all data but leads to increased training times, and record duplication entails the risk of over-fitting.
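As an illustration, the sketch below performs random cost-sensitive under-sampling of the negative class so that the resulting positive–negative ratio equals \(r_{cs}\) of (3); it assumes \(C_{FN}\ge C_{FP}\) and the helper name is ours.

```python
import numpy as np

# Sketch of cost-sensitive under-sampling: keep all positives and only
# |N^-| * C_FP / C_FN randomly chosen negatives, shifting the class ratio to r_cs.

def cs_undersample(X, y, c_fp, c_fn, seed=0):
    rng = np.random.default_rng(seed)
    neg_idx = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg_idx, size=int(len(neg_idx) * c_fp / c_fn), replace=False)
    keep = np.concatenate([np.flatnonzero(y == 1), keep_neg])
    return X[keep], y[keep]
```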

One method for synthesising new records, thus avoiding the risk of over-fitting, is the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al. 2002), which over-samples positive records by creating new ones that are nearest neighbours (roughly speaking, that have the closest similarity attribute-wise) to existing ones:

  1. Choose a positive record and find some (say k) of its nearest neighbours.
  2. For each nearest neighbour, find its per attribute distance \(d_a\) with the positive record.
  3. Create a new positive record with attributes those of the positive record minus a random fraction of \(d_a\).
  4. Repeat as required, keeping k fixed.
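A rough sketch of these steps for numeric attributes follows; in practice a library implementation (e.g. the SMOTE class of the imbalanced-learn package) would be preferred, and the names below are illustrative.

```python
import numpy as np

# Sketch of generating one synthetic positive record as described above:
# pick one of the k nearest positive neighbours and move a random fraction
# of the per-attribute distance towards it.

def smote_one(X_pos, i, k, rng):
    dists = np.linalg.norm(X_pos - X_pos[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]          # k nearest positive neighbours
    j = rng.choice(neighbours)
    d = X_pos[i] - X_pos[j]                          # per-attribute distance d_a
    return X_pos[i] - rng.random() * d               # new record a random fraction away

rng = np.random.default_rng(0)
X_pos = np.array([[1.0, 2.0], [1.5, 2.5], [0.5, 1.0]])
print(smote_one(X_pos, i=0, k=2, rng=rng))
```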

An alternative to sampling or synthesising records to reach the ratio \(r_{cs}\) in (3) is Cost-Proportionate Rejection (CPR) Sampling (Zadrozny et al. 2003), which is also applicable when costs are record-dependent, and where a sampled record is accepted with probability proportional to its cost:

  1. Sample with replacement a uniformly random set from the training set.
  2. Create a new training set that includes each of the sampled set’s elements with probability \(C_{FN}/\max \{C_{FN},C_{FP}\} \) if positive or \(C_{FP}/\max \{C_{FN},C_{FP}\}\) if negative.
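A sketch of CPR sampling with record-dependent costs, where cost[i] denotes the misclassification cost of record i (\(C_{FN}^i\) if positive, \(C_{FP}^i\) if negative); the helper name is ours.

```python
import numpy as np

# Sketch of Cost-Proportionate Rejection sampling with record-dependent costs:
# each bootstrapped record is accepted with probability cost / max cost.

def cpr_sample(X, y, cost, n_draws, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=n_draws)             # sample with replacement
    accept = rng.random(n_draws) < cost[idx] / cost.max()   # keep proportionally to cost
    return X[idx[accept]], y[idx[accept]]
```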

2.3.3 CS record weights

Ting (1998) was the first to explicitly incorporate costs in the weights \(w_+\) of the positive and \(w_-\) of the negative classes used in weighted classifiers, followed by normalisation:

$$\begin{aligned} w_+= C_{FN} \,\hbox { and }\, w_-= C_{FP}. \end{aligned}$$
(4)

Remark 4

We observe that record-dependent costs can be easily taken into account by replacing \(C_{FN}\) by \(C^i_{FN}\) and \(C_{FP}\) by \(C^i_{FP}\). Clearly, equal costs yield equal weights.
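For illustration, cost-based (possibly record-dependent) weights as in (4) can be computed and passed to any weighted learner via its sample-weight mechanism; the numbers below are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Sketch of cost-based record weights (4), record-dependent as in Remark 4,
# followed by normalisation. The cost values are arbitrary illustrative numbers.

def cost_weights(y, c_fn, c_fp):
    w = np.where(y == 1, c_fn, c_fp)   # w_+ = C_FN, w_- = C_FP (per record if arrays)
    return w / w.sum()                 # normalisation

y = np.array([1, 0, 0, 1])
w = cost_weights(y, c_fn=np.array([50.0, 0.0, 0.0, 120.0]), c_fp=5.0)
# model = DecisionTreeClassifier().fit(X, y, sample_weight=w)   # X as appropriate
```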

3 Cost-sensitive ensemble methods

Cost-sensitive ensemble methods can be divided into three main categories, depending on when costs influence the classification process: before training at the data level (Sect. 3.1), during training at the algorithm level (Sect. 3.2), and after training at the decision level (Sect. 3.4). Figure 1 provides a summary. Naturally, combinations of these are possible and yield what we shall call hybrid methods.

Fig. 1 The main categorisation of cost-sensitive ensemble methods with respect to the point at which costs influence the classification process. For a non-exhaustive list of pre- and during-training methods see Table 2, and for post-training methods see Table 3

The descriptions we provide are brief and sometimes slightly modified from the original ones in order to unify and generalise them, to incorporate costs when they are not explicitly mentioned, and, where possible, to include record-dependent costs when the original descriptions were given with only constant costs in mind. They are also base-classifier independent, meaning that apart from decision trees that we focus on in this paper, other classifiers such as logistic regression and support vector machines can be used as well.

3.1 Pre-training methods

Pre-training methods employ either cost-sensitive data sampling or record weights, and as a result models need to be retrained in case the costs change.

3.1.1 Sampling based

CS-SampleEnsemble By this we describe the class of ensembles that use some cost-sensitive sampling method to modify the training set before training each base classifier. This is a generalisation of the concept used in Costing (Zadrozny et al. 2003) where subsets of the training set are obtained by CPR sampling. The other examples found in the literature are Balanced Random Forest (Chen et al. 2004) which uses equally sized sets (originally meant to be balanced and thus cost-insensitive) to build Random Input Selection Tree models (Sect. 2.2.4), EasyEnsemble (Liu et al. 2008), an ensemble of ensembles where under-sampling is used to obtain a number of equally sized sets and build AdaBoost models, and SMOTEbagging and UnderBagging (Wang and Yao 2009) which respectively use SMOTE and random undersampling. It can be seen as Bagging (Sect. 2.2.1) with modified sampling step:

  1. Using a cost-sensitive sampling method, sample a number of sets from the training set.

Remark 5

Although the original definition of Costing did not include models that also produce probabilities, we could not find any reason to exclude them. Costing reduces to Bagging when costs are equal, as CPR-sampling first randomly samples the data with replacement. Other sampling methods however can sample with replacement at most one of the classes.

CS-preSampleEnsemble We propose this as the class of ensembles that use some cost-sensitive sampling method to first modify the training set before sampling subsets from it in the manner of Bagging. It can be viewed as Bagging (Sect. 2.2.1) with the following different steps:

  0. Modify the training set by means of a cost-sensitive sampling method.
  1. Sample with replacement a number of uniformly random and equally sized sets from the modified training set.

Remark 6

Using CS-undersampling has the disadvantage of producing modified sets of a rather small size. Also, when \(C_{FP}=C_{FN}\), CS-preSampleEnsemble reduces to Bagging.

CS-SampleBoost By this we describe the class of AdaBoost variants that modify the weight of each record by using some sampling method on the training set. The examples found in the literature are SMOTEBoost (Chawla et al. 2003) and RUSBoost (Seiffert et al. 2010) which respectively use SMOTE and random undersampling. The steps different to AdaBoost (Sect. 2.2.3) are:

  2a. Modify the training set by means of a cost-sensitive sampling method.
  2b. Normalise each record’s weight in the modified set by dividing it by the sum of weights of all records in it.
  3. Build a model using the modified set of weighted records and obtain for the initial set each record’s outcome \(h \in \{0,1\}\) and the model’s total error \(\varepsilon \) as the sum of weights of all misclassified records.
  5. Repeat steps 2a to 4 as required.

Remark 7

When \(C_{FP}=C_{FN}\), CS-SampleBoost reduces to AdaBoost.

3.1.2 Weights based

Naive CS AdaBoost Mentioned by Viola and Jones (2001), Zadrozny et al. (2003) and Masnadi-Shirazi and Vasconcelos (2011), its only difference from AdaBoost is the cost-dependent initial weights in Step 1.

  1. Assign to each positive and negative record in the training set weights as in (4).

Remark 8

When \(C_{FP}=C_{FN}\), Naive CS AdaBoost reduces to AdaBoost.

CS-WeightedEnsemble By this we describe the class of special cases of Bagging where the models built are weighted and the weights initialised as in (4). It is a generalisation of Weighted Random Forest (Chen et al. 2004), a variant of Random Forests that builds weighted Random Input Selection Tree models (see Sect. 2.2.4). Weighted CART and C4.5 can also be used. We only describe the steps that are different to Bagging (Sect. 2.2.1):

  0. Assign to each positive and negative record in the training set weights as in (4).
  2. For each sampled set, normalise the weights and build a weighted model.
  3. A record’s final outcome is the weighted majority vote on its outcome from all models, the weights being the average record weights at the tree nodes reached by the record.

3.2 During-training methods

In during-training methods, costs directly influence the way base classifiers are built, which therefore have to be rebuilt when costs change.

3.2.1 CS base ensemble

Cost-insensitive ensemble methods such as Bagging can be made cost-sensitive by employing base classifiers whose decisions based on maximising accuracy are replaced by decisions based on minimising misclassification costs. Restricting our attention in this paper to binary decision trees, the possibilities are cost-sensitive node splitting and/or tree pruning.

CS node Splitting By replacing the impurity measure (such as Entropy or the Gini index) by a cost measure, node splitting in a decision tree is made cost-sensitive.

An example is Decision Trees with Minimal Costs (Ling et al. 2004) that do cost-minimising splitting and labelling without pruning. In detail, during tree-growing, a node t is labelled according to \(P_{t_+}>T_{cs}\). The costs \(C_t\) at this node are \(\sum _{i \in N_{t}^+} C_{FN}^i\) if the node is labelled as negative and \(\sum _{i \in N_{t}^-} C_{FP}^i\) otherwise. The attribute selected for node-splitting is the one that, instead of maximising the gain value, maximises \(C_t-\sum _{i=1}^k C_{t_i},\) where \(C_{t_1}\) to \(C_{t_k}\) are the costs at the children nodes of node t, calculated in the same way as \(C_t\).
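A sketch of this cost-based splitting criterion, with record-dependent cost vectors and a single threshold \(T_{cs}\) for labelling (the names and the scalar threshold are our simplifying assumptions):

```python
import numpy as np

# Sketch of the cost-based splitting criterion of Decision Trees with Minimal
# Costs: a node's cost is the total misclassification cost of its MEC label,
# and the chosen split maximises C_t minus the summed children costs.

def node_cost(y, c_fn, c_fp, t_cs):
    p_pos = y.mean() if len(y) else 0.0
    if p_pos > t_cs:                   # node labelled positive: negatives are misclassified
        return c_fp[y == 0].sum()
    return c_fn[y == 1].sum()          # node labelled negative: positives are misclassified

def cost_gain(y, children, c_fn, c_fp, t_cs):
    # children is a list of index arrays, one per child node of the candidate split
    return node_cost(y, c_fn, c_fp, t_cs) - sum(
        node_cost(y[idx], c_fn[idx], c_fp[idx], t_cs) for idx in children)
```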

CS pruning By replacing the accuracy measure by a cost measure, tree pruning becomes cost-sensitive. Examples include Reduced Cost Pruning (Pazzani et al. 1994) and Cost-Sensitive Pruning (Bradford et al. 1998) which respectively modify the reduced error pruning of C4.5 and the cost-complexity pruning of CART to calculate costs instead of errors, and Knoll et al. (1994) where both are done.

A hybrid example is Cost Sensitive Decision Trees (Bahnsen et al. 2015), which do the same cost-minimising splitting, labelling and pruning mentioned above, with emphasis on record-dependent costs.

Table 1 Details of the CS-variants of AdaBoost

3.3 CS variants of AdaBoost

CS variants of AdaBoost typically use the misclassification costs to update the weights of misclassified records differently per class. They include UBoost (Ting and Zheng 1998a), AdaCost (Fan et al. 1999), AdaUBoost (Karakoulas and Shawe-Taylor 1999), Asymmetric AdaBoost (AsymAB, Viola and Jones 2001), CSB0 (Ting 2000a, b), CSB1 and CSB2 (Ting 2000a), AdaC1, AdaC2 and AdaC3 (Sun et al. 2007), and Cost-Sensitive AdaBoost (CSAB, Masnadi-Shirazi and Vasconcelos 2011). Their steps different to AdaBoost are:

  1. Assign to each record in the training set weight according to Table 1.
  4. Update each record’s weight according to Table 1.
  6. A record’s final outcome is the weighted majority vote on its outcome from all models, the weights being according to Table 1.

Remark 9

It is not immediately clear how record-dependent costs can be used in CSAB.

All these have been theoretically analysed by Nikolaou et al. (2016), together with Naive CS AdaBoost (Sect. 3.1.2) and AdaMEC (Sect. 3.4.3), from different viewpoints, with the conclusion that only the latter two and AsymAB have solid foundations, and that calibration improves performance.

3.4 Post-training methods

In post-training methods, misclassification costs influence the classification step. Thus, when not used to build hybrid models, they offer the advantage of not having to retrain models when costs change. Some of these methods are only applicable if the costs of the records to be predicted are known at the time of prediction, thus when this is not the case, the unknown costs need to be somehow estimated.

3.4.1 Direct minimum expected cost classification

Direct minimum expected cost classification (DMECC) bases the final decision of a classifier producing probabilities on a threshold \(T \in \{T_{cs},T_{thr}\} \) (see Sect. 2.3.1).

  1. Build a model producing probabilities.
  2. A record’s outcome is obtained according to \(P_+ > T\).

One possibility is to apply DMECC to an ensemble producing probabilities, as done in Calibrated AdaBoost (Nikolaou and Brown 2015), where AdaBoost probabilities are obtained using Platt Scaling (see Sect. 2.2.3) and \(T=T_{cs}\).

Another possibility we have identified is to use DMECC to obtain a cost-sensitive outcome (instead of the default cost-insensitive one) from the base classifiers in any ensemble, when these are capable of producing probabilities, leading us to propose DMECC-Ensemble and DMECC-AdaBoost, which do so respectively in Bagging and AdaBoost.

Remark 10

If \(T_{cs}\) is constant for all records then DMECC-Ensemble and DMECC-AdaBoost should be equivalent to CS-SampleEnsemble and CS-SampleBoost respectively (excluding CPR-sampling), though relying on probabilities instead of data-sampling. This follows from the equivalence of DMECC with threshold \(T_{cs}\) and CS-sampling shown by Elkan (2001) and discussed in Sect. 2.3.2.

Remark 11

DMECC with threshold \(T_{cs}\) is only applicable if the costs of the records to be predicted are known at the time of prediction (as \(T_{cs}\) depends on them).

3.4.2 MetaCost

As originally proposed, MetaCost (Domingos 1999) relabels the training set using the predictions obtained by Bagging with DMECC and re-uses it to train a single classifier. We generalise this concept to use the predictions of any cost-sensitive classifier, in the spirit of Ting (2000b), where AdaMEC (see Sect. 3.4.3 below) and CSB0 are used instead.

  1. Replace each training record’s outcome by the one obtained from a cost-sensitive model.
  2. Build a single (cost-insensitive) model using the relabelled records.
  3. A record’s outcome is its outcome from the new model.

Remark 12

MetaCost reduces cost-sensitive ensemble models to single models, which are typically more explainable but less capable of capturing all the data characteristics. As observed by Ting (2000b), these single models often perform worse than the ensemble models.

3.4.3 CS ensemble voting

Costs can also be taken into account in an ensemble during weighted majority voting.

Cost-sensitive weights for model votes In certain AdaBoost variants, such as Naive CS AdaBoost, \(\varepsilon \) is calculated on cost-based record weights and hence results in a cost-sensitive \(\alpha \), which serves as the weight of the model’s vote in weighted majority voting. We observe that it is in fact straightforward to mimic this for any ensemble as follows:

  1. Assign to each positive and negative record in the training set weights as in (4).
  2. For each model in the ensemble compute \(\alpha =f(\varepsilon )\), where f is some function and \(\varepsilon \) is the sum of weights of all misclassified records from the training or a validation set.

Possibilities for the function f include

$$\begin{aligned} f(\varepsilon )=\ln \left( (1-\varepsilon )/\varepsilon \right) , \quad f(\varepsilon )=1-\varepsilon , \quad f(\varepsilon )=e^{(1-\varepsilon )/\varepsilon } \quad \text { and } \quad f(\varepsilon )=\left( (1-\varepsilon )/\varepsilon \right) ^2 \end{aligned}$$
(5)

with the latter two providing a right-skewed distribution.

MEC-voting By MEC-Voting we shall refer to the generalisation to any ensemble of the idea behind AdaBoost with minimum expected cost criterion (Ting 2000a, b), or AdaMEC as coined by Nikolaou and Brown (2015), which is AdaBoost with modified Step 6:

  6. A record’s final outcome is the weighted majority vote on its outcome from all models, the weights being the product of \(\alpha \) and the misclassification cost associated with the outcome.

Majority threshold adjustment (MTA) The outcome of weighted majority voting is positive if the normalised sum of the weights of positive votes is greater than 0.5. Alternative cost-sensitive majority thresholds that can be used are \(T_{cs}\) (Sect. 2.3.1) and \(T_{mthr}\), which we define as the one that yields the least costs on a validation set, as done in Thresholding described in Sect. 2.3.1.

Theorem 1

MEC-Voting and MTA using \(T_{cs}\) are equivalent.

Proof

The weighted sums of positive votes with and without MEC-Voting in an ensemble of m models are respectively \(S_1= \frac{\sum _{i:h_i=1}{\alpha _i C_{h_i}}}{ \sum _{i=1}^m{\alpha _i C_{h_i}}}\) and \(S_2=\frac{\sum _{i:h_i=1}{\alpha _i}}{ \sum _{i=1}^m{\alpha _i }}\), where \(C_{h_i}\in \{C_{FN},C_{FP}\}\) is the (non-zero) misclassification cost associated with the record’s outcome \(h_i\) from model i.

If \(S_1=0\) then \(S_2=0\) and the theorem holds trivially. Otherwise, \(S_1\) can be expressed in terms of \(S_2\):

$$\begin{aligned} S_1&=\frac{\sum _{i:h_i=1}{\alpha _i C_{h_i}}}{ \sum _{i:h_i=1}{\alpha _i C_{h_i}}+\sum _{i:h_i=0}{\alpha _i C_{h_i}}} =\frac{1}{1+\frac{\sum _{i:h_i=0}{\alpha _i C_{h_i}}}{\sum _{i:h_i=1}{\alpha _i C_{h_i}}}} =\frac{1}{1+\frac{C_{FP}}{C_{FN}}\left( \frac{\sum _{i:h_i=0}{\alpha _i}}{\sum _{i:h_i=1}{\alpha _i}} + 1 - 1 \right) }\\ &=\frac{1}{1+\frac{C_{FP}}{C_{FN}}\left( \frac{\sum _{i=1}^m{\alpha _i}}{\sum _{i:h_i=1}{\alpha _i}} - 1 \right) } =\frac{1}{1+\frac{C_{FP}}{C_{FN}}\left( \frac{1}{S_2} - 1 \right) } . \end{aligned}$$

Solving this equality for \(S_2\) and using the fact that a record’s outcome is positive if \(S_1>0.5\) and negative otherwise, we obtain \(S_2 > \frac{C_{FP}}{C_{FP}+C_{FN}}=T_{cs}\), which is MTA using \(T_{cs}\) as required. \(\square \)

Remark 13

The equivalence of Theorem 1 was shown specifically for AdaMEC by Nikolaou et al. (2016).

Remark 14

Both MEC-Voting and MTA using \(T_{cs}\) are only applicable if the costs of the records to be predicted are known at the time of prediction.
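A small numeric sanity check of Theorem 1, with arbitrary illustrative votes, model weights and costs, confirming that MEC-Voting and MTA with \(T_{cs}\) make identical decisions:

```python
import numpy as np

# Numeric sanity check of Theorem 1: MEC-Voting (cost-weighted votes compared
# to 0.5) and MTA with threshold T_cs give the same decision. The votes,
# alphas and costs below are arbitrary illustrative values.

rng = np.random.default_rng(1)
h = rng.integers(0, 2, size=7)             # ensemble members' outcomes
alpha = rng.random(7)                      # their voting weights
c_fn, c_fp = 8.0, 2.0

cost_of_vote = np.where(h == 1, c_fn, c_fp)   # C_{h_i} as defined in the proof
s1 = (alpha * cost_of_vote)[h == 1].sum() / (alpha * cost_of_vote).sum()
s2 = alpha[h == 1].sum() / alpha.sum()
t_cs = c_fp / (c_fp + c_fn)

assert (s1 > 0.5) == (s2 > t_cs)           # identical decisions, as proven above
```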

Table 2 The basic ensembles and a (non-exhaustive) list of pre- and during-training methods derived from our framework (independent of base classifier), with abbreviations

3.5 Hybrid methods

Table 3 gives an overview of how post-training methods can be combined with pre- or during training methods to yield hybrid methods.

Table 3 Overview of the post-training methods combinable with each type of pre- and during-training method according to the probability calibration used

4 Towards determining the state-of-the-art in cost-sensitive learning

A natural question to ask is which, if any, of the described framework components can be considered as state-of-the-art. Obtaining an indication on this requires a rigorous experimental comparison over a range of datasets, often referred to as benchmarking. Such a benchmarking would be most useful if it considers sufficiently many publicly available datasets in order to allow reproducibility and the updating of the benchmarking via the inclusion of newly proposed methods. As already mentioned in the Introduction (Sect. 1.1), there are two main uses of cost-sensitive learning, which should be considered separately.

The first use is for treating class imbalance alone, in which case the misclassification costs do not necessarily have to be derived from the dataset or application domain and can be randomly assigned. Typically, different pairs of constant (class-dependent) costs are tried out in search of the one that gives the best results according to the metric of choice, which should be suitable for imbalanced datasets. A benchmarking can therefore be performed on a selection of the many imbalanced datasets already publicly available, and different sub-cases can depend on the level of imbalance. Although such a benchmarking will give an indication of the framework components that are best suited for treating class imbalance, in order to provide a complete picture it needs to be part of a more general benchmarking that also includes cost-insensitive methods specifically aimed at treating imbalance (see for example (Galar et al. 2011) for an overview).

The second use is for minimising misclassification costs that are derived from the application domain or dataset, irrespective of the level of imbalance. In some cases, determining these costs is straightforward (such as the amount of a donation or transaction), whereas in others it can be non-trivial. For instance, the provider of the dataset considered by Petrides et al. (2020) had to calculate the costs using their domain knowledge and experience, and the costs considered by Lawrance et al. (2021) depend on a parameter whose estimation first requires designing and carrying out appropriate experiments.

Typically, the evaluation measure depends on these costs, often simply being their sum. This, however, might not be sufficient in certain cases, which include fraud detection and direct marketing, and an additional evaluation measure might be necessary. For instance, in fraud detection we are in practice interested in models that do not disrupt the operation of the business, thus our attention should be restricted to models that achieve not only the lowest costs, but also a realistic False Positive Rate (FPR). In the case of credit cards in particular, an FPR of at most \(3\%\) should be within the capacity of investigating agents. In direct marketing scenarios, contacting potential respondents to a request (such as for making a donation or a purchase) may incur costs (such as for postage). Thus, the application of a model assumes that a budget providing the capacity to contact all those the model predicts as respondents is readily available, which might not always be the case. For this reason it would be more appropriate to also look at the return on investment (ROI), given as the net profit over expenditure. These two examples suggest that a benchmarking should probably be done per application domain, something not unusual (see for instance (Lessmann et al. 2015) for a benchmark in the domain of credit scoring, albeit not focusing on cost-sensitive methods and using private datasets).

The main obstacle preventing such a benchmarking is the absence of sufficiently many publicly available datasets in general, let alone per application domain. The alternative of using publicly available datasets having no attribute from which misclassification costs can be derived, and assigning random costs to them (as done when treating class imbalance alone, as discussed above), might give misleading results in the absence of a clear business case. Moreover, constant costs are quite rare in domains such as direct marketing and fraud detection, and assigning random record-dependent costs is a difficult task, mainly because the distribution from which they should be drawn is unknown. We hope that practitioners will receive this as an open call to make more datasets publicly available in order to facilitate advances in the field, from which they can benefit as well.

For the interested reader, in Appendix A, we provide four examples of cost-minimisation applications of cost-sensitive learning using publicly available datasets.

5 Conclusions and future work

In this paper we have described and categorised available cost-sensitive methods with respect to misclassification costs by means of a unifying framework that also allowed us to identify new combinations and extend ideas across classifiers. This was our main contribution, which clarifies the picture and should aid further developments in the domain and serve as a baseline to which newly proposed methods should be compared.

As our work has identified, some possibilities for new cost-sensitive ensemble methods can arise by developing a novel approach in any of the following domains: (a) cost-sensitive base classifiers such as decision trees with cost-sensitive node splitting and pruning, (b) cost-sensitive sampling, and using costs to (c) specify record weights, (d) update weights in AdaBoost variants, and (e) specify classifier weights for ensemble voting.

Worth exploring are how Ensemble Pruning (Zhou 2012) and Stacking (Wolpert 1992) (which instead of using the outputs of all the ensemble members for voting or averaging, respectively first choose a subset of them, or use them to train a second model) can be made cost-sensitive for inclusion in the post-training methods. It would also be interesting to investigate whether Gradient Boosting (Friedman 2001), another representative of the boosting principle, and its popular variant XGBoost (Chen and Guestrin 2016) can have cost-sensitive variants, in particular with record-dependent costs, apart from being combined with post-training methods.

Another avenue for future research is to examine cost-sensitivity with respect to other types of costs as mentioned by Turney (2000), particularly costs of attribute acquisition and costs of obtaining missing values in the data (Turney 1995; Ling et al. 2004), that are important in many real world applications.