Consistent algorithms for multiclass classification with an abstain option

We consider the problem of n-class classification (n ≥ 2), where the classifier can choose to abstain from making predictions at a given cost, say, a factor α of the cost of misclassification. Our goal is to design consistent algorithms for such n-class classification problems with a 'reject option'; while such algorithms are known for the binary (n = 2) case, little has been understood for the general multiclass case. We show that the well known Crammer-Singer surrogate and the one-vs-all hinge loss, albeit with a different predictor than the standard argmax, yield consistent algorithms for this problem when α = 1/2. More interestingly, we design a new convex surrogate, which we call the binary encoded predictions surrogate, that is also consistent for this problem when α = 1/2 and operates on a much lower dimensional space (log(n) as opposed to n). We also construct modified versions of all three surrogates to be consistent for any given α ∈ [0, 1/2].


Introduction
In classification problems, one often encounters cases where it would be better for the classifier to take no decision and abstain from predicting rather than make a wrong prediction. For example, in the problem of medical diagnosis with inexpensive tests as features, a conclusive decision is good, but in the face of uncertainty, it is better not to make a prediction and instead go for costlier tests.

Binary classification with an abstain option
For the case of binary classification, El-Yaniv and Wiener [7,8] call this problem selective classification. They study the fundamental trade-off between abstaining and predicting and give theoretical results, but the algorithms suggested by their theory are not computationally tractable due to the usage of ERM oracles.
Another branch of work for the binary classification case [2,28,13] has roots in decision theory, where abstaining is just another decision that incurs a cost. The main idea here is to find appropriate, computationally efficient optimization-based algorithms that give the optimal answer in the limit of infinite data. Yuan and Wegkamp [28] show that many standard convex optimization based procedures for binary classification, like logistic regression, least squares classification and exponential loss minimization (Adaboost), yield consistent algorithms for this problem. But as Bartlett and Wegkamp [2] show, the algorithm based on minimizing the hinge loss (SVM) requires a modification to be consistent. The suggested modification is rather simple: use a double hinge loss with three linear segments instead of the two segments in the standard hinge loss, where the ratio of the slopes of the two non-flat segments depends on the cost of abstaining α. Cortes et al. [4] learn a separate "rejector" function, in addition to a classifier, for identifying instances to reject. They also show that such an algorithm is consistent for this problem. There have been several empirical studies [10,11,12,9] as well on this topic.
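As an illustration, the double hinge modification can be sketched as follows. This is a minimal sketch under our own plausible parameterization (slope ratio (1 − α)/α between the two non-flat segments); the precise constants are those of Bartlett and Wegkamp [2].

```python
def double_hinge(z, alpha=0.25):
    """Sketch of a double hinge loss with three linear segments.

    For z >= 1 the loss is 0 (flat); on [0, 1) it falls with slope -1;
    for z < 0 it falls with the steeper slope -(1 - alpha)/alpha.
    The ratio of the two non-flat slopes thus depends on the
    abstention cost alpha (an assumed parameterization, see [2]).
    """
    a = (1.0 - alpha) / alpha  # slope ratio; a >= 1 when alpha <= 1/2
    if z >= 1.0:
        return 0.0
    if z >= 0.0:
        return 1.0 - z
    return 1.0 - a * z
```

Note that for α = 1/2 the ratio a equals 1 and the loss degenerates to the ordinary hinge loss, consistent with the remark later in the paper that the binary case forces α ≤ 1/2.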

Multiclass classification with an abstain option
In the case of multiclass classification with an abstain option, there has been empirical work [31,21,27]. However, to the best of our knowledge, there exists very little theoretical work on this problem. Zhang et al. [29] define a new family of surrogates for this problem, but their family of surrogates are known to be not consistent for the decision theoretic version of the problem. There has also been work on learning separate thresholds for rejection per class [15], but such algorithms are also not known to be consistent for this problem.
We fill this gap in the literature by providing a formal treatment of the multiclass classification problem with an abstain option in the decision theoretic setting. Our work can also be seen to be in the statistical decision theoretic setting, and can be seen to generalize and extend the works of Bartlett and Wegkamp [2], Yuan and Wegkamp [28] and Grandvalet et al. [13] to the multiclass setting. In particular, we give consistent algorithms for this problem.
The reject option is accommodated into the problem of n-class classification through the evaluation metric. We seek a function h : X → {1, 2, . . . , n, ⊥}, where X is the instance space, the n classes are denoted by {1, 2, . . . , n} = [n], and ⊥ denotes the action of abstaining, or the 'reject' option. The loss incurred by such a function on an example (x, y) with h(x) = t is given by
\[
\ell_\alpha(y, t) = \begin{cases} 0 & \text{if } t = y, \\ \alpha & \text{if } t = \perp, \\ 1 & \text{otherwise,} \end{cases} \tag{1.1}
\]
where α ∈ [0, 1] denotes the cost of abstaining. We will call this loss the abstain(α) loss.
It can easily be shown that the Bayes optimal risk for the above loss is attained by the function h*_α : X → [n] ∪ {⊥} given by
\[
h^*_\alpha(x) = \begin{cases} \arg\max_{y \in [n]} p_x(y) & \text{if } \max_{y \in [n]} p_x(y) \ge 1 - \alpha, \\ \perp & \text{otherwise,} \end{cases} \tag{1.2}
\]
where p_x(y) = P(Y = y | X = x). The above is often called 'Chow's rule' [3]. It can also be seen that the interesting range of values for α is [0, (n−1)/n], since for all α > (n−1)/n the Bayes optimal classifier for the abstain(α) loss never abstains. For example, in binary classification, only α ≤ 1/2 is meaningful, as higher values of α imply it is never optimal to abstain.
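Chow's rule is straightforward to state in code; here is a minimal sketch (the function name is ours, and ⊥ is represented by None):

```python
import numpy as np

def chow_rule(p, alpha):
    """Bayes optimal prediction under the abstain(alpha) loss.

    p: vector of conditional class probabilities p_x(y), summing to 1.
    Predicts the most probable class when its probability is at least
    1 - alpha, and abstains (returns None) otherwise.
    """
    p = np.asarray(p, dtype=float)
    y = int(np.argmax(p))          # 0-indexed most probable class
    return y if p[y] >= 1.0 - alpha else None
```

Since max_y p_x(y) ≥ 1/n always holds, any α > (n−1)/n makes the threshold 1 − α smaller than 1/n, so the rule above never abstains, matching the interesting range [0, (n−1)/n] noted in the text.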
For small α, the classifier h*_α acts as a high-confidence classifier and would be useful in applications like medical diagnosis. For example, if one wishes to learn a classifier for diagnosing an illness with 80% confidence, and recommend further medical tests if that is not possible, the ideal classifier would be h*_{0.2}, which is the minimizer of the abstain(0.2) loss. If α = 1/2, the Bayes classifier h*_α has a very appealing structure: a class y ∈ [n] is predicted only if the class y has a simple majority. The abstain(α) loss is also useful in applications where a 'greater than 1 − α conditional probability detector' can be used as a black box. For example, a greater than 1/2 conditional probability detector plays a crucial role in hierarchical classification [19].
The abstain(α) loss with α = 1/2 will be the main focus of our paper and will be the default choice when the abstain loss is referred to without any reference to α. This will be the case in Sections 3, 4, 5 and 7. In Section 6, we show how to extend our results to the case α ≤ 1/2. On the other hand, we leave the case α > 1/2 to future work; we explain why this case might be fundamentally different in Section 1.4.
Since the Bayes classifier h*_α depends only on the conditional distribution of Y|X, any algorithm that gives a consistent estimator of the conditional probability of the classes (e.g., minimizing the one-vs-all squared loss [17,25]) can be made into a consistent algorithm (with a suitable change in the decision) for this problem. However, smooth surrogates that estimate the conditional probability do much more than what is necessary to solve this problem. Consistent piecewise linear surrogate minimizing algorithms, on the other hand, do only what is needed, in accordance with Vapnik's dictum [23]: when solving a given problem, try to avoid solving a more general problem as an intermediate step.
For example, least squares classification, logistic regression and SVM are all consistent for standard binary classification, but SVMs avoid the strictly harder conditional probability estimation problem as an intermediate problem. Piecewise linear surrogates (like the hinge loss used in SVM) have other advantages like easier optimization and sparsity (in the dual) as well, hence finding consistent piecewise linear surrogates for the abstain loss is an important and interesting task.

Contributions
We show that the n-dimensional multiclass surrogate of Crammer and Singer (CS) [5] and the simple one-vs-all hinge (OVA) surrogate loss [20] both yield consistent algorithms for the abstain(1/2) loss. Notably, neither of these surrogates is consistent for the standard multiclass classification problem [22,16,30].
We then construct a new convex piecewise linear surrogate, which we call the binary encoded predictions (BEP) surrogate, that operates on a ⌈log₂(n)⌉-dimensional space and yields a consistent algorithm for the n-class abstain(1/2) loss. When optimized over comparable function classes, this algorithm is more efficient than the Crammer-Singer and one-vs-all algorithms, as it requires finding only ⌈log₂(n)⌉ functions over the instance space, as opposed to n functions. This result is surprising because it has been shown that one needs to minimize at least an (n − 1)-dimensional convex surrogate to get a consistent algorithm for the standard n-class problem, i.e., without the reject option [17]. Also, the only known generic way of generating consistent convex surrogate minimizing algorithms for an arbitrary loss [17,18], when applied to the n-class abstain loss, yields an n-dimensional surrogate.
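The dimensionality reduction behind the BEP surrogate rests on encoding the n classes as sign vectors in {−1, +1}^d with d = log₂(n). A minimal sketch of one such bijection (our own illustrative implementation of the binary-representation mapping used later in the paper, with −1 in place of 0):

```python
import numpy as np

def binary_code(y, d):
    """Map class y in {1, ..., 2^d} to a code vector in {-1, +1}^d.

    Uses the standard d-bit binary representation of (y - 1),
    with -1 in place of 0 (one possible choice of the bijection B).
    """
    bits = [(y - 1) >> j & 1 for j in reversed(range(d))]
    return np.array([1 if b else -1 for b in bits])

def decode(c):
    """Inverse of binary_code: recover the class from its code."""
    bits = (np.asarray(c) > 0).astype(int)
    return 1 + int("".join(map(str, bits)), 2)
```

For n = 8 classes, each class is thus represented by only d = 3 learned real-valued scores rather than 8, which is the source of the efficiency gain claimed above.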
We also give modified versions of the CS, OVA and BEP surrogates that yield consistent algorithms for the abstain(α) loss for any given α ∈ [0, 1/2].

The role of α
Conditional probability estimation based surrogates can be used to design consistent algorithms for the n-class problem with the reject option for any α ∈ (0, (n−1)/n), but the Crammer-Singer surrogate, the one-vs-all hinge, the BEP surrogate and their corresponding variants all yield consistent algorithms only for α ∈ [0, 1/2]. While this may seem restrictive, we contend that these form an interesting and useful set of problems to solve. We also suspect that abstain(α) problems with α > 1/2 are fundamentally more difficult than those with α ≤ 1/2, for the reason that the Bayes classifier h*_α(x) can be evaluated for α ≤ 1/2 without finding the maximum conditional probability: just check whether any class has conditional probability greater than 1 − α, as there can be only one such class. This is also evidenced by the more complicated partitions (more lines required to draw the partitions) of the simplex induced by the Bayes optimal classifier for α > 1/2, as shown in Figure 1.

Notation: Throughout the paper, we let R = (−∞, ∞) and R_+ = [0, ∞). Let Z, Z_+ denote the sets of all integers and non-negative integers, respectively. For n ∈ Z_+, we let [n] = {1, . . . , n}. For z ∈ R, we let z_+ = max(0, z). We denote by Δ_n the probability simplex in R^n: Δ_n = {p ∈ R^n_+ : Σ_{i=1}^n p_i = 1}. For n ∈ Z_+, we denote by 1_n and 0_n the n-dimensional all-ones and all-zeros vectors, and for i ∈ [n] we denote by e^n_i the n-dimensional vector with 1 in position i and 0 elsewhere. Often we omit the dimension n from 1_n, 0_n, e^n_i when it is clear from the context. For any vector u, we denote by u_(i) the i-th element of the components of u when sorted in descending order. We denote by sign(u) the sign of a scalar u, with sign(0) = 1.

Problem setup
In this section, we formally set up the problem of multiclass classification with an abstain option and explain the notion of consistency for the problem.
Let the instance space be X. Given training examples (X_1, Y_1), . . . , (X_m, Y_m) drawn i.i.d. from a distribution D on X × [n], the goal is to learn a prediction function h : X → [n] ∪ {⊥}.
For any given α ∈ [0, 1], the performance of a prediction function h : X → [n] ∪ {⊥} is measured via the abstain(α) loss ℓ_α from Equation (1.1). We denote the loss incurred on predicting t when the correct label is y by ℓ_α(y, t). The abstain(α) loss and a schematic representation of the Bayes classifier for various values of α, given by Equation (1.2), are shown in Figure 1 for n = 3.
Ideally, one wants the ℓ_α-error of the learned function,
\[
\mathrm{er}^{\alpha}_D[h] = \mathbf{E}_{(X,Y)\sim D}\big[\ell_\alpha(Y, h(X))\big],
\]
to be close to the optimal ℓ_α-error,
\[
\mathrm{er}^{\alpha,*}_D = \inf_{h : X \to [n] \cup \{\perp\}} \mathrm{er}^{\alpha}_D[h].
\]
An algorithm, which outputs a function h_m : X → [n] ∪ {⊥} on being given a random training sample as above, is said to be consistent w.r.t. ℓ_α if the ℓ_α-error of the learned function h_m converges in probability to the optimal for any distribution D:
\[
\mathrm{er}^{\alpha}_D[h_m] \xrightarrow{P} \mathrm{er}^{\alpha,*}_D .
\]
Here, the convergence in probability is over the learned classifier h_m as a function of the training sample distributed i.i.d. according to D.
However, minimizing the discrete ℓ_α-error directly is computationally difficult; therefore one instead uses a surrogate loss function ψ : [n] × R^d → R_+, for some d ∈ Z_+, and learns a function f : X → R^d by minimizing (approximately, based on the training sample) the ψ-error
\[
\mathrm{er}^{\psi}_D[f] = \mathbf{E}_{(X,Y)\sim D}\big[\psi(Y, f(X))\big].
\]
Predictions on new instances x ∈ X are then made by applying the learned function f and mapping back to predictions in the target space [n] ∪ {⊥} via some mapping pred : R^d → [n] ∪ {⊥}. Under suitable conditions, algorithms that approximately minimize the ψ-error based on a training sample are known to be consistent with respect to ψ, i.e., to converge in probability to the optimal ψ-error \(\mathrm{er}^{\psi,*}_D = \inf_f \mathrm{er}^{\psi}_D[f]\). Also, when ψ is convex in its second argument, the resulting optimization problem is convex and can be solved efficiently. Hence, we seek a surrogate and a predictor (ψ, pred), with ψ convex in its second argument, satisfying a bound of the following form for all f : X → R^d:
\[
\mathrm{er}^{\alpha}_D[\mathrm{pred} \circ f] - \mathrm{er}^{\alpha,*}_D \;\le\; \xi\big(\mathrm{er}^{\psi}_D[f] - \mathrm{er}^{\psi,*}_D\big),
\]
where ξ : R → R is increasing, continuous at 0 and ξ(0) = 0. A surrogate and predictor (ψ, pred) satisfying such a bound, known as an excess risk transform bound, would immediately give an algorithm consistent w.r.t. ℓ_α from an algorithm consistent w.r.t. ψ. We derive such bounds w.r.t. the ℓ_{1/2} loss for the Crammer-Singer surrogate, the one-vs-all hinge surrogate, and the BEP surrogate, with ξ a linear function.

Excess risk bounds for the Crammer-Singer and one vs all hinge surrogates
In this section, we give an excess risk bound relating the abstain loss ℓ_{1/2} to the Crammer-Singer surrogate ψ^{CS} [5] and to the one-vs-all hinge loss. Define the Crammer-Singer surrogate ψ^{CS} : [n] × R^n → R_+ and predictor pred^{CS}_τ : R^n → [n] ∪ {⊥} as
\[
\psi^{CS}(y, u) = \max_{j \ne y} \big(1 - (u_y - u_j)\big)_+ ,
\qquad
\mathrm{pred}^{CS}_\tau(u) = \begin{cases} \arg\max_{i \in [n]} u_i & \text{if } u_{(1)} - u_{(2)} > \tau, \\ \perp & \text{otherwise,} \end{cases}
\]
where (a)_+ = max(a, 0), u_(i) is the i-th element of the components of u when sorted in descending order, and τ ∈ (0, 1) is a threshold parameter.
Similarly, define the one-vs-all surrogate ψ^{OVA} : [n] × R^n → R_+ and predictor pred^{OVA}_τ : R^n → [n] ∪ {⊥} as
\[
\psi^{OVA}(y, u) = (1 - u_y)_+ + \sum_{i \ne y} (1 + u_i)_+ ,
\qquad
\mathrm{pred}^{OVA}_\tau(u) = \begin{cases} \arg\max_{i \in [n]} u_i & \text{if } u_{(1)} > \tau, \\ \perp & \text{otherwise,} \end{cases}
\]
where (a)_+ = max(a, 0), τ ∈ (−1, 1) is a threshold parameter, and ties are broken arbitrarily, say, in favor of the label y with the smaller index.
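A minimal sketch of the two predictors (⊥ is represented by None; function names are ours, and the threshold conventions follow the text: CS abstains on a small gap between the top two scores, OVA abstains on a small top score):

```python
import numpy as np

def pred_cs(u, tau):
    """CS-style predictor: predict the argmax of u when the gap
    between the two largest components exceeds tau, else abstain."""
    s = np.sort(u)[::-1]                   # u_(1) >= u_(2) >= ...
    return int(np.argmax(u)) if s[0] - s[1] > tau else None

def pred_ova(u, tau):
    """OVA-style predictor: predict the argmax of u when the largest
    component exceeds tau, else abstain."""
    return int(np.argmax(u)) if np.max(u) > tau else None
```

Note that `pred_cs` is unchanged if a constant is added to every coordinate of u, matching the shift-invariance of the CS surrogate discussed in the remark below.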
The following is the main result of this section, the proof of which is in Section 8.
The form of the abstaining region for the CS and OVA predictors arises from the properties of the surrogates. In particular, because the CS surrogate is invariant to adding a constant to all coordinates of the surrogate prediction u, the CS abstaining region has to depend on the difference between two coordinates of u.

Remark 2.
It has been pointed out previously by Zhang [30] that if the data distribution D is such that max_y p_x(y) > 0.5 for all x ∈ X, the Crammer-Singer surrogate ψ^{CS} and the one-vs-all hinge loss are consistent with the zero-one loss when used with the standard argmax predictor. This conclusion also follows from the theorem above. However, our result yields more: if the distribution satisfies the dominant class assumption only for some instances x ∈ X, the function learned by using the surrogate and predictor (ψ^{CS}, pred^{CS}_τ) or (ψ^{OVA}, pred^{OVA}_τ) gives the right answer for those instances having a dominant class, and fails gracefully by abstaining on the instances that do not.

Excess risk bounds for the BEP surrogate
The Crammer-Singer surrogate and the one-vs-all hinge surrogate, just like surrogates designed for conditional probability estimation, are defined over an n-dimensional domain. Thus any algorithm that minimizes these surrogates must learn n real-valued functions over the instance space. In this section, we construct a ⌈log₂(n)⌉-dimensional convex surrogate, which we call the binary encoded predictions (BEP) surrogate, and give an excess risk bound relating this surrogate and the abstain loss. In particular, these results show that the BEP surrogate is calibrated w.r.t. the abstain loss; this in turn implies that the convex calibration dimension (CC-dimension) [17] of the abstain loss is at most ⌈log₂(n)⌉.
The idea of learning log(n) predictors for an n-class classification problem has some precedent [1,24], but those works focus on the multiclass 0-1 loss and are not concerned with consistency or calibration of surrogates.
For the purpose of simplicity, let us assume n = 2^d for some positive integer d, and fix a bijection B : [n] → {−1, +1}^d encoding each class as a d-dimensional sign vector. Define the BEP surrogate ψ^{BEP} : [n] × R^d → R_+ and predictor pred^{BEP}_τ : R^d → [n] ∪ {⊥} as
\[
\psi^{BEP}(y, u) = \Big( \max_{j \in [d]} \big(-B_j(y)\, u_j\big) + 1 \Big)_+ ,
\qquad
\mathrm{pred}^{BEP}_\tau(u) = \begin{cases} B^{-1}(\mathrm{sign}(u)) & \text{if } \min_{j \in [d]} |u_j| > \tau, \\ \perp & \text{otherwise,} \end{cases}
\]
where sign(u) is applied componentwise, with sign(0) = 1, and τ ∈ (0, 1) is a threshold parameter.
To make the above definition clear, let us see what the surrogate and predictor look like for the case n = 4 and τ = 1/2. We have d = 2. Let us fix the mapping B such that B(y) is the standard d-bit binary representation of (y − 1), with −1 in place of 0. Figure 2 gives the partition induced by the predictor pred^{BEP}. The following is the main result of this section, the proof of which is in Section 8.
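For this n = 4, d = 2 example, the decoding step of the BEP predictor can be sketched as follows. This is a sketch based on the textual description (decode sign(u), with sign(0) = 1, through the mapping B when all coordinates of u clear the threshold τ in magnitude); names and representation choices are ours.

```python
import numpy as np

# Codes for n = 4, d = 2: B(y) is the 2-bit binary representation
# of (y - 1) with -1 in place of 0; classes are labeled 1, ..., 4.
CODES = {(-1, -1): 1, (-1, 1): 2, (1, -1): 3, (1, 1): 4}

def pred_bep(u, tau=0.5, codes=CODES):
    """Sketch of a BEP-style predictor: abstain unless every coordinate
    of u exceeds tau in magnitude; otherwise decode sign(u) to a class
    (with the convention sign(0) = 1)."""
    u = np.asarray(u, dtype=float)
    if np.min(np.abs(u)) <= tau:
        return None                        # abstain (the symbol ⊥)
    s = tuple(1 if v >= 0 else -1 for v in u)
    return codes[s]
```

The abstain region is thus an axis-aligned "cross" of width 2τ around the coordinate axes of R², matching the kind of partition shown in Figure 2.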
Remark. The excess risk bounds for the CS, OVA, and BEP surrogates suggest that τ = 1/2 is the best choice for the CS and BEP surrogates, while τ = 0 is the best choice for the OVA surrogate. However, intuitively, τ is the threshold converting confidence values into predictions, so it makes sense to use τ values closer to 0 (or −1 in the case of OVA) to predict aggressively in low-noise situations, and larger τ to predict conservatively in noisy situations. In practice, it makes sense to choose the parameter τ via cross-validation.

BEP surrogate optimization algorithm
In this section, we frame the problem of finding the linear (vector-valued) function that minimizes the BEP surrogate loss over a training set {(x_i, y_i)}_{i=1}^m, with x_i ∈ R^a and y_i ∈ [n], as a convex optimization problem. Once again, for simplicity we assume that the size of the label space is n = 2^d for some d ∈ Z_+. The primal and dual versions of the resulting optimization problem with a norm-squared regularizer are given below.
Primal problem:

Dual problem:

We optimize the dual as it can easily be extended to work with kernels. The structure of the constraints in the dual lends itself to a block coordinate ascent algorithm, in which each iteration optimizes over {β_{i,j} : j ∈ {0, . . . , d}} while fixing every other variable. Such methods have recently been proven to have an exponential convergence rate for SVM-type problems [26], and we expect results of that type to apply to our problem as well.
The problem to be solved at every iteration reduces to an ℓ2 projection of a vector g_i ∈ R^d onto a constraint set that is a simple variant of the ℓ1 ball of radius 1; such projections can be computed efficiently in O(d) time [6].
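For reference, the basic ℓ1-ball projection that this step builds on can be sketched as follows. This is the simple O(d log d) sort-based version; the expected-O(d) algorithm of Duchi et al. [6] replaces the sort with randomized pivoting.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the l1 ball of given radius.

    Sort-based O(d log d) variant: find the soft-threshold theta so
    that the shrunk vector has l1 norm equal to the radius.
    """
    v = np.asarray(v, dtype=float)
    if np.abs(v).sum() <= radius:
        return v.copy()                    # already inside the ball
    u = np.sort(np.abs(v))[::-1]           # magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```

The per-iteration subproblem in the dual is a variant of this projection over a modified constraint set; the same thresholding idea applies.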

Extension to abstain(α) loss for α ≤ 1/2
The excess risk bounds derived for the CS, OVA hinge and BEP surrogates apply only to the abstain(1/2) loss. However, it is possible to derive such excess risk bounds for abstain(α) with α ∈ [0, 1/2] via slight modifications to the CS, OVA and BEP surrogates.

H.G. Ramaswamy et al.
For any p ∈ Δ_n, the u ∈ R^n minimizing the expected surrogate loss E_{Y∼p}[ψ^{OVA}(Y, u)] or E_{Y∼p}[ψ^{CS}(Y, u)], and the u ∈ R^d minimizing E_{Y∼p}[ψ^{BEP}(Y, u)], takes one of n + 1 possible values. The modifications to these surrogates change these optimal values in exactly the right way to ensure that the modified surrogates are calibrated for the abstain(α) loss. See Equations (8.15) and (8.16) for the optimal u values for the OVA surrogate, Equations (8.1) and (8.2) for the CS surrogate, and Equations (8.28) and (8.29) for the BEP surrogate.
One can obtain similar excess risk bounds for these modified surrogates, as shown in the theorem below, the proof of which is in Section 8.

Theorem 6.1. Let n ∈ Z_+, τ ∈ (0, 1), τ′ ∈ (−1, 1) and α ∈ [0, 1/2]. Let n = 2^d. Then, for all f : X → R^d and g : X → R^n,

Remark. When n = 2, the Crammer-Singer surrogate, the one-vs-all hinge and the BEP surrogate all reduce to the hinge loss, and α is restricted to be at most 1/2 to ensure the relevance of the abstain option. Applying the above extension for α ≤ 1/2 to the hinge loss recovers the 'generalized hinge loss' of Bartlett and Wegkamp [2].

Experimental results
In this section, we give our experimental results for the proposed algorithms on both synthetic and real datasets. The synthetic data experiments illustrate the consistency of the three proposed algorithms for the abstain loss. The experiments on real data illustrate that one can achieve lower error rates on multiclass datasets if the classifier is allowed to abstain, and also show that the BEP algorithm is competitive with the other two algorithms.

Synthetic data
We optimize the Crammer-Singer surrogate, the one-vs-all hinge surrogate and the BEP surrogate over appropriate kernel spaces on a synthetic data set and show that the abstain(1/2) loss incurred by the trained model for all three algorithms approaches the Bayes optimal under various thresholds.
The dataset we used, with n = 8 classes and 2-dimensional features, was generated as follows. We randomly sample 8 prototype vectors v_1, . . . , v_8 ∈ R^2, with each v_y drawn independently from a zero-mean, unit-variance 2D Gaussian distribution N(0, I_2). These 8 prototype vectors correspond to the 8 classes. Each example (x, y) is generated by first picking y from one of the 8 classes uniformly at random; the instance x is then set as x = v_y + 0.65 · u, where u is independently drawn from N(0, I_2). We generated 12800 such (x, y) pairs for training, and another 10000 examples each for testing and hyper-parameter validation.
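The generation process just described can be sketched as follows (the function name and seed handling are ours; labels are 0-indexed for convenience):

```python
import numpy as np

def make_synthetic(m, n_classes=8, noise=0.65, seed=0):
    """Generate the synthetic dataset described above: one Gaussian
    prototype per class, and instances x = v_y + noise * N(0, I_2)."""
    rng = np.random.default_rng(seed)
    protos = rng.standard_normal((n_classes, 2))   # class prototypes
    y = rng.integers(n_classes, size=m)            # uniform labels
    x = protos[y] + noise * rng.standard_normal((m, 2))
    return x, y
```

With noise factor 0.65 the class-conditional clouds overlap substantially near the prototypes' boundaries, so the Bayes optimal abstain(1/2) classifier rejects a nontrivial fraction of instances, which is what makes this dataset informative for the consistency experiments.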
The CS, OVA and BEP surrogates were all optimized over a reproducing kernel Hilbert space (RKHS) with a Gaussian kernel and the standard norm-squared regularizer. The kernel width parameter and the regularization parameter were chosen by grid search using the separate validation set. As Figure 3 indicates, the expected abstain risk incurred by the trained model approaches the Bayes risk with increasing training data for all three algorithms and intermediate τ values. The excess risk bounds in Theorems 3.1 and 4.1 break down when the threshold parameter τ lies in {0, 1} for the CS and BEP surrogates, and in {−1, 1} for the OVA surrogate. This is supported by the observation that in Figure 3, the curves corresponding to these thresholds perform poorly. In particular, using τ = 0 for the CS and BEP algorithms implies that the resulting algorithms never abstain.
Though all three surrogate minimizing algorithms we consider are consistent w.r.t. the abstain loss, we find that the BEP and OVA algorithms use less computation time and fewer samples than the CS algorithm to attain the same error. We note that for the BEP surrogate to perform well as above, it is critical to use a flexible function class (such as the RBF kernel induced RKHS above). In particular, when optimized over a linear kernel function class, the BEP surrogate performs poorly (experiments not shown here), due to its restricted representation power.

Real data
We ran experiments on real multiclass datasets from the UCI repository, the details of which are in Table 1. For the yeast, letter, vehicle and image datasets, a standard train/test split is not indicated, hence we created a random split ourselves. All three algorithms (CS, OVA and BEP) were optimized over an RKHS with a Gaussian kernel and the standard norm-squared regularizer. The kernel width and regularization parameters were chosen through validation: 10-fold cross-validation in the case of the satimage, yeast, vehicle and image datasets, and a 75-25 split of the training set into train and validation sets for the letter and covertype datasets. For simplicity, we set τ = 0 (or τ = −1 for OVA) during the validation phase in the first set of experiments. In the second set of experiments, we chose the value of τ along with the kernel width and regularization parameters to optimize the abstain(1/2) loss. The results of the first set of experiments with the CS, OVA and BEP algorithms are given in Table 2. The rejection rate is fixed at a given level (0%, 20% and 40%) by choosing the threshold τ for each algorithm and dataset appropriately. As can be seen from the table, the BEP algorithm's performance is comparable to the OVA, and better than the CS algorithm. However, Table 4, which gives the training and testing times for the algorithms, reveals that the BEP algorithm runs the fastest, making the BEP algorithm a good option for large datasets. The main reason for the observed speedup of the BEP is that it learns only ⌈log₂(n)⌉ functions for an n-class problem, and hence the speedup factor of the BEP over the OVA would potentially be better for larger n.
In the second set of experiments, we fix the cost of abstaining α to be equal to 1/2. The kernel width, regularization and threshold parameters are chosen to optimize the abstain(1/2) loss in the validation phase. The abstain(1/2) loss values for the CS, OVA and BEP algorithms with tuned thresholds are given in Table 3. The most interesting values are those on the vehicle and yeast datasets, where the final algorithms chose thresholds that abstain on the test set and perform marginally better than predicting some class on all instances, the loss values for which are simply given by the first three columns of Table 2.

Proofs
Both Theorems 3.1 and 4.1 follow from Theorem 6.1, whose proof we divide into three separate parts below.

Modified Crammer-Singer surrogate
Let γ(a) = max(a, −1). Define the sets U^τ_1, . . . , U^τ_n, U^τ_⊥ such that U^τ_i is the set of vectors u ∈ R^n for which pred^{CS}_τ(u) = i. The following lemma gives some crucial, but straightforward to prove, (in)equalities satisfied by the Crammer-Singer surrogate.
where e y is the vector in R n with 1 in the y th position and 0 everywhere else.
We will prove the following theorem.
From Equations (8.6) and (8.7) we have p_{y′} ≤ 1 − p_y ≤ α and u_(1) = u_y > u_(2) + τ. Let u′ ∈ R^n be such that u′_y = u_{y′}, u′_{y′} = u_y, and u′_i = u_i for all i ∉ {y, y′}. The second inequality above follows from the reasoning that the term is minimized when (u_(1) − u_(2)) is as small as possible, which is τ in this case. From Equations (8.9) and (8.10) we have that ⊥ ∈ argmin_t ⟨p, ℓ_α(·, t)⟩.

Case 2a: u ∈ U^τ_⊥ (i.e., pred^{CS}_τ(u) = ⊥). The RHS of Equation (8.5) is zero, and hence the bound becomes trivial.

Modified one-vs-all hinge
Define the sets U^τ_1, . . . , U^τ_n, U^τ_⊥ such that U^τ_i is the set of vectors u ∈ R^n for which pred^{OVA}_τ(u) = i. The following lemma gives some crucial, but straightforward to prove, (in)equalities satisfied by the OVA hinge surrogate.