Calibrated asymmetric surrogate losses

Abstract: Surrogate losses underlie numerous state-of-the-art binary classification algorithms, such as support vector machines and boosting. The impact of a surrogate loss on the statistical performance of an algorithm is well understood in symmetric classification settings, where the misclassification costs are equal and the loss is a margin loss. In particular, classification-calibrated losses are known to imply desirable properties such as consistency. While numerous efforts have been made to extend surrogate loss-based algorithms to asymmetric settings, to deal with unequal misclassification costs or training data imbalance, considerably less attention has been paid to whether the modified loss is still calibrated in some sense. This article extends the theory of classification-calibrated losses to asymmetric problems. As in the symmetric case, it is shown that calibrated asymmetric surrogate losses give rise to excess risk bounds, which control the expected misclassification cost in terms of the excess surrogate risk. This theory is illustrated on the class of uneven margin losses, and the uneven hinge, squared error, exponential, and sigmoid losses are treated in detail.


Introduction
Surrogate losses are key ingredients in many of the most successful modern classification algorithms, including support vector machines and boosting. These losses are valued for their computational qualities, such as convexity, and facilitate the development of efficient algorithms for large-scale data sets. Given the success of surrogate loss-based algorithms, there has understandably been considerable interest in extending them from traditional binary classification to other learning problems.
In this work, we consider surrogate losses in the context of what might be called asymmetric binary classification problems. By this, we mean at least one of the following two descriptions applies to the learning problem: (1) the misclassification costs are asymmetric, meaning the performance measure is a cost-sensitive classification risk; or (2) the loss is asymmetric, meaning it is not a margin loss, and therefore the two classes are treated differently.
Unfortunately, in most cases, these additional scaling factors are set in a heuristic fashion or treated as tuning parameters, without regard for the theoretical statistical properties of the algorithms (some exceptions are noted below). Given the considerable interest in asymmetric binary classification problems, and given the proliferation of heuristic asymmetric surrogate losses, there is a need for a theory to guide practitioners in the design of such losses, and to enable performance analysis.
To address this need, we present a theory for calibrated asymmetric surrogate losses. Intuitively, a surrogate loss is calibrated if convergence of the surrogate excess risk to zero implies convergence of the target excess risk to zero. Calibration has been used to establish consistency of several classification algorithms in the traditional cost-insensitive setting [3,14,15,30,44]. An elegant theory for calibrated surrogate losses was developed by Bartlett, Jordan and McAuliffe [2] and extended by Steinwart [31]. However, these works do not address the asymmetric classification problem considered here. Nonetheless, we will show that the techniques of these two works can be extended to the asymmetric setting.
The primary contribution of this work is to extract and synthesize certain key insights of [2] and [31], generalize and tailor them to the asymmetric classification problem, and present them in a sufficiently general way that they can be adopted in a variety of practical scenarios. The broader purpose of this article is to offer a more rigorous framework to those researchers who continue to develop and apply algorithms based on asymmetric surrogate losses.
The rest of the paper is structured as follows. Section 2 discusses background material and related work on calibrated surrogate losses and excess risk bounds. Section 3 develops a general framework for calibrated asymmetric surrogate losses and excess risk bounds. The special case of cost-insensitive classification with asymmetric losses is considered, and a refined treatment is also given for the case of convex losses. Section 4 examines a special class of asymmetric surrogate losses, called uneven margin losses, in detail. A concluding discussion is offered in Section 5. Appendices A, B, and C, respectively, contain additional connections to Steinwart [31] and calibration functions, proofs of supporting lemmas, and uneven margin loss details.

Background and related work
Binary classification is concerned with the prediction of a label Y ∈ {−1, 1} from a feature vector X by means of a classifier. A classifier will be represented as a mapping x ↦ sign(f(x)), where f is a real-valued function, called a decision function in this context. The goal of classification is to learn f from a training sample (X_1, Y_1), ..., (X_n, Y_n). When the cost of misclassifying X does not depend on Y, the performance of f is typically measured by the cost-insensitive risk R(f) = E_{X,Y}[1_{sign(f(X)) ≠ Y}]. Unfortunately, the minimization of the empirical risk (1/n) Σ_{i=1}^n 1_{sign(f(X_i)) ≠ Y_i}, over some class of decision functions, is often intractable. Therefore it is common in practice to instead minimize the empirical version of the L-risk R_L(f) = E_{X,Y}[L(Y, f(X))], where L(y, t) is some surrogate loss, chosen for its computational qualities such as convexity.
For example, support vector machines are obtained by minimizing the regularized empirical L-risk (1/n) Σ_{i=1}^n L(Y_i, f(X_i)) + λ‖f‖²_H over f ∈ H, where L(y, t) = max{0, 1 − yt} is the hinge loss, H is a reproducing kernel Hilbert space [32], and λ > 0 is a regularization parameter. As another example, AdaBoost can be viewed as functional gradient descent of the empirical L-risk, where L(y, t) = exp(−yt) is the exponential loss, and minimization is performed over the set of linear combinations of decision functions from some base class [19].
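For illustration, the following sketch (with made-up data, not from the paper) compares the empirical 0/1 risk with the empirical hinge and exponential L-risks of a fixed linear decision function; both surrogates upper-bound the 0/1 loss pointwise, so the empirical L-risks dominate the empirical 0/1 risk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
# labels drawn from a logistic model (illustrative only)
Y = np.where(rng.random(n) < 1 / (1 + np.exp(-2 * X)), 1, -1)

def empirical_risks(f_values, Y):
    margins = Y * f_values
    zero_one = np.mean(margins <= 0)                 # empirical 0/1 risk (sign(0) = -1)
    hinge = np.mean(np.maximum(0.0, 1.0 - margins))  # L(y, t) = max{0, 1 - yt}
    exponential = np.mean(np.exp(-margins))          # L(y, t) = exp(-yt)
    return zero_one, hinge, exponential

z, h, e = empirical_risks(2 * X, Y)  # linear decision function f(x) = 2x
# both surrogates dominate the 0/1 loss pointwise, hence also in the mean
assert z <= h and z <= e
```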
Bartlett et al. [2] study conditions under which consistency with respect to an L-risk implies consistency with respect to the target risk R(f). To be more specific, let R* and R*_L denote the minimal risk and L-risk, respectively, over all possible decision functions. The quantities R(f) − R* and R_L(f) − R*_L will be referred to as the target excess risk and surrogate excess risk, respectively. We will also use the term regret interchangeably with excess risk. Bartlett et al. examine when there exists an invertible function θ with θ(0) = 0 such that θ(R(f) − R*) ≤ R_L(f) − R*_L for all f and all distributions on (X, Y). We refer to such a relationship as an excess risk bound or regret bound. Bartlett et al. study margin losses, which have the form L(y, t) = φ(yt) for some φ : R → [0, ∞). They show that non-trivial surrogate regret bounds exist precisely when L is classification-calibrated, which is a technical condition they develop. Note that margin losses are symmetric in the sense that L(y, t) = L(−y, −t).
This work extends the work of Bartlett et al. in two ways. First, we consider risks that account for label-dependent misclassification costs. Second, we study asymmetric surrogate losses, not just margin losses. Such losses have advantages for training with imbalanced data, as discussed below.
We develop the notion of α-classification calibrated losses, and show that non-trivial excess risk bounds exist when L is α-classification calibrated, where α ∈ (0, 1) represents the misclassification cost asymmetry. This condition is a natural generalization of classification calibration. We also give results that facilitate the calculation of these bounds, and the verification of which losses are α-classification calibrated.
To illustrate the theory of calibrated asymmetric surrogate losses, we study in some detail the class of uneven margin losses, which have the form L(y, t) = 1_{y=1} φ(t) + 1_{y=−1} β φ(−γt) for some φ : R → [0, ∞) and β, γ > 0. Various instances of such losses have appeared in the literature (see Sec. 4 for specific references), primarily as heuristic modifications of margin losses to account for cost asymmetry or imbalanced data sets. They are computationally attractive because they can typically be optimized by modifications of margin-based algorithms. However, statistical aspects of these losses have not been studied. We characterize when they are α-classification calibrated and compute explicit surrogate regret bounds for four specific examples of φ.
When applied to uneven margin losses, our work has practical implications for adapting well-known algorithms, such as Adaboost and support vector machines, to settings with imbalanced data or label-dependent costs. These are discussed in the concluding section.
Steinwart [31] extends the work of Bartlett et al. in a very general way that encompasses several supervised and unsupervised learning problems. He applies this framework to cost-sensitive classification, but restricts his attention to margin-based losses. His framework provides for an alternate derivation of an excess risk bound for the asymmetric binary classification problem. This bound is equivalent to the bound presented in Theorem 3.1 below, which is obtained by generalizing the approach of Bartlett et al. For completeness, this alternate perspective is presented in Appendix A.
Reid and Williamson [23,25] also study α-classification calibrated losses and derive surrogate regret bounds for cost-sensitive classification. Their focus is on proper losses and class probability estimation, and unlike the present work, they impose certain conditions on the surrogate loss, such as differentiability everywhere. Therefore they do not address important losses such as the hinge loss. In addition, their bounds are not in the form of (2.1), but rather are stated implicitly. We also note that their examples of surrogate regret bounds [23] consider only margin losses. Santos-Rodríguez et al. [27] apply Bregman divergences to multiclass cost-sensitive classification, also with an emphasis on proper losses and posterior probability estimation. A relationship between the present work and proper losses is discussed at the end of Section 4.
Zhang [43] studies classification-calibrated losses for multiclass classification and establishes consistency of various algorithms. While he does consider a cost-sensitive risk, excess risk bounds are only developed for the cost-insensitive case. Furthermore, the specific losses considered are multi-class margin losses, and therefore do not accommodate asymmetric losses such as uneven margin losses. Tewari and Bartlett [35] also study classification calibrated losses for multiclass classification. They also consider the case of equal misclassification costs, and their examples are symmetric in nature.
Scott [28] develops excess risk bounds for cost-sensitive classification with example-dependent costs. The setting considered there encompasses the setting here as a special case. However, when specialized to the present setting, those results are less precise and extensive than what we are able to obtain by a more direct analysis. For example, those results require distributional assumptions, and the excess risk bounds involve unknown constants, whereas here the results are distribution-free and bounds can be calculated explicitly.
Among the numerous approaches cited in the introduction, some authors have employed calibrated loss functions in the design of cost-sensitive classification algorithms [12,16,18]. The losses of Lin, Lee and Wahba [12] and Masnadi-Shirazi and Vasconcelos [16] are special cases of the losses considered here, while Masnadi-Shirazi and Vasconcelos [18] present a general procedure for constructing losses that are calibrated and give rise to proper losses for class probability estimation. In these papers, excess risk bounds are not derived, and consistency of these algorithms is not established. With the results presented in this paper, it is now possible to prove cost-sensitive consistency for a wide class of algorithms based on surrogate losses. See Section 3.1.
We further note that the two recent papers by Masnadi-Shirazi and Vasconcelos [16,18], on cost-sensitive boosting and support vector machines, demonstrate excellent performance relative to competing algorithms. This is evidence for the practical advantage of uneven margin losses and of requiring asymmetric surrogate losses to be calibrated.
Additional comparisons to related work are given throughout the paper. Finally, we remark that in the literature, the terms Fisher consistent and admissible have also been used for the term classification-calibrated.

Surrogate losses and regret bounds
Let (X, Y ) have distribution P on X × {−1, 1}. Let F denote the set of all measurable functions f : X → R. Every f ∈ F defines a classifier by the rule x → sign(f (x)), and in this context f is called a decision function. We adopt the convention sign(0) = −1.
A loss for binary classification is a measurable function L : {−1, 1} × R → [0, ∞). Any loss can be written L(y, t) = 1_{y=1} L_1(t) + 1_{y=−1} L_{−1}(t). We refer to L_1 and L_{−1} as the partial losses of L.
The cost-sensitive classification loss with cost parameter α ∈ (0, 1) is U_α(y, t) = (1 − α) 1_{y=1} 1_{t≤0} + α 1_{y=−1} 1_{t>0}. When L = U_α, we write R_α(f) and R*_α instead of R_{U_α}(f) and R*_{U_α}. Although other parametrizations of cost-sensitive classification losses are possible, this one is convenient because an optimal classifier is sign(η(x) − α), where η(x) := P(Y = 1 | X = x). See Lemma 3.1 below.
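The threshold rule can be verified numerically. The sketch below assumes the parametrization in which false negatives cost 1 − α and false positives cost α (one convention consistent with the optimal rule sign(η(x) − α)); `cond_cost_risk` is a hypothetical helper name, not from the paper:

```python
import numpy as np

def cond_cost_risk(eta, predict_pos, alpha):
    # Conditional cost-sensitive risk at a point with P(Y = 1 | X = x) = eta.
    # Assumed convention: false negatives cost (1 - alpha), false positives cost alpha.
    if predict_pos:
        return alpha * (1 - eta)   # only the negative class is misclassified
    return (1 - alpha) * eta       # only the positive class is misclassified

for alpha in (0.2, 0.5, 0.8):
    for eta in np.linspace(0.0, 1.0, 101):
        if abs(eta - alpha) < 1e-9:
            continue  # at eta = alpha both predictions are equally good
        best_is_pos = cond_cost_risk(eta, True, alpha) < cond_cost_risk(eta, False, alpha)
        assert best_is_pos == (eta > alpha)  # optimal classifier is sign(eta - alpha)
```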
We are motivated by applications where it is desirable to minimize the U_α-risk, but the empirical U_α-risk cannot be optimized efficiently. In such situations it is common to minimize the (empirical) L-risk for some surrogate loss L that has a computationally desirable property, such as differentiability or convexity in the second argument.
The following lemma collects some important properties of the risk associated to the cost-sensitive classification loss U α .
Lemma 3.1. For any f ∈ F,

R_α(f) − R*_α = E_X[ |η(X) − α| · 1_{sign(f(X)) ≠ sign(η(X) − α)} ];

in particular, f*(x) = η(x) − α minimizes R_α. The proof appears in Appendix B.

This section has three parts. In 3.1 the work of Bartlett, Jordan and McAuliffe [2], on surrogate regret bounds for margin losses and cost-insensitive classification, is extended to general losses and cost-sensitive classification. In 3.2, the important special case of cost-insensitive classification with general losses is treated, and in 3.3, some results for the case of convex partial losses are presented.

C. Scott
Intuitively, L is α-CC if, for all x such that η(x) ≠ α, the value of t = f(x) minimizing the conditional L-risk has the same sign as the optimal predictor η(x) − α.
We will argue that an excess risk bound exists if and only if L is α-CC, and give an explicit construction of the bound in terms of L. The construction of the bound is, intuitively, based on two ideas. First, conditioned on X, the conditional target regret ε = C_α(η, t) − C*_α(η) is related to the worst possible conditional surrogate regret, given η = η(X). This is captured by the quantity ν_{L,α}(ε) below. Second, a bound in terms of the excess risks is obtained by integrating over X.
To preserve the inequality in the second step, it is necessary to replace ν_{L,α} by a tight convex lower bound, so that Jensen's inequality may be applied. The details now follow. Denote

H_{L,α}(η) := C^−_{L,α}(η) − C*_L(η),

and, for 0 ≤ ε ≤ min(α, 1 − α),

ν_{L,α}(ε) := min{ H_{L,α}(α − ε), H_{L,α}(α + ε) },

while for α ≥ 1/2 and 1 − α < ε ≤ α, ν_{L,α}(ε) := H_{L,α}(α − ε) (symmetrically, for α < 1/2 and α < ε ≤ 1 − α, ν_{L,α}(ε) := H_{L,α}(α + ε)). Finally, set ψ_{L,α} := ν**_{L,α}, where g** denotes the Fenchel-Legendre biconjugate of g. The biconjugate of g is the largest convex lower semi-continuous function that is ≤ g, and is defined by

g**(r) = inf{ s : (r, s) ∈ cl(co(Epi g)) },

where Epi g = {(r, s) : g(r) ≤ s} is the epigraph of g, co denotes the convex hull, and cl denotes set closure.
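The biconjugate can be visualized numerically: on a bounded grid, the largest convex minorant of a function is its lower convex hull. The sketch below (with an illustrative, made-up ν having a kink) uses a standard monotone-chain hull; `convex_minorant` is a hypothetical helper, not from the paper:

```python
import numpy as np

def convex_minorant(xs, ys):
    # Greatest convex minorant of the points (xs[i], ys[i]) on the grid xs,
    # computed as the lower convex hull -- a discrete stand-in for the
    # biconjugate g** on a bounded interval.
    hull_x, hull_y = [xs[0]], [ys[0]]
    for x, y in zip(xs[1:], ys[1:]):
        # pop trailing hull points that would make the slopes decrease
        while len(hull_x) >= 2:
            s_prev = (hull_y[-1] - hull_y[-2]) / (hull_x[-1] - hull_x[-2])
            s_new = (y - hull_y[-1]) / (x - hull_x[-1])
            if s_new < s_prev:
                hull_x.pop()
                hull_y.pop()
            else:
                break
        hull_x.append(x)
        hull_y.append(y)
    return np.interp(xs, hull_x, hull_y)

xs = np.linspace(0.0, 1.0, 201)
nu = np.where(xs < 0.5, 0.0, xs - 0.25)  # toy nu with a jump at 1/2
psi = convex_minorant(xs, nu)
assert np.all(psi <= nu + 1e-12)  # psi is a convex minorant of nu
```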
The next lemma gives some important properties of the above quantities.
Theorem 3.1. Let L be a loss and α ∈ (0, 1).
1. For all f ∈ F and all distributions P, ψ_{L,α}(R_α(f) − R*_α) ≤ R_L(f) − R*_L.
2. L is α-CC if and only if ψ_{L,α}(ε) > 0 for all ε > 0.
Notice that the first part of the theorem holds for all losses. However, it is possible that ψ_{L,α} is not invertible. Since ψ_{L,α}(0) = 0 and ψ_{L,α} is convex, this would mean ψ_{L,α}(ε) = 0 for 0 ≤ ε ≤ ε_0, for some ε_0 > 0, and it could then happen that the surrogate regret tends to zero while the target regret remains bounded away from zero. The second part of the theorem says precisely when an excess risk bound exists.
Since ψ_{L,α}(0) = 0 and ψ_{L,α} is nondecreasing, the same is true of ψ^{−1}_{L,α}. As a result, we can show that an algorithm that is consistent for the L-risk is also consistent for the α cost-sensitive classification risk.
Hence R_α(f̂_n) − R*_α → 0 with probability one. At this point in the exposition, it would be desirable to give an example of an excess risk bound for a specific loss. However, some additional results are needed to enable the calculation of H_{L,α}. These are developed in the next two subsections. Then, in Section 4, we will calculate some explicit bounds for uneven margin losses.

Cost-insensitive classification
We turn our attention to the cost-insensitive, or 0/1, loss U(y, t) = 1_{y=1} 1_{t≤0} + 1_{y=−1} 1_{t>0}. This loss is not only important in its own right, but the associated quantity H_L, defined below, is useful for calculating H_{L,α} when α ≠ 1/2, as explained below.
The results in this section generalize those of Bartlett, Jordan and McAuliffe [2], who focus on margin losses. Here no restrictions are placed on the partial losses L 1 and L −1 .
For an arbitrary loss L, define H_L(η) := C^−_L(η) − C*_L(η). Also define, for ε ∈ [0, 1],

ν_L(ε) := min{ H_L((1 + ε)/2), H_L((1 − ε)/2) }.

The following definition was introduced by Bartlett, Jordan and McAuliffe [2] in the context of margin losses.
If H_L(η) > 0 for all η ≠ 1/2, then L is said to be classification calibrated, and we write: L is CC.
For margin losses, this coincides with the definition of [2], and our H_L equals their ψ̃. Also note that H_L(η) = H_{L,1/2}(η), and therefore L is CC iff L is 1/2-CC. In this subsection, C(η, t), C^−(η), and C*(η) abbreviate C_U(η, t), C^−_U(η), and C*_U(η), respectively. Theorem 3.2. Let L be a loss.
1. For any f ∈ F and any distribution P, ψ_L(R(f) − R*) ≤ R_L(f) − R*_L, where ψ_L := ν**_L.
2. L is CC if and only if ψ_L(ε) > 0 for all ε > 0.
Proof. The proof follows from Theorem 3.1 and the relationships between C(η, t), H_L, and their cost-sensitive counterparts at α = 1/2. Thus, to prove 1, note that H_L(η) = H_{L,1/2}(η) and apply Theorem 3.1 with α = 1/2. When L is a margin loss, H_L is symmetric with respect to η = 1/2, and the above result reduces to the surrogate regret bound established by Bartlett, Jordan and McAuliffe [2].
The following extends a result for margin losses noted by Steinwart [31]. For any loss L, we can express H_{L,α} in terms of H_L. This facilitates the calculation of H_{L,α} and therefore of ν_{L,α} and ψ_{L,α}.
Given the loss L(y, t) = 1_{y=1} L_1(t) + 1_{y=−1} L_{−1}(t), define the α-weighted loss L_α(y, t) := (1 − α) 1_{y=1} L_1(t) + α 1_{y=−1} L_{−1}(t).
Using these expressions, we can relate C_{L_α} to C_L as C_{L_α}(η, t) = s(η) C_L(η̃, t), where s(η) := (1 − α)η + α(1 − η) and η̃ := (1 − α)η / s(η). This observation gives rise to the following result.
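Under the parametrization assumed in this sketch (positive-class weight 1 − α, negative-class weight α), the identity C_{L_α}(η, t) = s(η) C_L(η̃, t), with s(η) = (1 − α)η + α(1 − η) and η̃ = (1 − α)η / s(η), can be checked numerically, here with hinge partial losses:

```python
import numpy as np

# hinge partial losses (any partial losses would do)
L1 = lambda t: max(0.0, 1.0 - t)
Lm1 = lambda t: max(0.0, 1.0 + t)

def C_L(eta, t):
    # conditional L-risk
    return eta * L1(t) + (1 - eta) * Lm1(t)

def C_L_alpha(eta, t, alpha):
    # conditional risk of the alpha-weighted loss (assumed weighting:
    # (1 - alpha) on the positive class, alpha on the negative class)
    return (1 - alpha) * eta * L1(t) + alpha * (1 - eta) * Lm1(t)

for alpha in (0.3, 0.7):
    for eta in (0.1, 0.5, 0.9):
        for t in np.linspace(-2, 2, 9):
            s = (1 - alpha) * eta + alpha * (1 - eta)
            eta_tilde = (1 - alpha) * eta / s
            assert abs(C_L_alpha(eta, t, alpha) - s * C_L(eta_tilde, t)) < 1e-12
```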

Convex partial losses
When the partial losses L 1 and L −1 are convex, we can deduce some convenient characterizations of α-CC losses.
A similar result appears in Reid and Williamson [24], and when the loss is a composite proper loss the results are equivalent. Their result is expressed in the context of class probability estimation, while our result is tailored directly to classification. Although the proofs are essentially the same, our setting allows us to state the result without assuming the loss is differentiable everywhere. Thus, it encompasses losses that are not suitable for class probability estimation, such as the uneven hinge loss described below. We also make an observation in the special case where α = 1/2 and L is a margin loss, also noted by Reid and Williamson [24]. Then L′_1(0) = φ′(0) and L′_{−1}(0) = −φ′(0), and (3.4) is equivalent to φ′(0) < 0, the condition identified by Bartlett, Jordan and McAuliffe [2].
The following result facilitates the calculation of regret bounds.

Uneven margin losses
We now apply the preceding theory to a special class of asymmetric losses.

Definition 4.1. Let φ : R → [0, ∞), β, γ > 0, and α ∈ (0, 1). We refer to the losses L(y, t) = 1_{y=1} φ(t) + 1_{y=−1} β φ(−γt) and L_α(y, t) = (1 − α) 1_{y=1} φ(t) + α 1_{y=−1} β φ(−γt) as uneven margin losses.
When β = γ = 1, L in Definition 4.1 is a conventional margin loss, and L α can be called an α-weighted margin loss. Since they differ from margin losses by a couple of scalar parameters, empirical risks based on uneven margin losses can typically be optimized by modified versions of margin-based algorithms.
Before proceeding, we offer a couple of comments on Definition 4.1. First, although β may appear redundant in L_α, it is not: α is fixed at a desired cost parameter, and thus is not tunable, whereas β is free. Second, there would be no added benefit from a loss of the form 1_{y=1} φ(γ′t) + 1_{y=−1} β φ(−γt). We may assume γ′ = 1 without loss of generality, since scaling a decision function f by a positive constant does not alter the induced classifier. However, alternate parametrizations such as 1_{y=1} φ((1 − ρ)t) + 1_{y=−1} β φ(−ρt), ρ ∈ (0, 1), might be desirable in some situations.
A common motivation for uneven margin losses is classification with an imbalanced training data set. In imbalanced data, one class has (often substantially) more representation than the other, and margin losses have been observed to perform poorly in such situations. Weighted margin losses, which have the form α′ 1_{y=1} φ(t) + (1 − α′) 1_{y=−1} φ(−t), are often used as a heuristic for imbalanced data, with α′ serving as a tunable parameter. However, there is no reason why the α′ that yields good performance on imbalanced data will be the desired cost parameter α. In other words, this heuristic typically results in losses that are not α-CC.
The parameter γ offers another means to accommodate imbalanced data. Such losses have previously been explored in the context of specific algorithms, including the perceptron [11], boosting [16], and support vector machines [10,18,42]. Uneven margins (γ ≠ 1) have been found to yield improved empirical performance in classification problems involving label-dependent costs and/or imbalanced data.
Prior work has not addressed whether uneven margin losses, in the general form presented here, are CC or α-CC. The following result clarifies the issue for convex φ.
Corollary 4.1. Let φ be convex and differentiable at 0, let β, γ > 0, and let L, L_α be the associated uneven margin losses as in Definition 4.1. The following are equivalent: (a) L is CC; (b) L_α is α-CC; (c) φ′(0) < 0 and βγ = 1. The equivalence of (a) and (b) follows from Theorem 3.3, and the equivalence of (b) and (c) follows from Theorem 3.4.
This result implies that for any α ∈ (0, 1) and γ > 0, the loss (1 − α) 1_{y=1} φ(t) + (α/γ) 1_{y=−1} φ(−γt) is α-CC provided φ is convex and φ′(0) < 0. For such φ we have therefore reached the following conclusion: γ is a parameter that can be tuned as needed, such as for imbalanced data, while the loss remains α-CC. Figure 1 displays the partial losses for three common φ and for three values of γ. If φ is not convex, then uneven margin losses can still be α-CC, but typically only for particular values of α, as the sigmoid example below illustrates.
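The calibration of the uneven exponential loss with β = 1/γ can be checked numerically under the parametrization assumed in this sketch (cost weights 1 − α and α): the minimizer of the conditional risk is positive exactly when η > α.

```python
import numpy as np

def uneven_exp_cond_risk(eta, t, alpha, gamma):
    # conditional risk of the uneven exponential loss with beta = 1/gamma,
    # assuming weights (1 - alpha) on the positive class and alpha on the negative
    return (1 - alpha) * eta * np.exp(-t) + alpha * (1 - eta) * (1 / gamma) * np.exp(gamma * t)

ts = np.linspace(-10, 10, 20001)  # fine grid for a brute-force minimization
for alpha in (0.25, 0.5, 0.75):
    for gamma in (0.5, 1.0, 3.0):
        for eta in (0.1, 0.4, 0.6, 0.9):
            if abs(eta - alpha) < 1e-9:
                continue
            t_star = ts[np.argmin(uneven_exp_cond_risk(eta, ts, alpha, gamma))]
            assert (t_star > 0) == (eta > alpha)  # sign matches sign(eta - alpha)
```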

Uneven squared error loss
Let φ(t) = (1 − t)², so that C_L(η, t) = η(1 − t)² + (1 − η)β(1 + γt)². The minimizer of C_L(η, t) is t*(η) = (η − (1 − η)βγ) / (η + (1 − η)βγ²).
This yields (after some algebra) C*_L(η) = (1 + γ)² β η(1 − η) / (η + (1 − η)βγ²), and therefore H_L(η) = C_L(η, 0) − C*_L(η) = η + (1 − η)β − C*_L(η).

Applying Equation (3.3), after some simplification, we obtain H_{L_α,α}. Figure 3 shows plots of H_{L_α,α} and ν_{L_α,α} for various values of α and γ. We again see evidence that ν_{L_α,α} can be discontinuous at min(α, 1 − α). As in the other examples, we have not indicated ψ_{L_α,α}; yet it can easily be visualized as the largest convex minorant of ν_{L_α,α}. In many cases, ν_{L_α,α} is actually convex and hence equals ψ_{L_α,α}. The same comment applies to the hinge and exponential examples.
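Assuming φ(t) = (1 − t)² (the usual squared-error margin function), the conditional risk C_L(η, t) = η(1 − t)² + (1 − η)β(1 + γt)² has the closed-form minimizer t*(η) = (η − (1 − η)βγ)/(η + (1 − η)βγ²), which the following sketch checks against a grid search:

```python
import numpy as np

def sq_cond_risk(eta, t, beta, gamma):
    # conditional risk of the uneven squared error loss, assuming phi(t) = (1 - t)^2
    return eta * (1 - t) ** 2 + (1 - eta) * beta * (1 + gamma * t) ** 2

def t_star(eta, beta, gamma):
    # closed-form minimizer from setting the t-derivative of C_L to zero
    return (eta - (1 - eta) * beta * gamma) / (eta + (1 - eta) * beta * gamma ** 2)

ts = np.linspace(-3, 3, 60001)  # grid search as an independent check
for eta in (0.1, 0.5, 0.9):
    for beta, gamma in ((1.0, 1.0), (0.5, 2.0), (2.0, 0.5)):
        t_grid = ts[np.argmin(sq_cond_risk(eta, ts, beta, gamma))]
        assert abs(t_grid - t_star(eta, beta, gamma)) < 1e-3
```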

Uneven exponential loss
Now let φ(t) = e^{−t} and consider the associated uneven margin loss L(y, t) = 1_{y=1} e^{−t} + 1_{y=−1} β e^{γt}.

Then C_L(η, t) = η e^{−t} + (1 − η)β e^{γt} is minimized by t*(η) = (1/(1 + γ)) log(η / ((1 − η)βγ)).
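The stated minimizer can be checked numerically; the sketch below assumes C_L(η, t) = η e^{−t} + (1 − η)β e^{γt} and the closed form t*(η) = log(η/((1 − η)βγ))/(1 + γ):

```python
import math
import numpy as np

def exp_cond_risk(eta, t, beta, gamma):
    # conditional risk of the uneven exponential loss
    return eta * np.exp(-t) + (1 - eta) * beta * np.exp(gamma * t)

def t_star(eta, beta, gamma):
    # closed-form minimizer from the first-order condition
    return math.log(eta / ((1 - eta) * beta * gamma)) / (1 + gamma)

ts = np.linspace(-8, 8, 160001)  # grid search as an independent check
for eta in (0.2, 0.5, 0.8):
    for beta, gamma in ((1.0, 1.0), (0.5, 2.0), (2.0, 0.5)):
        t_grid = ts[np.argmin(exp_cond_risk(eta, ts, beta, gamma))]
        assert abs(t_grid - t_star(eta, beta, gamma)) < 1e-3
```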
Uneven sigmoid loss

Now let φ be a sigmoid function; the precise form is given in Appendix C. Since φ is not convex, we cannot conclude from Corollary 4.1 that L is CC. In fact, we will show that L is α-CC for α = (3 + 4√2)/23 ≈ 0.37639. Figure 5 shows C_L(η, t) as a function of t, for six different values of η. These graphs are useful in understanding C^−_{L,α}(η) and C*_L(η). When η < 1/2, it can be shown that C_L(η, t) has a single local minimum and a single local maximum. When η ≥ 1/2, on the other hand, C_L(η, t) is strictly decreasing. Let t^−(η) denote the local minimizer when η < 1/2. This function can be expressed in closed form. See Appendix C for these and other details.

Relation to proper losses
We briefly mention a relationship between uneven margin losses and proper losses for class probability estimation. Proper losses, and their relationship to calibrated losses, have recently been studied by Reid and Williamson [24] and Masnadi-Shirazi and Vasconcelos [17]. We begin by introducing these concepts.
A loss function for binary classification can be converted to a loss function for class probability estimation through a link function, which is an invertible function ψ : [0, 1] → R. If L is a loss for binary classification, then ℓ(y, η̂) := L(y, ψ(η̂)) is a loss for class probability estimation. Reid and Williamson [24] refer to such losses as composite binary losses, and give a necessary condition on the link function for ℓ in (4.2) to be proper. Specifically, their Corollary 12 states that if the partial losses L_1 and L_{−1} of L are differentiable, then the link must satisfy

ψ^{−1}(t) = L′_{−1}(t) / (L′_{−1}(t) − L′_1(t))    (4.3)

for ℓ to be proper. The significance of this result is that it justifies the following approach to class probability estimation: use a loss L to learn a classifier (which is a well-studied problem with many efficient algorithms), and then map the resulting decision function f to a class probability estimator via the relation η̂(x) = ψ^{−1}(f(x)). This result can be applied to uneven margin losses provided φ is differentiable (thus the uneven hinge loss is excluded). For example, for the uneven exponential loss with cost parameter α ∈ (0, 1) and uneven margin parameter γ > 0, the right-hand side of (4.3) can be computed and inverted in closed form. For the uneven sigmoid loss, the right-hand side of (4.3) is not invertible, and therefore the uneven sigmoid loss cannot give rise to a proper loss. It would be interesting to investigate whether uneven margin losses offer any advantages for the estimation of class probabilities.
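For differentiable partial losses, the first-order condition for the conditional-risk minimizer gives η = L′_{−1}(t)/(L′_{−1}(t) − L′_1(t)); the sketch below checks this inversion for the (unweighted) uneven exponential loss with partial losses L_1(t) = e^{−t} and L_{−1}(t) = β e^{γt} (assumed forms, with hypothetical helper names):

```python
import math

def t_star(eta, beta, gamma):
    # conditional-risk minimizer for the uneven exponential loss
    return math.log(eta / ((1 - eta) * beta * gamma)) / (1 + gamma)

def prob_from_score(t, beta, gamma):
    # eta = L'_{-1}(t) / (L'_{-1}(t) - L'_1(t)) with
    # L_1(t) = e^{-t} and L_{-1}(t) = beta * e^{gamma * t}
    d1 = -math.exp(-t)                         # L'_1(t)
    dm1 = beta * gamma * math.exp(gamma * t)   # L'_{-1}(t)
    return dm1 / (dm1 - d1)

for beta, gamma in ((1.0, 1.0), (0.5, 2.0), (2.0, 0.5)):
    for eta in (0.1, 0.3, 0.5, 0.7, 0.9):
        # mapping the optimal score back through the link recovers eta
        assert abs(prob_from_score(t_star(eta, beta, gamma), beta, gamma) - eta) < 1e-9
```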

Discussion
The results of Bartlett, Jordan and McAuliffe [2] concerning surrogate regret bounds and classification calibration are generalized to label-dependent misclassification costs and arbitrary losses. Some differences that emerge in this more general framework are that H_{L,α}(η) is in general not symmetric about η = 1/2, and ν_{L,α}(ε) is potentially discontinuous at ε = min(α, 1 − α).
The class of uneven margin losses are examined in some detail. We hope these results provide guidance to future work with such losses, as our theory explains how to ensure α-classification calibration for any margin asymmetry parameter γ > 0. For example, Adaboost is often applied to heavily imbalanced data sets where misclassification costs are label-dependent, such as in cascades for face detection [38]. It should be possible to generalize Adaboost to have an uneven margin (to accommodate imbalanced data) while being α-classification calibrated for any α ∈ (0, 1). In particular, the uneven exponential loss from Sec. 4.3 can be optimized by the functional gradient descent approach. In fact, Masnadi-Shirazi and Vasconcelos [16] developed such an algorithm for the special case γ = α/(1 − α), but did not identify the generalization to arbitrary γ.
Our theory also sheds light on the support vector machine with uneven margin. Yang, Yang and Wang [42] describe an implementation of this algorithm, but they allow for both β and γ to be free parameters. Our Corollary 4.1 constrains β = 1/γ for classification calibration, which eliminates a tuning parameter.

In closing, we mention two additional directions for future work. First, an interesting problem related to uneven margin losses is that of surrogate tuning, which in this case is the problem of tuning the parameter γ to a particular data set. Nock and Nielsen [21] have recently described a data-driven approach to surrogate tuning of classification-calibrated (α = 1 2 ) losses. Second, our regret bounds should be applicable to proving cost-sensitive consistency and rates of convergence for specific algorithms based on surrogate losses.

Acknowledgements
The author would like to thank the anonymous reviewers for their feedback.

Appendix A: The calibration function perspective
In this appendix we present an alternative, though ultimately equivalent, approach to excess risk bounds for asymmetric binary classification problems. Additional properties of α-CC losses are derived, and connections to Steinwart [31] are established. We begin with an alternate definition of α-classification calibrated.
Lemma A.1. Let α ∈ (0, 1) and let L be a loss.

Proof. To prove part 1, let ε > 0 and η ∈ [0, 1], and recall the characterization established in Lemma 3.1.

Steinwart [31] employs α-CC′ as the definition of classification calibrated in the case of cost-sensitive classification. Although α-CC implies α-CC′, the reverse implication is not true, as the counterexample L = U_α demonstrates (perhaps ironically). Under a mild assumption on the partial losses, Steinwart's definitions and ours agree. This is part 1 of the following result. Under this same mild assumption, we can also express what Steinwart calls the calibration function and uniform calibration function of L. These are the quantities δ(ε, η) and δ(ε) in parts 2 and 3, respectively.
Theorem A.1. Assume L 1 and L −1 are continuous at 0.
An emphasis of Steinwart [31] is the relationship between surrogate regret bounds and uniform calibration functions. In our setting, Lemma A.1 part 2 directly implies a surrogate regret bound in terms of μ_{L,α}.
Theorem A.2. Let L be a loss and α ∈ (0, 1). Then, for all f ∈ F, μ_{L,α}(R_α(f) − R*_α) ≤ R_L(f) − R*_L. This result is similar to Theorem 2.13 of Steinwart [31] and the surrounding discussion. While that result holds in a very general setting that spans many learning problems, Theorem A.2 specializes the underlying principle to cost-sensitive classification.

Thus, for any loss we have two surrogate regret bounds. In fact, the two bounds are the same.
The following result generalizes Lemma A.7 of Steinwart [31], and completes the proof of Theorem A.3.

Appendix B: Proofs of Lemmas 3.1 and 3.2
These lemmas support the development in Section 3.1.