Linear scoring rules for probabilistic binary classification

Abstract: Probabilistic binary classification typically calls for a vector of marginal probabilities, where each element gives the probability of assigning the corresponding case to class 1. Scoring rules are principled ways to assess probabilistic forecasts about any outcome that is subsequently observed. We develop a class of proper scoring rules called linear scoring rules that are specifically adapted to probabilistic binary classification. When applied in competition situations, we show that all linear scoring rules essentially balance the needs of organizers and competitors. Linear scoring rules can also be used to train classifiers. Finally, since scoring rules have a statistical decision theoretic foundation, a linear scoring rule can be constructed for any user-defined misclassification loss function.


Introduction
Classification challenges have become an exciting and useful feature of the statistical and machine learning community. Given a labelled training dataset, contestants are invited to submit their classifications for a test dataset. In order to make the challenge more interesting, challenge organizers typically publish a ranked list of the leading submissions and, ultimately, announce the winner of the challenge. However, in order for such a competition to be considered worth entering, the challenge organizers must be seen to evaluate the submissions in a fair and open manner.
Perhaps the most common classification challenge involves probabilistic binary classification. Suppose there are n test cases and that y = (y_i ∈ {0, 1} | i = 1, . . . , n) is the vector of labels known only to the challenge organizers. Contestants are asked to submit a vector of probabilities ω = (ω_i) with the interpretation that ω_i = P(Y_i = 1). Note that the contestant is being asked for marginal probabilities only. How can such probabilistic classifications be evaluated?
Scoring rules were devised precisely to answer this kind of question. Scoring rules are a principled way to assess probabilistic forecasts about any outcome that is subsequently observed. Crucially, proper scoring rules elicit honest statements of belief about the outcome. In the context of the probabilistic classification challenge, if the challenge organizers use a proper scoring rule to evaluate submissions, a competitor's expected score under their true belief about the class labels will be minimized by actually quoting that belief to the organizers. A proper scoring rule therefore rules out any possibility of a competitor gaming the challenge.
Scoring rules have long been applied to forecasts of binary outcomes. Indeed, in one of the first papers on the subject, Brier (1950) explicitly considered the case of a sequence of weather forecasts for rain or no rain. Almost all discussion, however, has centered on sequential or online evaluation of forecasters. Here our focus is on batch evaluation. Some of our results are anticipated in the technical report by Buja, Stuetzle and Shen (2005), though they implicitly assume additive scoring rules; see section 2.3. Banerjee, Guo and Wang (2005) considered loss functions that are minimized in expectation by quoting the expected value of the outcome in question. When restricted to binary outcomes, their loss functions can be recast as scoring rules that are essentially given by eq. (3). The excellent review article by Gneiting and Raftery (2007) includes discussion of scoring rules for categorical outcomes; these are superficially similar to the scoring rules introduced here, but the quoted probabilities are constrained to lie on the simplex. Recently, Byrne (2016) has written about area-under-the-curve (AUC) measures for probabilistic forecasting. In his elegant formulation of the problem, when only marginal probabilities are quoted, the concept of a scoring function is invoked, as opposed to a scoring rule. Finally, Frongillo and Kash (2015) have considered the general problem of devising proper scoring rules to elicit vector-valued properties of a distribution. In their terminology, a property is linear if it is a linear function on the space of distributions. The linear scoring rules in this paper can be understood as eliciting the vector-valued linear property that is the marginal probabilities.
In section 2, we introduce the class of linear scoring rules and contrast them with more general but more complicated scoring rules. We find three useful sub-classes of linear scoring rules: additive, homogeneous and rank-based. We also make a connection between a particular rank-based linear scoring rule and the AUC measure. Section 3 is an aside on using linear scoring rules to train probabilistic classifiers. In section 4, we show how linear scoring rules fit within statistical decision theory. We are able to show that there is a linear scoring rule which accounts correctly for any user-defined misclassification loss function. Finally, in section 5, we show that all linear scoring rules essentially achieve the same balance between the organizers' need for discriminative power and the competitors' wish not to be penalized unduly by outliers.

The class of linear scoring rules
To fix notation, let y ∈ Y := {0, 1}^n be an observed outcome of class labels and let ω ∈ P := [0, 1]^n be a vector of probabilities. We will refer to these probabilities as marginal probabilities to emphasize the fact that ω is not a joint probability from P_Y, the class of all distributions on Y. Note that the restriction to P is not done for convenience but rather to fit in with the framework of the classification challenge: competitors are asked to quote marginal probabilities, not a joint distribution. We will be interested in scoring rules S : Y × P → R ∪ {∞} and will say S(y, ω) is the score for quoting ω and observing y.
The fact that P is convex is crucial to what follows. For p ∈ P_Y, let Mp denote the product of its marginal probabilities. More precisely,

(Mp)(y) := Π_{i=1}^n π_i^{y_i} (1 − π_i)^{1−y_i}, where π_i := P_{Y∼p}(Y_i = 1).

Then, as is well known, MP_Y is not convex: for p, q ∈ P_Y and λ ∈ (0, 1), typically there is no p(λ) ∈ P_Y such that Mp(λ) = (1 − λ)Mp + λMq. For this reason, we do not look to define scoring rules on Y × MP_Y. However, this does mean a competitor's quote need not be derived from a joint distribution.
We overload the notation for scoring rules by defining the expected score

S(π, ω) := E_{Y∼π}[S(Y, ω)],   (1)

for each π ∈ P, where Y ∼ π is shorthand for Y_i ind∼ Bern(π_i), i = 1, . . . , n. So defined, S(π, ω) is affine in its first argument. A scoring rule is said to be proper in P if S(π, ω) ≥ S(π, π), for all π, ω ∈ P. A scoring rule is said to be strictly proper if equality holds only when ω = π. Note that the scoring rules we discuss in this paper remain proper in P_Y, though not strictly proper.
As indicated previously, a proper scoring rule will elicit an honest statement of a competitor's belief. To see this, suppose π represents the competitor's actual belief about the class labels. Then S(π, ω) will be their expected score under their actual belief if they quote ω. But, if the scoring rule is proper, their expected score cannot be less than S(π, π), hence they should quote π.
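As a quick numerical illustration of propriety, the following sketch (in Python; the function names are ours, chosen for illustration) checks that, under the logarithmic scoring rule, no quote achieves a lower expected score than the honest one:

```python
import numpy as np

def log_score(y, w):
    """Logarithmic scoring rule: negatively oriented, so smaller is better."""
    return -np.sum(y * np.log(w) + (1 - y) * np.log(1 - w))

def expected_log_score(pi, w):
    """S(pi, w) = E[S(Y, w)] under Y_i ~ Bern(pi_i), independently.

    Because the score is linear in y, the expectation simply replaces y by pi."""
    return -np.sum(pi * np.log(w) + (1 - pi) * np.log(1 - w))

pi = np.array([0.2, 0.7, 0.9])          # the competitor's actual belief
honest = expected_log_score(pi, pi)

rng = np.random.default_rng(0)
for _ in range(1000):                   # any other quote does no better
    other = rng.uniform(0.01, 0.99, size=3)
    assert expected_log_score(pi, other) >= honest
```

The same check works for any proper scoring rule; only the two score functions need to change.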

Linear scoring rules
A great deal is known about how to generate proper scoring rules (McCarthy, 1956; Hendrickson and Buehler, 1971; Gneiting and Raftery, 2007). For our situation, Theorem 1 of Gneiting and Raftery (2007) ensures that, under mild regularity conditions and for convex P, S(·, ω) will be a (strictly) proper scoring rule iff there exists a (strictly) concave function H(ω) such that

S(y, ω) = H(ω) + (y − ω) · H*(ω),   (2)

where H*(ω) is a supergradient of H at ω, i.e. H(π) ≤ H(ω) + (π − ω) · H*(ω) for all π ∈ P. When H(ω) is differentiable, the expression for the scoring rule simplifies to

S(y, ω) = H(ω) + (y − ω) · ∇H(ω),   (3)

where ∇H(ω) := (∂H/∂ω_i).
We call a scoring rule that is derived from eq. (2) a linear scoring rule. This is motivated by the fact that such a scoring rule is a linear function of the class labels y. For convenience, we will always assume H(ω) is concave, so that a linear scoring rule is also necessarily proper. Eq. (3) is also anticipated in Banerjee, Guo and Wang (2005), who found the necessary form of loss functions L(y, ω) that are minimized in expectation by predicting ω = E[Y] for the outcome y. When y is restricted to binary outcomes, their loss functions are our linear scoring rules, since π = E_{Y∼p} Y.
A useful consequence of linearity is that even though in truth Y ∼ p ∈ P_Y, still S(p, ω) := E_{Y∼p}[S(Y, ω)] = S(π, ω), where π is the resulting vector of marginal probabilities for the class labels.
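The sketch below illustrates this point numerically for the Brier scoring rule: the expected score under a generic, correlated joint distribution p agrees with the value computed from the marginals alone (all names are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
# all outcomes in {0,1}^3, indexed by the bits of k
outcomes = np.array([[(k >> i) & 1 for i in range(n)] for k in range(2 ** n)])

def brier(y, w):
    return 0.5 * np.sum((y - w) ** 2)

p = rng.random(2 ** n)
p /= p.sum()                  # a generic (correlated) joint distribution on Y
pi = p @ outcomes             # its vector of marginal probabilities
w = rng.random(n)             # an arbitrary quote

lhs = sum(pk * brier(yk, w) for pk, yk in zip(p, outcomes))
# because y_i^2 = y_i on {0,1}, E[brier] depends on p only through pi:
rhs = 0.5 * np.sum(pi - 2 * pi * w + w ** 2)
assert np.isclose(lhs, rhs)
```

Note that rhs is not 0.5*||pi - w||^2: linearity holds in y, and y_i^2 = y_i only for binary labels.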

Connection to other scoring rules
It is important to realize that linear scoring rules do not exhaust the forms of scoring rules that can be applied to probabilistic binary classification. Indeed, the obvious approach is the indirect one: take existing proper scoring rules S(y, q) on Y × P_Y, and then restrict consideration to probability distributions that are products of the quoted marginals, q(y) = Π_i ω_i^{y_i}(1 − ω_i)^{1−y_i}. However, apart from the logarithmic scoring rule, S(y, q) = −log q(y), the resulting scoring rules are rather unwieldy.
Consider, for example, the Brier scoring rule, which for q ∈ P_Y takes the form

S(y, q) = −q(y) + (1/2) Σ_{y′∈Y} q(y′)².

Restricted to product distributions, this becomes

S(y, ω) = −Π_i ω_i^{y_i}(1 − ω_i)^{1−y_i} + (1/2) Π_i (ω_i² + (1 − ω_i)²),

which is not linear in y. Thus linear scoring rules have the appeal of simplicity and tractability.
Having said that, linear scoring rules have a slightly reduced flexibility under certain additive transformations. Typically, if S(y, q) is a scoring rule then so is S(y, q) + k(y). For linear scoring rules, however, k(y) must take the form k(y) = k · y + c.

Additive sub-class
We call a scoring rule additive if it is generated by an entropy function of the form

H(ω) = Σ_{i=1}^n h_i(ω_i),

where each h_i(·) is concave. Note that Frongillo and Kash (2015) refer to additivity as separability.
In most applications, we expect that h_i(s) = h(s), for each i. Common examples include h(s) = −s log s − (1 − s) log(1 − s), which leads to the logarithmic scoring rule, and h(s) = (1/2) s(1 − s), which leads to the linear class version of the Brier scoring rule, S(y, ω) = (1/2)‖y − ω‖².
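Both examples can be checked by building the additive rule directly from eq. (3); the helper names below are ours, for illustration:

```python
import numpy as np

def additive_score(y, w, h, dh):
    """Eq. (3) with H(w) = sum_i h(w_i): S(y, w) = sum_i [h(w_i) + (y_i - w_i) h'(w_i)]."""
    return np.sum(h(w) + (y - w) * dh(w))

# Shannon entropy -> logarithmic scoring rule
h_log = lambda s: -s * np.log(s) - (1 - s) * np.log(1 - s)
dh_log = lambda s: np.log((1 - s) / s)

# h(s) = s(1-s)/2 -> Brier scoring rule, S(y, w) = ||y - w||^2 / 2
h_br = lambda s: 0.5 * s * (1 - s)
dh_br = lambda s: 0.5 * (1 - 2 * s)

y = np.array([1.0, 0.0, 1.0])
w = np.array([0.8, 0.3, 0.6])
assert np.isclose(additive_score(y, w, h_log, dh_log),
                  -np.sum(y * np.log(w) + (1 - y) * np.log(1 - w)))
assert np.isclose(additive_score(y, w, h_br, dh_br),
                  0.5 * np.sum((y - w) ** 2))
```

Any other concave h, supplied with its derivative, yields a proper additive rule in the same way.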
Additive scoring rules also have a "local" property: the score for test case i depends on ω i but not on the quoted probability for any other case. Note that this is a different type of locality to that of local scoring rules (Parry, Dawid and Lauritzen, 2012); there locality refers to the (relative) lack of dependence on the quoted probability for unrealized outcomes.
An interesting twist on the usual additive scores comes from considering h_i(s) = w_i h(s), where the w_i are weights satisfying w_i > 0 and Σ_i w_i = 1. The ensuing scoring rule then weights the test cases differently. While this is a proper scoring rule, if competitors are to make use of the weighting scheme, they should also be given the weighting scheme for the training data.

Homogeneous sub-class
Recall that a function f(s) is said to be homogeneous of order k, or k-homogeneous, if f(λs) = λ^k f(s), for λ > 0. We call a scoring rule homogeneous if it is generated by an entropy function that is 1-homogeneous (up to an irrelevant additive constant). When H(ω) is also differentiable, 1-homogeneity implies H(ω) = ω · ∇H(ω), and the associated scoring rule takes the very simple form

S(y, ω) = y · ∇H(ω),

and is 0-homogeneous. Pseudospherical scoring rules are examples of homogeneous scoring rules. They arise from the fact that the L_α-norm, ‖ω‖_α := (Σ_i ω_i^α)^{1/α}, is 1-homogeneous and convex for α ≥ 1, so that H(ω) = −‖ω‖_α is a 1-homogeneous concave entropy. In the limit α → ∞, i.e. the L_∞-norm, we have the zero-one scoring rule

S(y, ω) = − (1/#M(ω)) Σ_{j∈M(ω)} y_j,

where M(ω) = {j | ω_j = max{ω}} and #A denotes the cardinality of A. In slightly different contexts, this is sometimes referred to as the misclassification loss.
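A small sketch (function names ours) confirms that the pseudospherical rule generated by H(ω) = −‖ω‖_α approaches the zero-one rule as α grows:

```python
import numpy as np

def pseudospherical(y, w, alpha):
    """Homogeneous rule from H(w) = -||w||_alpha: S(y, w) = y . grad H(w)."""
    norm = np.sum(w ** alpha) ** (1.0 / alpha)
    return -np.sum(y * (w / norm) ** (alpha - 1))

def zero_one(y, w):
    """The alpha -> infinity limit: minus the fraction of argmax cases in class 1."""
    M = np.isclose(w, w.max())                 # the set of maximizers
    return -np.sum(y[M]) / np.sum(M)

y = np.array([1, 0, 1])
w = np.array([0.9, 0.2, 0.5])
assert abs(pseudospherical(y, w, 200.0) - zero_one(y, w)) < 1e-6
```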
The only scoring rule that is both additive and homogeneous is the trivial scoring rule, S(y, ω) = k · y.

Rank-based sub-class
A scoring rule is said to be rank-based if it depends only on the ranks of the quoted probabilities ω. As a consequence, a rank-based scoring rule cannot be strictly proper. A rank-based scoring rule is also a homogeneous scoring rule.
Here we give only an important example of a rank-based scoring rule. Let

ψ_i(ω) := #{j | ω_j < ω_i} − #{j | ω_j > ω_i},   (12)

so that ψ_i(ω) is the net number of elements of ω that are exceeded by ω_i. Then H(ω) = −ω · ψ(ω) is both 1-homogeneous and concave, where ψ(ω) := (ψ_i(ω)). Since H is 1-homogeneous, with ∇H(ω) = −ψ(ω) away from ties, the associated scoring rule is S(y, ω) = −y · ψ(ω). One-homogeneity is immediate since ψ(ω) is 0-homogeneous. To show concavity, first note that because H(ω) is a collection of planar surfaces essentially indexed by the rank sets of ω, it suffices to consider what happens on either side of ω_i = ω_k, for an arbitrary pair (i, k). Without loss of generality, fix k and let M_k(ω) = {j | ω_j = ω_k}. Letting superscript 0 indicate the value of a quantity when ω_i = ω_k, and superscript ± indicate the value of a quantity when ω_i = ω_k ± ε for small ε > 0, a direct count gives

H^+ − H^0 = −ε(ψ_i^0 + m)  and  H^− − H^0 = ε(ψ_i^0 − m),

where m := #(M_k(ω) \ {i}). The one-sided slopes ∂H/∂ω_i are therefore −(ψ_i^0 − m) to the left of ω_i = ω_k and −(ψ_i^0 + m) to the right, so the slope drops by 2m ≥ 0 as ω_i crosses ω_k, which establishes concavity.

Continuity of H(ω) follows since H^± → H^0 as ε → 0.
Byrne (2016) has shown that this is related to the Wilcoxon-Mann-Whitney U-statistic and to the area-under-the-curve (AUC) measure, which is very commonly used in classification challenges. Specifically, if we define n_1 = Σ_i y_i to be the number of test cases of class 1, then

AUC(y, ω) = (1/2) [1 − (−y · ψ(ω)) / (n_1(n − n_1))],   n_1 ≠ 0, n,   (13)

where we follow Byrne (2016) and define the AUC to be 1/2 when n_1 = 0, n. Although the AUC appears to be a scaled, positively oriented scoring rule, in general it is not, since the factor 1/(n_1(n − n_1)) itself depends on y. However, if n_1 is known beforehand (sometimes this information is provided to challenge contestants), then the AUC is in the class of linear scoring rules. Byrne (2016) also shows that the AUC is a proper scoring rule in the (unrealistic) case that Y ∼ ω ∈ P.
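The identity in eq. (13) is easy to verify numerically (assuming no ties among the quoted probabilities); the sketch below, with function names of our own, compares it with the usual pairwise Mann-Whitney computation of the AUC:

```python
import numpy as np

def psi(w):
    """psi_i = #(w_j < w_i) - #(w_j > w_i): the 'net rank' of w_i."""
    return np.array([np.sum(w < wi) - np.sum(w > wi) for wi in w])

def auc_pairwise(y, w):
    """Standard AUC as Mann-Whitney U / (n1 * n0), ties counted 1/2."""
    pos, neg = w[y == 1], w[y == 0]
    u = sum(np.sum(p > neg) + 0.5 * np.sum(p == neg) for p in pos)
    return u / (len(pos) * len(neg))

rng = np.random.default_rng(1)
y = np.array([1, 0, 1, 1, 0, 0, 1])
w = rng.random(7)                         # continuous, so no ties
n, n1 = len(y), int(y.sum())
auc_from_score = 0.5 * (1 - (-y @ psi(w)) / (n1 * (n - n1)))
assert np.isclose(auc_from_score, auc_pairwise(y, w))
```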
The obvious connection between eq. (12) and eq. (13), however, enables us to give a simple geometric picture of the rank-based scoring rule introduced here. Figure 1 is a plot of false positive counts (FP) vs. true positive counts (TP) for all thresholds between 0 and 1. Then S(y, ω) = −2 × (area above the diagonal).

Training with linear scoring rules
Given a set of features or predictors x ∈ X ⊆ R^p, a rather general approach to probabilistic classification is to let

ω_i = F(θ · x_i),

where F : (−∞, ∞) → [0, 1] is a cumulative distribution function and θ ∈ R^p is a parameter vector to be estimated from the training data. This framework includes logistic and probit regression as special cases. We now show that there is a natural additive scoring rule associated with each continuous cdf F(·).

Lemma 1. If Q(·) is the quantile function associated with the cdf F(·), then

h(s) = −∫_{1/2}^{s} Q(u) du

is concave and generates an additive scoring rule.

Using this in eq. (3) and after integrating by parts and a change of variables, we obtain the scoring rule

S(y, ω) = Σ_{i=1}^n ∫_0^{Q(ω_i)} (F(t) − y_i) dt.   (15)

Note that in the case of logistic regression, this scoring rule is exactly the log score (up to an irrelevant additive constant).
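As a numerical sanity check, the sketch below evaluates the integral form of eq. (15) for the logistic cdf and compares it with the log score; the quadrature scheme and function names are ours, for illustration:

```python
import numpy as np

def cdf_score(y, w, F, Q, grid=20001):
    """Sum over cases of the integral of (F(t) - y_i) from 0 to Q(w_i)."""
    total = 0.0
    for yi, wi in zip(y, w):
        t = np.linspace(0.0, Q(wi), grid)
        f = F(t) - yi
        total += np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t))  # trapezoid rule
    return total

F = lambda t: 1.0 / (1.0 + np.exp(-t))   # logistic cdf
Q = lambda s: np.log(s / (1.0 - s))      # its quantile function (logit)

y = np.array([1.0, 0.0])
w = np.array([0.8, 0.3])
log_score = -np.sum(y * np.log(w) + (1 - y) * np.log(1 - w))
# the additive constant n*log(2) comes from the choice of integration limits
assert abs(cdf_score(y, w, F, Q) - (log_score - len(y) * np.log(2.0))) < 1e-4
```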
The perceptron scoring rule is an interesting example connected to the perceptron neural network that arises as a limiting case of eq. (15). Letting F(ζ) = 1{ζ > 0}, then

S(y, ω) = Σ_{i=1}^n [(1 − y_i) max(ζ_i, 0) − y_i min(ζ_i, 0)],

where ζ_i = θ · x_i.

Estimating equations
The system of (unbiased) estimating equations (Dawid and Lauritzen, 2005) for θ under the scoring rule of eq. (15) is

Σ_{i=1}^n (y_i − ω_i) x_iα = 0,   α = 1, . . . , p,

where ω_i = F(θ · x_i) and x_iα denotes the α-component of feature vector x_i. The simple form of these equations has useful consequences for back propagation in neural net-type applications.
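A minimal sketch (the simulated data, step size and names are ours, for illustration only) solves these estimating equations for a small logistic model by plain gradient steps:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
theta_true = np.array([-0.5, 1.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(2)
for _ in range(5000):
    w = 1 / (1 + np.exp(-X @ theta))    # current fitted probabilities
    theta += 0.5 * X.T @ (y - w) / n    # step along the estimating function

# the estimating equations sum_i (y_i - w_i) x_i = 0 are (nearly) solved
residual = X.T @ (y - 1 / (1 + np.exp(-X @ theta)))
assert np.max(np.abs(residual)) < 1e-4
```

For the logistic cdf this is ordinary maximum likelihood; swapping in another F changes only the line computing w.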

Deterministic classification and connection to decision theory
In some challenges, the organizers require definite class predictions and will rank the competitors in terms of a loss function L(y, y′), where y′ = (y′_i) and y′_i is the predicted class for test case i. The question is then how to turn a probabilistic classification into a deterministic classification. The obvious approach is by thresholding:

y′_i = 1{ω_i > s},

for some threshold s ∈ [0, 1]. Another approach is to suppose the probabilities ω are the basis for the randomized classification Y′ ∼ ω. (Recall this is shorthand for Y′_i ind∼ Bern(ω_i), i = 1, . . . , n.) We now show that neither approach corresponds to a proper scoring rule but that there is, nevertheless, a linear scoring rule naturally associated with the loss function L(y, y′).
Following Hand (2009), let c_ℓ ∈ [0, ∞] denote the cost of misclassifying an object that is in class ℓ ∈ {0, 1}. If we assume that there is no cost in correctly assigning an object to its class and that the loss is additive in the cases, then

L(y, y′) = c_0 y′ · (1 − y) + c_1 (1 − y′) · y,

where 1 = (1, . . . , 1).

Thresholding
Under thresholding, the implied scoring rule is

S(y, ω) = L(y, 1{ω > s}) = c_0 1{ω > s} · (1 − y) + c_1 (1 − 1{ω > s}) · y.

The associated entropy is therefore

H(ω) = c_0 1{ω > s} · (1 − ω) + c_1 (1 − 1{ω > s}) · ω.

We now show that for s ≠ c_0/(c_0 + c_1), the entropy is not a continuous function of ω, and hence cannot be a generator of a proper scoring rule. Choose i and compare the left and right limits as ω_i → s, with ω otherwise fixed. Then H|^+_− = c_0(1 − s) − c_1 s ≠ 0. The case s = c_0/(c_0 + c_1) is a special case that we will return to shortly.

Random classification
Randomized classification implies the scoring rule

S(y, ω) = E_{Y′∼ω}[L(y, Y′)] = c_0 ω · (1 − y) + c_1 (1 − ω) · y.

We can see that this is not a proper scoring rule in two different ways. The more direct way is via the implied entropy:

H(ω) = S(ω, ω) = (c_0 + c_1) Σ_i ω_i(1 − ω_i),

and this actually generates a multiple of the Brier scoring rule and not S(y, ω) above. The more explicit way comes from noting that the divergence

d(π, ω) := S(π, ω) − S(π, π) = Σ_i (ω_i − π_i)(c_0(1 − π_i) − c_1 π_i)

can be negative. For if we have π_i ≠ 0, 1 for some i, then there exists ε = (ε_i) such that π ± ε are interior points of [0, 1]^n, and it follows that d(π, π + ε) and d(π, π − ε) will be of opposite sign. Grünwald and Dawid (2004) give a decision theoretic approach for turning any loss function into a proper scoring rule. The key is to consider optimal acts in light of the expected loss, where the expectation is over possible outcomes y. In this formulation, y′ is the act of classification. Again overloading the notation, the expected loss is L(π, y′) = c_0 y′ · (1 − π) + c_1 (1 − y′) · π. Then the Bayes act a_π against π is the choice a_π := 1{π > c_0/(c_0 + c_1)}, taken componentwise.

Proper scoring rule
The Bayes act satisfies L(π, y′) ≥ L(π, a_π) for all y′ ∈ Y. Following Grünwald and Dawid, we have that S(y, ω) := L(y, a_ω) is a proper scoring rule. Given the previous discussion, we immediately see that this corresponds to converting the probabilistic classifier into a deterministic classifier by choosing the particular threshold s = c_0/(c_0 + c_1). This is the only threshold that is appropriate. The entropy associated with the scoring rule is

H(ω) = Σ_{i=1}^n min{c_1 ω_i, c_0(1 − ω_i)},

which is continuous and concave since it is a sum of pointwise minima of affine functions.
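The construction can be sketched numerically: classify at the threshold s = c_0/(c_0 + c_1), charge the loss, and check propriety of the resulting rule against random alternative quotes (the costs and names below are ours, for illustration):

```python
import numpy as np

c0, c1 = 1.0, 4.0                  # cost of misclassifying class 0 / class 1
s = c0 / (c0 + c1)                 # the Bayes threshold

def cost_score(y, w):
    """Proper rule: classify at the Bayes threshold, then charge the loss L."""
    yhat = (w > s).astype(float)
    return np.sum(c0 * yhat * (1 - y) + c1 * (1 - yhat) * y)

def expected_score(pi, w):
    """Expected score under Y_i ~ Bern(pi_i); linear in y, so replace y by pi."""
    yhat = (w > s).astype(float)
    return np.sum(c0 * yhat * (1 - pi) + c1 * (1 - yhat) * pi)

rng = np.random.default_rng(3)
pi = rng.random(5)
honest = expected_score(pi, pi)    # equals the entropy sum_i min(c1 pi, c0 (1-pi))
for _ in range(1000):              # propriety: no quote beats honesty
    assert expected_score(pi, rng.random(5)) >= honest - 1e-12
```

Note the rule is proper but not strictly proper: any quote on the correct side of the threshold scores the same as the honest one.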

Discriminative ability and robustness: choosing the best scoring rule
Given the large number of linear scoring rules that could be used, it is natural to wonder whether there is an optimal scoring rule or a set of criteria for selecting an appropriate proper scoring rule. We argue in this section that all linear scoring rules are essentially on a par when it comes to balancing the requirements of the organizers and the competitors.
Let ω be a competitor's probabilistic classification and π = E_{Y∼p} Y the true marginal distribution resulting from p ∈ P_Y. Organizers will value discriminative power in the scoring rule, i.e. the ability to discriminate between classifications that are "close to" π. This will be achieved if (E_{Y∼p}[S(Y, ω) − S(Y, π)])² is large. On the other hand, contestants will not want their score to be sensitive to outliers or unusual cases, i.e. the scoring rule should have a degree of robustness. This will be achieved if var_{Y∼p}[S(Y, ω) − S(Y, π)] is small. These two desiderata can be combined by seeking to maximize

(E_{Y∼p}[S(Y, ω) − S(Y, π)])² / var_{Y∼p}[S(Y, ω) − S(Y, π)].

Importantly, this combination is invariant under a multiplicative rescaling of the scoring rule.
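The invariance claim can be checked exactly, since for a linear scoring rule the difference Δ(y) = S(y, ω) − S(y, π) is affine in y; the sketch below (assuming independent Bernoulli labels, with names of our own) computes the criterion for the Brier entropy and a rescaled copy:

```python
import numpy as np

def criterion(H, gradH, pi, w):
    """(E[Delta])^2 / var(Delta) for S(y, q) = H(q) + (y - q) . gradH(q).

    Delta(y) = S(y, w) - S(y, pi) = a + b . y is affine in y, so both moments
    are available in closed form under Y_i ~ Bern(pi_i), independently."""
    a = H(w) - w @ gradH(w) - (H(pi) - pi @ gradH(pi))
    b = gradH(w) - gradH(pi)
    mean = a + b @ pi                       # E[Delta]
    var = np.sum(b ** 2 * pi * (1 - pi))    # var[Delta]
    return mean ** 2 / var

H_br = lambda q: 0.5 * np.sum(q * (1 - q))  # Brier entropy
g_br = lambda q: 0.5 * (1 - 2 * q)
H7 = lambda q: 7.0 * H_br(q)                # the same rule, rescaled by 7
g7 = lambda q: 7.0 * g_br(q)

pi = np.array([0.3, 0.6, 0.8])
w = np.array([0.4, 0.5, 0.7])
assert np.isclose(criterion(H_br, g_br, pi, w), criterion(H7, g7, pi, w))
```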
We now argue that all linear scoring rules have similar discriminatory and robustness properties, at least for predictions close to the truth. Write ω = π + εη for a small perturbation in the direction η, and let Σ := cov_{Y∼p}(Y). Expanding eq. (3) to second order gives E_{Y∼p}[S(Y, ω) − S(Y, π)] ≈ −(1/2)ε² ηᵀ∇²H(π)η and var_{Y∼p}[S(Y, ω) − S(Y, π)] ≈ ε² (∇²H(π)η)ᵀΣ(∇²H(π)η). The usual method of Lagrange multipliers shows that the objective function achieves its worst case when η is an eigenvector of Σ∇²H(π), where it evaluates to (1/4)ε² ηᵀΣ⁻¹η. Thus the best worst-case scenario is controlled by the data-generating distribution alone, specifically the smallest eigenvalue of Σ⁻¹, and cannot be targeted by any linear scoring rule.

Discussion and future work
We have introduced linear scoring rules that can be used in binary classification challenges that call for a vector of class probabilities. We have illustrated important sub-classes of these scoring rules and have shown that they balance the needs of the organizers and the contestants. We have also shown how linear scoring rules can be used to train a classifier. An important question for future work is, given the scoring rule that will be used on the test cases, what is the optimal way to train the classifier?