On the Interpretability of Conditional Probability Estimates in the Agnostic Setting

We study the interpretability of conditional probability estimates for binary classification in the agnostic setting. In this setting, conditional probability estimates do not necessarily reflect the true conditional probabilities. Instead, they have a certain calibration property: among all data points for which the classifier has predicted P(Y = 1|X) = p, a fraction p actually have label Y = 1. For cost-sensitive decision problems, this calibration property provides adequate support for us to use Bayes Decision Theory. In this paper, we define a novel measure for the calibration property together with its empirical counterpart, and prove a uniform convergence result between them. This new measure enables us to formally justify the calibration property of conditional probability estimates, and provides new insights on the problem of estimating and calibrating conditional probabilities.


Introduction
Many binary classification algorithms, such as naive Bayes and logistic regression, naturally produce confidence measures in the form of conditional probabilities of labels. These confidence measures are usually interpreted as the conditional probability of the label y = 1 given the feature x. An important research question is how to justify these conditional probabilities, i.e., how to prove the trustworthiness of such results.
In classical statistics, this question is usually studied under the realizable assumption, which assumes that the true underlying probability distribution has the same parametric form as the model assumption. More explicitly, statisticians usually construct a parametric conditional distribution P(Y|X, θ), and assume that the true conditional distribution is also of this form (with unknown θ). The justification of conditional probabilities can then be achieved by using either hypothesis testing or confidence interval estimation on θ.
However, in modern data analysis workflows, the realizable assumption is often violated; e.g., data analysts usually try out several off-the-shelf classification algorithms to identify those that work best. This setting is often called agnostic, essentially implying that we do not have any knowledge about the underlying distribution. Under the agnostic setting, conditional probability estimates can no longer be justified by standard statistical tools, as most hypothesis testing methods are designed to distinguish two parameter areas in the hypothesis space (e.g., $\theta < \theta_0$ vs. $\theta \ge \theta_0$), and confidence intervals require the realizable assumption to be interpretable.
In this paper, we study the interpretability of conditional probabilities in binary classification in the agnostic setting: what kind of guarantees can we have without making any assumption on the underlying distribution? Justifying these conditional probabilities is important for applications that explicitly utilize the conditional probability estimates of the labels, including medical diagnostic systems (Cooper, 1984) and fraud detection (Fawcett and Provost, 1997). In such applications, the misclassification loss function is often asymmetric (i.e., false positives and false negatives incur different losses), and accurate conditional probability estimates are crucial empirically. In particular, in medical diagnostic systems, a false positive means additional tests are needed, while a false negative could potentially be fatal.
• X denotes the discrete feature space and Y = {±1} denotes the label space.
• P denotes the underlying distribution over X × Y that governs the generation of datasets.
• A fuzzy classifier is a function from X to [0, 1] where the output denotes the estimated conditional probability of P(Y = 1|X).

Interpretations of Conditional Probability Estimates
Ideally, we hope that our conditional probability estimates can be interpreted as the true conditional probabilities. This interpretation is justified if we can prove that the conditional probability estimates are close to the true values. Let $l_1(f, P)$ be the $l_1$ distance between the true distribution and the estimated distribution, as a measure of the "correctness" of conditional probability estimates:
$$l_1(f, P) = \mathbb{E}_{X}\left|f(X) - P(Y = 1|X)\right|$$
Here X is a random variable representing the feature vector of a sample data point, Y is the label of X, and f(X) is a fuzzy classifier that estimates P(Y = 1|X).
If we can prove that $l_1(f, P) \le \epsilon$ for some small $\epsilon$, then the output of f can be approximately interpreted as the true conditional probability.
Unfortunately, as we will show in this paper, it is impossible to guarantee any reasonably small upper bound for $l_1(f, P)$ under the agnostic assumption. Moreover, in the cases where we have to make the agnostic assumption, the estimated conditional probabilities are usually no longer close to the true values in practice.
Therefore, instead of trying to bound the $l_1$ distance, we develop an alternative interpretation for these conditional probability estimates. We introduce the following calibration definition for fuzzy classifiers:
Definition 1. Let X be the feature space, Y = {±1} be the label space and P be the distribution over X × Y. Let f : X → [0, 1] be a fuzzy classifier. We say f is calibrated if for any $p_1 < p_2$, we have:
$$\mathbb{E}_{X}\left[\mathbb{1}_{p_1 < f(X) \le p_2}\, f(X)\right] = P\left(p_1 < f(X) \le p_2,\ Y = 1\right)$$
Intuitively, a fuzzy classifier is calibrated if its output correctly reflects the relative frequency of labels among instances it believes to be similar. For instance, suppose the classifier outputs f(X) = p for n data points; then roughly np of these data points have label Y = 1. We also define a measure of how close f is to being calibrated, together with its empirical counterpart on a dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$:
$$c(f) = \sup_{p_1 < p_2} \left| P\left(p_1 < f(X) \le p_2,\ Y = 1\right) - \mathbb{E}_{X}\left[\mathbb{1}_{p_1 < f(X) \le p_2}\, f(X)\right] \right|$$
$$c_{emp}(f, D) = \sup_{p_1 < p_2} \left| \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{p_1 < f(x_i) \le p_2,\ y_i = 1} - \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{p_1 < f(x_i) \le p_2}\, f(x_i) \right|$$
Note that the empirical calibration measure $c_{emp}(f, D)$ can be efficiently computed on a finite dataset. We further prove that under certain conditions, $c_{emp}(f, D)$ converges uniformly to $c(f)$ over all functions f in a hypothesis class. Therefore, the calibration property of these classifiers can be demonstrated by showing that they are empirically calibrated on the training data.
The calibration definition is motivated by analyzing the properties of commonly used conditional probability estimation algorithms: many such algorithms will generate classifiers that are naturally calibrated.
Our calibration property justifies the common practice of using calibrated conditional probability estimates as true conditional probabilities: we show that if the fuzzy classifier is calibrated and the output of the classifier is the only source of information, then the optimal strategy is to apply Bayes Decision Rule on the conditional probability estimates.
The uniform convergence result between $c_{emp}(f, D)$ and $c(f)$ has several applications. First, it can be directly used to prove that a fuzzy classifier is (almost) calibrated, which is necessary for the conditional probability estimates to be interpretable. Second, it suggests that we need to minimize the empirical calibration measure to obtain calibrated classifiers, which is a new direction for designing conditional probability estimation algorithms. Finally, taking uncalibrated conditional probability estimates as input, we can calibrate them by minimizing the calibration measure. In fact, one of the most well-known calibration algorithms, the isotonic regression algorithm, can be interpreted this way.
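The claim that the empirical calibration measure can be computed efficiently can be illustrated with a minimal sketch. It assumes that the measure is the supremum over $p_1 < p_2$ of the absolute average of $\mathbb{1}_{p_1 < f(x_i) \le p_2}(\mathbb{1}_{y_i = 1} - f(x_i))$, which is how the appendix expresses it; the function name `empirical_calibration` and the use of NumPy are illustrative choices, not part of the paper. Because every interval $(p_1, p_2]$ selects a contiguous run of sorted scores, the supremum reduces to the range of a prefix sum.

```python
import numpy as np

def empirical_calibration(scores, labels):
    """Sketch of c_emp under the assumed form
    sup_{p1 < p2} | (1/n) * sum_i 1[p1 < s_i <= p2] * (1[y_i == 1] - s_i) |.
    `scores` are classifier outputs in [0, 1]; `labels` are +1 / -1.
    """
    s = np.asarray(scores, dtype=float)
    y = (np.asarray(labels) == 1).astype(float)
    order = np.argsort(s, kind="stable")
    s, y = s[order], y[order]
    # Running sum of (label indicator - score) over the sorted points.
    resid = np.cumsum(y - s)
    # An interval can only be cut between distinct score values,
    # so keep the running sum at the last index of each tie group.
    is_last = np.append(s[1:] != s[:-1], True)
    prefix = np.concatenate(([0.0], resid[is_last]))
    # Largest |prefix[j] - prefix[i]| over all i < j is max - min.
    return float((prefix.max() - prefix.min()) / len(s))
```

The cost is dominated by the sort, so this sketch runs in O(n log n) time. It expects a non-empty dataset.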

Paper Outline
The rest of this paper is organized as follows. In Section 2, we argue that the $l_1$ distance cannot be provably bounded under the agnostic assumption (Theorem 1) and then motivate our calibration definition.
In Section 3, we present the uniform convergence result (Theorem 2) and discuss the potential applications. In Section 4, we conduct experiments to illustrate the behavior of our calibration measure on several common classification algorithms.

Related Work
Our definition of calibration is similar to the definition of calibration in prediction theory (Foster and Vohra, 1998), where the goal is also to make predicted probability values match the relative frequency of correct predictions. In prediction theory, the problem is formulated from a game-theoretic point of view: the sequence generator is assumed to be malevolent, and the goal is to design algorithms to achieve this calibration guarantee no matter what strategy the sequence generator uses.
To the best of our knowledge, there is no other work addressing the interpretability of conditional probability estimates in agnostic cases. Our definition of calibration is also connected to the problem of calibrating conditional probability estimates, which has been studied in many papers (Zadrozny and Elkan, 2002; Platt, 1999).
Recall that the $l_1$ distance between f and P is defined as:
$$l_1(f, P) = \mathbb{E}_{X}\left|f(X) - P(Y = 1|X)\right|$$
Suppose f is the conditional probability estimator that we learned from the training dataset. We attempt to prove that the $l_1$ distance between f and P is small under the agnostic setting. In the agnostic setting, we do not know anything about P, and the only tool we can utilize is a validation dataset $D_{val}$ that consists of i.i.d. samples from P. Therefore, our best hope would be a prover $A_f(D)$ that:
• Returns 1 with high probability if $l_1(f, P)$ is small.
• Returns 0 with high probability if $l_1(f, P)$ is large.
The following theorem states that no such prover exists; the proof can be found in the appendix.
Theorem 1. Let Q be a probability distribution over X, and f : X → [0, 1] be a fuzzy classifier. Define $B_f$ as: If we have that $\forall x \in X$, $Q(x) < \frac{1}{10n^2}$, then there is no prover $A_f : \{X \times Y\}^n \to \{0, 1\}$ for f satisfying the following two conditions: For any P over X × Y such that $P_X = Q$ (i.e., $\forall x \in X$, $\sum_{y \in Y} P(x, y) = Q(x)$), suppose $D_{val} \in \{X \times Y\}^n$ is a validation dataset consisting of n i.i.d. samples from P:
We made the assumption in Theorem 1 to exclude the scenario where a significant amount of probability mass concentrates on a few data points, so that their corresponding conditional probabilities can be estimated via repeated sampling. Note that the statement is not true in the extreme case where all probability mass concentrates on one single data point (i.e., $\exists x \in X$, $Q(x) = 1$). The assumption is true when the feature space X is large enough that it is almost impossible for any data point to have enough probability mass to get sampled more than once in the training dataset.
The significance of Theorem 1 is that any attempt to guarantee a small upper bound on $l_1(f, P)$ is bound to fail. Thus, we can no longer interpret the conditional probability estimates as the true conditional probabilities under the agnostic setting. This result motivates us to develop a new measure of "correctness" to justify the conditional probability estimates.

l 1 (f, P) in practice
The fact that we cannot guarantee an upper bound on the $l_1$ distance is not merely a theoretical artifact. In fact, in the cases where we need to make the agnostic assumption, the value of $l_1(f, P)$ is often very large in practice. Here we use the following document categorization example to demonstrate this point.
Example 1. Denote by Z the collection of all English words. In this problem, the feature space $X = Z^*$ is the collection of all possible word sequences, and Y denotes whether the document belongs to a certain topic (say, football). Denote by P the following data generation process: X is generated from the Latent Dirichlet Allocation model (Blei et al., 2003), and Y is chosen randomly according to the topic mixture.
We use logistic regression, which is parameterized by a weight function w : Z → R, and two additional parameters a and b. For each document $X = z_1 z_2 \ldots z_k$, the output of the classifier is:
The reason that we are using automatically generated documents instead of true documents here is that the conditional probabilities P(Y|X) are directly computable (otherwise we cannot evaluate $l_1(f, P)$ and other measures). We conducted an experimental simulation for this example, and the experimental details can be found in the appendix. Here we summarize the major findings: the logistic regression classifier has very large $l_1$ error, which is probably due to the discrepancy between the logistic regression model and the underlying model. However, the logistic regression classifier is almost naturally calibrated in this example. This is not a coincidence, and we will discuss the corresponding intuition in Section 2.3.
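As an illustration of the classifier in Example 1, the following minimal sketch scores a document from per-word weights. It assumes the standard logistic form $f(X) = 1 / (1 + \exp(-(a \sum_i w(z_i) + b)))$; the sign convention, the dictionary representation of w, the zero weight for unseen words, and the name `document_score` are assumptions made for illustration only.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def document_score(words, w, a, b):
    """Estimated P(Y = 1 | X) for a document given as a list of words.

    w maps each word to a real weight (assumed 0.0 for unseen words);
    a and b are the two additional scalar parameters.
    """
    total = sum(w.get(z, 0.0) for z in words)  # w(X) = sum of word weights
    return sigmoid(a * total + b)
```

For a fixed w, every document is summarized by the single number w(X), so two documents with the same total weight always receive the same estimate.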

The Motivation of the Calibration Measure
Let us revisit Example 1. This time, we fix the word weight function w. In this case, every document X can be represented using a single parameter $w(X) = \sum_i w(z_i)$, and we search for the optimal a and b such that the log-likelihood is maximized. This is illustrated in Figure 1. P(Y = 1|w(X)). Therefore, for the optimal a and b, we could say that the following property is roughly correct: In other words,
Let us examine this example more closely. The reason why the logistic regression classifier tells us that f(X) ≈ p is the following: among all the documents with similar weight w(X) in the training dataset, about a fraction p actually belong to the topic. This leads to an important observation: logistic regression estimates the conditional probabilities by computing the relative frequency of labels among documents it believes to be similar. This behavior is not unique to logistic regression. Many other algorithms, including decision tree classifiers, nearest neighbor (NN) classifiers, and neural networks, exhibit similar behavior:
• In decision trees, all data points reaching the same decision leaf are considered similar.
• In NN classifiers, all data points with the same nearest neighbors are considered similar.
• In neural networks, all data points with the same output layer node values are considered similar.
We can abstract the above conditional probability estimators as the following two-step process:
1. Partition the feature space X into several regions.
2. Estimate the relative frequency of labels among all data points inside each region.
The definition of the calibration property follows easily from the above two-step process. We can argue that the classifier is approximately calibrated if, for each region S in the feature space X, the output conditional probability of data points in S is close to the actual relative frequency of labels in S. The definition of the calibration property then follows from the fact that all data points inside each region have the same output conditional probabilities.
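The two-step process above can be sketched in a few lines. The partition is supplied by a caller-provided function `region_of`, which is a hypothetical stand-in for a decision-tree leaf index, a nearest-neighbor set, or any other notion of similarity; the function names and the fallback value for unseen regions are assumptions made for this illustration.

```python
from collections import defaultdict

def fit_region_frequencies(xs, ys, region_of):
    """Step 1: map each point to a region. Step 2: estimate the
    relative frequency of label +1 inside every region."""
    counts = defaultdict(lambda: [0, 0])  # region -> [positives, total]
    for x, y in zip(xs, ys):
        entry = counts[region_of(x)]
        entry[0] += int(y == 1)
        entry[1] += 1
    return {region: pos / total for region, (pos, total) in counts.items()}

def predict(x, freqs, region_of, default=0.5):
    """Output the stored frequency of x's region (default if unseen)."""
    return freqs.get(region_of(x), default)
```

By construction, all points in a region share one output, and that output equals the training relative frequency in the region, which is the empirical version of the calibration property.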

Using Calibrated Conditional Probabilities in Decision Making
The calibration property justifies the common practice of using estimated conditional probabilities in decision making. Consider the binary classification problem with asymmetric misclassification loss: we lose a points for every false positive and b points for every false negative. In this case, the best decision strategy is to predict 1 if $P(Y = 1|X) \ge \frac{a}{a+b}$ and predict −1 otherwise. Now consider the case when we do not know P(Y = 1|X), but only know the value of f(X) instead. If we can only use f(X) to make decisions, and f is calibrated, then the best strategy is to use f(X) in the same way as P(Y = 1|X) (the proof can be found in the appendix):
Claim 1. Suppose we are given a calibrated fuzzy classifier f : X → [0, 1], and we need to make decisions solely based on the output of f. Denote our decision as g : [0, 1] → {±1} (i.e., our decision for X is g(f(X))). Then the optimal strategy $g^*$ to minimize the expected loss is the following:
$$g^*(p) = \begin{cases} 1 & \text{if } p \ge \frac{a}{a+b} \\ -1 & \text{otherwise} \end{cases}$$
Then we have the following result:
Theorem 2. Let F be a set of fuzzy classifiers, i.e., functions from X to [0, 1]. Let H be the set of binary classifiers obtained by thresholding the output of fuzzy classifiers in F: Suppose the Rademacher Complexity of H satisfies:
The proof of this theorem, together with a discussion on the hypothesis class H, can be found in the appendix.
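The decision rule of Claim 1 can be sketched as follows, with a the cost of a false positive and b the cost of a false negative as in the text. The function names are illustrative. The threshold follows from comparing the two expected losses: predicting 1 costs a(1 − p) in expectation, predicting −1 costs bp, and the former is no larger exactly when p ≥ a/(a + b).

```python
def expected_loss(p, decision, fp_cost, fn_cost):
    """Expected loss of a decision when the label is +1 with probability p."""
    if decision == 1:
        return fp_cost * (1.0 - p)  # wrong only when the true label is -1
    return fn_cost * p              # wrong only when the true label is +1

def bayes_decision(p, fp_cost, fn_cost):
    """Predict +1 iff p >= a / (a + b), with a = fp_cost and b = fn_cost."""
    threshold = fp_cost / (fp_cost + fn_cost)
    return 1 if p >= threshold else -1
```

For instance, when a false negative is nine times as costly as a false positive, the threshold drops to 0.1, so even a modest estimated probability triggers a positive prediction.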

Applications of Theorem 2

Verifying the calibration of classifier
The first application of Theorem 2 is that we can verify whether the learned classifier f is calibrated. For simple hypothesis spaces F (e.g., logistic regression), the corresponding hypothesis space H has low Rademacher Complexity. In this case, Theorem 2 naturally guarantees the generalization of the calibration measure.
There are also cases where the Rademacher Complexity of H is not small. One notable example is SVM classifiers with Platt Scaling (Platt, 1999): Let F be the following hypothesis class:
The proof can be found in the appendix. In the case of SVM, the dimensionality of the feature space is usually much larger than the training dataset size (this is especially true for kernel SVM). In this situation, we can no longer verify the calibration property using only the training data, and we have to keep a separate validation dataset to calibrate the classifier (as suggested by Platt (1999)). When verifying the calibration of a classifier on a validation dataset, the hypothesis class is F = {f}, and it is easy to verify that:
Therefore, with enough validation data, we can still bound the calibration measure.

Implications on Learning Algorithm Design
Standard conditional probability estimation usually maximizes the likelihood to find the best classifier within the hypothesis space. However, since we can only guarantee the conditional probability estimates to be calibrated under the agnostic assumption, any calibrated classifier is essentially as good as the maximum likelihood estimate in terms of interpretability. Therefore, likelihood maximization is not necessarily the only method for estimating conditional probabilities.
There are other loss functions that are already widely used for binary classification. For example, hinge loss is at the foundation of large margin classifiers. Based on our discussion in this paper, we believe that these loss functions can also be used for conditional probability estimation. For example, Theorem 2 suggests the following constrained optimization problem:
where L(f, D) is the loss function we want to minimize. By optimizing over the space of empirically calibrated classifiers, we can ensure that the resulting classifier is also calibrated with respect to P.
In fact, the conditional probability estimation algorithm developed by Kakade et al. (2011) already implicitly follows this framework (more elaboration on this point can be found in the appendix). We believe that many more interesting algorithms can be developed along this direction.

Connection to the Calibration Problem
Suppose that we are given an uncalibrated fuzzy classifier $f_0$ : X → [0, 1], and we want to find a function g from [0, 1] to [0, 1], so that $g \circ f_0$ presents a better conditional probability estimate. This is the problem of classifier calibration, which has been studied in many papers (Zadrozny and Elkan, 2002; Platt, 1999).
Traditionally, calibration algorithms find the best link function g by maximizing likelihood or minimizing squared loss. In this paper, we suggest a different approach to the calibration problem. We can find the best g by minimizing the empirical calibration measure $c_{emp}(g \circ f_0)$. Let us assume w.l.o.g. that the training dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ satisfies
Then we have,
This expression can be used as the objective function for calibration: we search over the hypothesis space G to find a function g that minimizes this objective function. Compared to other loss functions, the benefit of minimizing this objective function is that the resulting classifier is more likely to be calibrated, and therefore provides more interpretable conditional probability estimates.
In fact, one of the most well-known calibration algorithms, the isotonic regression algorithm, can be viewed as minimizing this objective function: Then the optimal solution that minimizes the squared loss
The proof can be found in the appendix. Using this connection we proved several interesting properties of the isotonic regression algorithm, which can also be found in the appendix.
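For readers unfamiliar with isotonic regression, the following is a minimal sketch of the classical pool-adjacent-violators procedure for the squared-loss fit. It is a textbook formulation rather than the paper's own pseudo-code, and it assumes the inputs are 0/1 label indicators already ordered by increasing $f_0(x_i)$ with distinct scores; the name `pav` is illustrative.

```python
def pav(values):
    """Nondecreasing least-squares fit to `values` (pool adjacent violators).

    `values` are assumed to be ordered by increasing classifier score;
    for calibration they would be 1.0 for label +1 and 0.0 for label -1.
    """
    blocks = []  # each block holds [sum of values, number of points]
    for v in values:
        blocks.append([float(v), 1])
        # Merge while the previous block's mean exceeds the current one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted
```

Each fitted value is the relative frequency of positive labels inside its pooled block, which matches the two-step "partition, then estimate frequencies" view of conditional probability estimators.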

Empirical behavior of the calibration measure
In this section, we conduct some preliminary experiments to demonstrate the behavior of the calibration measure on some common algorithms. We use two binary classification datasets from the UCI Repository: ADULT and COVTYPE. COVTYPE has been converted to a binary classification problem by treating the largest class as positive and the rest as negative.
Figure 2: The empirical calibration error
Figure 2 shows the empirical calibration error $c_{emp}$ on test datasets for all methods. From the experimental results, it appears that Logistic Regression and Random Forest naturally produce calibrated classifiers, which is intuitive, as we discussed in the paper.
The calibration measure of Naive Bayes seems to depend on the dataset. For large margin methods (SVM and boosted trees), the calibration measures are high, meaning that they are not calibrated (on these two datasets).
There is also an interesting connection between the calibration error and the benefit of applying a calibration algorithm, which is illustrated in Figure 3. In this experiment, we used a loss parameter p to control the asymmetric loss: each false negative incurs a cost of 1 − p and each false positive incurs a cost of p. All the algorithms are first trained on the training dataset, then calibrated on a separate validation set of size 2000 using isotonic regression. For each algorithm, we compute the pre-calibration and post-calibration average losses on the testing dataset using the following decision rule: for each data point X, we predict Y = 1 if and only if the predicted Pr(Y = 1|X) ≥ p. Finally, we report the ratio between the two losses:
$$\text{loss ratio} = \frac{\text{the average loss after calibration}}{\text{the average loss before calibration}}$$
As we can see in Figure 3, the calibration procedure on average reduces the cost by 3%-5% for naive Bayes and random forest, 20% for SVM, 12% for boosted trees, and close to 0% for logistic regression. Comparing with the results in Figure 2, the two algorithms that benefit most from calibration (i.e., SVM and boosted trees) also have high empirical calibration error. This result suggests that if an algorithm already has a low calibration error to begin with, then it is not likely to benefit much from the calibration process. This finding could potentially help us decide whether we need to calibrate the current classifier using isotonic regression (Niculescu-Mizil and Caruana, 2005).
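The loss ratio described above can be computed with a short sketch. It follows the stated costs (p for a false positive, 1 − p for a false negative) and the stated decision rule (predict 1 if and only if the estimate is at least p); the function names and the plain-list inputs are illustrative assumptions.

```python
def average_loss(probs, labels, p):
    """Average asymmetric loss under the rule: predict +1 iff prob >= p."""
    total = 0.0
    for q, y in zip(probs, labels):
        prediction = 1 if q >= p else -1
        if prediction == 1 and y != 1:
            total += p            # false positive costs p
        elif prediction == -1 and y == 1:
            total += 1.0 - p      # false negative costs 1 - p
    return total / len(labels)

def loss_ratio(probs_before, probs_after, labels, p):
    """Average loss after calibration divided by average loss before."""
    return average_loss(probs_after, labels, p) / average_loss(probs_before, labels, p)
```

A ratio below 1 means calibration reduced the average cost. The sketch does not guard against a zero loss before calibration, in which case the ratio is undefined.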

Conclusion
In this paper, we discussed the interpretability of conditional probability estimates under the agnostic assumption. We proved that it is impossible to upper bound the $l_1$ error of conditional probability estimates in this scenario. Instead, we defined a novel measure of calibration to provide interpretability for conditional probability estimates. The uniform convergence result between the measure and its empirical counterpart allows us to empirically verify the calibration property without making any assumption on the underlying distribution: the classifier is (almost) calibrated if and only if the empirical calibration measure is low. Our result provides new insights on conditional probability estimation: ensuring empirical calibration is already sufficient for providing interpretable conditional probability estimates, and thus many other loss functions (e.g., hinge loss) can also be utilized for estimating conditional probabilities.

Appendix

Experimental Simulation of Example 1
Here we experimentally simulate Example 1 to illustrate that the logistic regression classifier has large $l_1$ error. We use Latent Dirichlet Allocation (LDA) (Blei et al., 2003), the state-of-the-art generative model for documents, to generate datasets. The detailed experiment settings are listed below:
• The dataset consists of 20000 documents, the number of topics is 20, the dictionary size is 1000, and the average number of words in each document is 200.
• We use the non-informative Dirichlet prior α = (1, 1, . . ., 1) over topics. The word distribution in each topic follows a power law with a random order among words.
• For each document, we randomly sample with replacement 10 topic labels from the topic distribution.
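The settings listed above can be turned into a small generator sketch. Several details are not specified in the text and are assumptions here: the power-law exponent (taken as 1), the document length distribution (taken as Poisson around the stated average), and the function names `power_law_topics` and `generate_document`. How the sampled topic labels are converted into the binary label Y is left out, since the text does not spell it out.

```python
import numpy as np

def power_law_topics(rng, n_topics=20, vocab=1000, exponent=1.0):
    """One word distribution per topic: power-law weights (assumed
    exponent) assigned to the words in an independent random order."""
    base = 1.0 / np.arange(1, vocab + 1) ** exponent
    base /= base.sum()
    return np.stack([rng.permutation(base) for _ in range(n_topics)])

def generate_document(rng, topic_word, alpha, mean_length=200, n_labels=10):
    """Sample one LDA document and its topic labels.

    Returns (word ids, topic labels, topic mixture theta).
    """
    n_topics, vocab = topic_word.shape
    theta = rng.dirichlet(alpha)                 # per-document topic mixture
    length = max(1, rng.poisson(mean_length))    # assumed length distribution
    topics = rng.choice(n_topics, size=length, p=theta)
    words = np.array([rng.choice(vocab, p=topic_word[t]) for t in topics])
    # Topic labels sampled with replacement from the topic distribution.
    labels = rng.choice(n_topics, size=n_labels, replace=True, p=theta)
    return words, labels, theta
```

Because the topic mixture theta is known for every generated document, the true conditional probabilities are directly computable, which is the reason the text gives for using synthetic documents.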
Table 1 reports the mean experiment results and the standard deviation across five runs. For reference, we also include the relative frequency of labels, and the $l_1$ error achieved by the trivial classifier that always outputs the global relative frequency of labels as the conditional probability.
2. With probability at least $1 - \delta_2$, a dataset D with n i.i.d. samples from P satisfies:
Then there exists another probability distribution P′ such that:
1. With probability at least $1 - \delta_1 - \delta_2$, a dataset D with n i.i.d. samples from P′ will also pass V.
Proof. First we construct the following distribution over all possible P′ satisfying the last two conditions: where Q(p′, p) is defined as:
Now it is sufficient to show that if we sample P′ according to the above distribution and then sample D from P′, then with probability at least $1 - \delta_1 - \delta_2$, D will pass V. Assuming this is true, at least one distribution P′ has to satisfy the first condition, which proves the existence of P′.
To compute the probability that D would pass V, denote Note that all P′ have the same marginal distribution over X, therefore:
We only consider those $D_X$ with distinct $X_i$ values. Based on the assumption, such $D_X$ account for at least $1 - \delta_2$ of the probability mass. Now the important observation is that for every fixed $D_X$ with distinct X values, the marginal distribution of $D_Y$ given $D_X$ (i.e., marginalizing over P′) is exactly $P(D_Y|D_X)$, the distribution in which we sample labels independently from P(Y|X) for each $X_i$ in $D_X$:
The latter probability is actually the probability that D will pass V and have distinct X values at the same time. Based on the assumptions in the lemma, this occurs with probability at least $1 - \delta_1 - \delta_2$.
Now, given this lemma, the proof of Theorem 1 is easy: we show that if any prover $A_f$ satisfies the two conditions in the theorem, it can be used as the verifier V in the lemma such that no P′ can satisfy all three conditions.
Let $\delta_1 = \frac{1}{3}$; then the first assumption in the lemma is satisfied. Also, since $\forall x \in X$, $Q(x) < \frac{1}{10n^2}$, we have: By a union bound, we have: Therefore we can set $\delta_2 = 0.1$. By the above lemma, there exists another P′ such that and On the other hand, note that the $l_1$ distance between P and P′ is at least B; then by the properties of $A_f$, D cannot pass $A_f$ with probability greater than $\frac{1}{3}$. This contradicts our earlier result. Therefore no such $A_f$ can exist.
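The step that justifies $\delta_2 = 0.1$ can be spelled out as a short worked bound. This is a sketch of one standard way to obtain it under the stated assumption $Q(x) < \frac{1}{10n^2}$, and may differ in form from the authors' own derivation; $X_1, \ldots, X_n$ denote the feature values of the n i.i.d. samples.

```latex
% Collision probability for a single pair i \ne j:
\Pr[X_i = X_j] \;=\; \sum_{x \in \mathcal{X}} Q(x)^2
  \;\le\; \Big(\max_{x} Q(x)\Big) \sum_{x \in \mathcal{X}} Q(x)
  \;<\; \frac{1}{10 n^2}.

% Union bound over all \binom{n}{2} < n^2/2 pairs:
\Pr[\exists\, i < j :\ X_i = X_j]
  \;\le\; \binom{n}{2} \cdot \frac{1}{10 n^2}
  \;<\; \frac{1}{20} \;\le\; 0.1 .
```

Hence, with probability at least 0.9, all n sampled feature values are distinct, which is the event the lemma's second condition requires.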

Proof of Claim 1
Proof. The expected loss is Define S = {f(X) : X ∈ X}, then we have: Therefore, the optimal $g^*$ has $g^*(p) = 1$ if and only if: which is equivalent to:

Proof of Theorem 2
Proof. We will use the following uniform convergence result (Shalev-Shwartz and Ben-David, 2014):
Theorem 3. Let D be i.i.d. samples of (X × Y, P), then with probability at least 1 − δ,
In the following we sometimes allow G to be a collection of functions from X to [0, 1] in the above results. When used in this sense, we assume that the function does not use the label y: g(x, y) = g(x).
Define $F_{D,p_1,p_2}(f)$ to be the relative frequency of the event $\{p_1 < f(x) \le p_2,\ y = 1\}$: Define $F_{P,p_1,p_2}(f)$ to be the probability of the same event: Define $E_{D,p_1,p_2}(f)$ as the empirical expectation of $f(x)\mathbb{1}_{p_1 < f(x) \le p_2}$: Define $E_{P,p_1,p_2}(f)$ as the expectation of the same function: When the context is clear, the subscripts $p_1$ and $p_2$ can be dropped. Using these notations, we can rewrite $c(f)$ and $c_{emp}(f, D)$ as follows: Therefore it suffices to show that Then we have the following lemma:
Lemma 2. Let $H_1$, $H_2$ be as defined above, then:
Proof. For $R_D(H_1)$, we have: where the last step is because $t_i = \sigma_i \max(z_i, y_i)$ is uniformly distributed over {±1}, independent of the value of $y_i$.
For $R_D(H_2)$, we have: where the second step is due to $f(x) = \int_0^1 \mathbb{1}_{t < f(x)}\, dt$, and the fourth step is just substituting $\max(p_1, t)$ with $p_1$. Since there is no constraint on $p_1$, $p_1$ can take any value greater than or equal to t.
Combining this lemma with the assumptions in the theorem: By Equation (1):

Proof of Claim 2
Proof. For any $\sigma \in \{\pm 1\}^n$, we can find a vector w such that for every $X_i$, we have $w^T X_i = \sigma_i$ (this is always possible since the number of equations n is less than the dimensionality d). Let $w^* = \frac{Bw}{\|w\|_2}$ so that $\|w^*\|_2 = B$, and let $a = \lambda\|w\|_2 / B$ and $b = 0$. Then we have:

Algorithm 1: Isotonic Regression Calibration Algorithm (PAV Algorithm)
1. For i = 0, . . ., n, compute $P_i = (i, S_i)$, where $S_i = \sum_{j \le i} \mathbb{1}_{y_j = 1}$.
2. Let cv(P) be the convex hull of the set of points $P_i$.
3. For i = 0, . . ., n, let $Z_i$ be the intersection of cv(P) and the line x = i.
5. Let $g(f_0(x_i)) = z_i$, and extrapolate these points to get a continuous nondecreasing function g.

We remark that H is different from the hypothesis class $H_{p_1,p_2}$, where the thresholds are fixed in advance: In general, the gap between the Rademacher Complexities of $H_0$ and $H_{p_1,p_2}$ can be arbitrarily large. The following example illustrates this point.
Example 2. Let X = {1, . . ., n}, and let $A_1, A_2, \ldots, A_{2^n}$ be a sequence of sets containing all subsets of X. Let H be the following hypothesis space: Combining the last two inequalities, we get the desired result.

Proof of Claim 3
Proof. For reference, the pseudo-code of the PAV algorithm for isotonic regression (Niculescu-Mizil and Caruana, 2005) can be found in Algorithm 1.
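The convex-hull formulation in Algorithm 1 can be sketched as follows. The sketch assumes that the relevant part of the hull is its lower boundary (the proof notes that the hull values never exceed $S_i$), that the fitted values $z_i$ are the increments $Z_i - Z_{i-1}$ of that boundary, and that the labels are already ordered by increasing $f_0(x_i)$; these are readings of the garbled pseudo-code, and the name `convex_minorant_slopes` is illustrative.

```python
def convex_minorant_slopes(labels):
    """Slopes of the lower convex hull of the points (i, S_i), where S_i
    counts the +1 labels among the first i points (S_0 = 0)."""
    s = [0]
    for y in labels:
        s.append(s[-1] + (1 if y == 1 else 0))
    hull = []  # indices of the lower-hull vertices, left to right
    for i in range(len(s)):
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            # Drop vertex k if it lies on or above the segment from j to i.
            if (s[k] - s[j]) * (i - j) >= (s[i] - s[j]) * (k - j):
                hull.pop()
            else:
                break
        hull.append(i)
    z = []
    for j, k in zip(hull, hull[1:]):
        # Every point between two hull vertices gets the segment slope.
        z.extend([(s[k] - s[j]) / (k - j)] * (k - j))
    return z
```

On the labels (+1, −1, +1, −1, +1) the hull vertices are at indices 0, 4 and 5, so the first four points receive 0.5 and the last receives 1.0, which agrees with the usual pool-adjacent-violators fit.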
Let $z_i = g(f_0(x_i))$, then we can rewrite the objective function as: To prove that Algorithm 1 also minimizes this objective function, we first state the minimization problem as a linear program: Then we have the following constraints: Let $Z_i^*$ be the solution produced by Algorithm 1; it should be obvious that $Z_i^* \le S_i$ for all i. Therefore, We need to prove that $\xi_1^* \le \xi_1 + \xi_2$ for every feasible solution $(Z_i, \xi_i)$. Suppose $\xi_1^* = \frac{1}{n}(S_k - Z_i^*)$, and $Z_i^*$ lies on the line segment $\{(j, S_j), (k, S_k)\}$. Then we have: Because of the convexity constraint on Z, it must satisfy the following inequality: Computing the difference between these two, we get We get $n\xi_1^* \le n\xi_1 + n\xi_2$, which proves the optimality of $Z^*$.

Properties of Isotonic Regression
We can prove several interesting properties of isotonic regression using Theorem 2.
Claim 5. Let $g^*$ be the calibrating function produced by Algorithm 1, then: of the calibrated classifier is always 0.
2. For any asymmetric loss (1 − p, p) (i.e., each false negative incurs a cost of 1 − p and each false positive incurs a cost of p), the empirical loss of the calibrated classifier is never greater than that of the original classifier (both using the optimal decision threshold p): In particular, when p = 0.5, the empirical accuracy of the calibrated classifier is always greater than or equal to the empirical accuracy of the original classifier.
Proof. Throughout the proof, let C be the convex hull computed in Algorithm 1: We will use the following notations: 1. For any $p_1$, $p_2$, let l, r be such that: If no such k exists, let l, r be 0 respectively. By Algorithm 1, we have Thus we have $(l, S_l), (r, S_r) \in C$, $Z_l = S_l$, $Z_r = S_r$, and therefore We consider two separate cases:
We can also use Theorem 2 to derive the following non-asymptotic convergence result for Algorithm 1.
Claim 6. Let $F(t) = P(f_0(X) \le t)$ be the distribution function of $f_0(X)$, and define G(t) as:
Let us explain the intuition behind this claim: F(t) is the percentage of data points satisfying $f_0(X) \le t$, and G(t) is F(t) times the conditional probability of Y = 1 in the region $\{f_0(X) \le t\}$. Now consider the points $P_i = (i, S_i)$ in Algorithm 1; it is not hard to show that as n → ∞, the limit of the points $P_i$ is the curve (F(t), G(t)), t ∈ [0, 1] (after proper scaling). Similarly, $G_e(t)$ is F(t) times the expected value of $g^*(f_0(X))$ in the region $\{f_0(X) \le t\}$, and it is not hard to show that $(F(t), G_e(t))$ is the limit of $(i, Z_i)$ (after proper scaling). The claim states that in the PAV algorithm, $(F(t), G_e(t))$ converges uniformly to the convex hull of (F(t), G(t)), which should not be surprising, since we explicitly computed the convex hull of $\{P_i\}$ in Algorithm 1.
When $P(Y = 1|f_0(X))$ is monotonically increasing w.r.t. $f_0(X)$, (F(t), G(t)) is convex, and Claim 6 immediately implies that $G_e(t)$ will converge uniformly to G(t). In this case, the PAV algorithm will eventually recover the "true" link function $g^*(f_0(X)) = P(Y = 1|f_0(X))$ given sufficient training samples, and Claim 6 provides a rough estimate of the number of samples required to achieve the desired precision.

Figure 1: Illustration of Example 1

$a, b \in \mathbb{R}\}$ If the training data size n < d and the training data $X_i$ are linearly independent, then $R_D(H) = \frac{1}{2}$.

Figure 3: The loss ratio on two datasets. On average, the calibration procedure reduces the cost by 3%-5% for naive Bayes and random forest, 20% for SVM, 12% for boosted trees, and close to 0% for logistic regression.

$\sum_{i=1}^{n} \mathbb{1}_{\sigma_i = 1}$, and the conclusion of the claim follows easily.

The Hypothesis Class H
In Theorem 2, H is the collection of binary classifiers obtained by thresholding the output of a fuzzy classifier in F. For many hypothesis classes F, the Rademacher Complexity of H can be naturally bounded. For instance, if F is the class of d-dimensional generalized linear classifiers with monotone link function, then $\mathbb{E}_D R_D(H)$ can be bounded by $O(\sqrt{d \log n / n})$.
(a) $a \le b$: in this case we only need to show
$$\sum_{i=a+1}^{b} \left[p\,\mathbb{1}_{y_i = 0} - (1 - p)\,\mathbb{1}_{y_i = 1}\right] \ge 0$$
or equivalently,
$$p\left[(b - a) - (S_b - S_a)\right] - (1 - p)(S_b - S_a) \ge 0$$
Rearranging terms, it suffices to show
$$p(b - a) - (S_b - S_a) \ge 0$$
Since $S_b = Z_b$ and $S_a \ge Z_a$,
$$S_b - S_a \le Z_b - Z_a \le z_b (b - a) \le p(b - a)$$
(b) $a > b$: in this case we only need to show
$$\sum_{i=b+1}^{a} \left[p\,\mathbb{1}_{y_i = 0} - (1 - p)\,\mathbb{1}_{y_i = 1}\right] \le 0$$
or equivalently,
$$p\left[(a - b) - (S_a - S_b)\right] - (1 - p)(S_a - S_b) \le 0$$
Rearranging terms, it suffices to show
$$p(a - b) - (S_a - S_b) \le 0$$
Since $S_b = Z_b$ and $S_a \ge Z_a$,
$$S_a - S_b \ge Z_a - Z_b \ge z_{b+1} (a - b) \ge p(a - b)$$

Table 1: $l_1$ error and empirical calibration
As we can see from Table 1, logistic regression achieves only 0.13 average $l_1$ error, while even the trivial classifier achieves 0.2. This implies that logistic regression performs poorly on this example in terms of $l_1$ error.
Proof. The proof relies on the following lemma:
Lemma 1. Let P be a distribution over X × Y. Let D be a size-n i.i.d. sample set from P. Let V be a verifier of P given D (i.e., V is a function from $\{X \times Y\}^n$ to {0, 1}), such that
1. With probability at least $1 - \delta_1$, a dataset D with n i.i.d. samples from P will pass V:

F contains $2^n$ classifiers; the ith classifier produces an output of either $\frac{i}{2^n}$ or $\frac{i}{2^n} - \frac{1}{2^{n+1}}$ depending on whether $x \in A_i$. One can easily verify that for any $p_1$, $p_2$, the VC-dimension (Vapnik and Chervonenkis, 1971) of $H_{p_1,p_2}$ is at most 2, but the VC-dimension of H is n. However, if for any x ∈ X, f ∈ F, we have $f(x) \in P^*$ with $|P^*| < \infty$, then $R_D(H)$ can be bounded using the maximum VC-dimension of $H_{p_1,p_2}$ and $\log |P^*|$:
Claim 4. If for any f ∈ F, x ∈ X, we have $f(x) \in P^*$ where $P^*$ is a finite set, and for all $p_1, p_2 \in \mathbb{R}$, the VC-dimension of the hypothesis space $H_{p_1,p_2}$ is at most d, then for any sample D of size n with n > d + 1 we have:
Since f(x) only takes finitely many possible values, we only need to consider values of $p_1$, $p_2$ in $P^* \cup \{-\infty\}$. Therefore, by the union bound we have
$$|H(S_n)| \le \sum_{p_1, p_2 \in P^* \cup \{-\infty\}} |H_{p_1,p_2}(S_n)|$$
Since each $H_{p_1,p_2}$ has VC-dimension at most d, by Sauer's Lemma (Shalev-Shwartz and Ben-David, 2014):
$$\forall p_1, p_2,\quad |H_{p_1,p_2}(S_n)| \le (en/d)^d$$