1 Introduction

There is an increasing trend in using Deep Neural Networks (DNNs) to automate a multitude of tasks, like image classification for healthcare [1] and speech recognition [2], among many others. Some of these applications are high-risk in nature; for example, a False Negative given by a DNN used for cancer detection could be fatal for the patient. Hence, it is of paramount importance to use reliable Machine Learning (ML) systems that acknowledge the uncertainty of their predictions. A probabilistic classifier that outputs a confidence value, or probability, for each class allows one to make Bayes decisions---i.e., optimal decisions that account for the cost of each decision [3]. So, by applying the Bayesian framework we can readily account for such uncertainty, assuming that the system produces calibrated confidence values.

The extent to which the confidence outputs of a classifier can be interpreted as class probabilities is what is known as the calibration of a classifier [4, 5]. Modern DNNs achieve very low test error rates but are not necessarily well-calibrated [6, 7]. Hence, there is a growing interest from the ML community toward improving the calibration of DNNs.

1.1 Related work

One approach to obtaining better confidence estimates is to average the predictions of different models using ensembles [8] or by taking a Bayesian approach to model learning [9, 10]. Data Augmentation techniques have also been used to improve calibration [11,12,13], as well as modified training objectives [14,15,16]. Among popular approaches, and the focus of this work, is post-hoc calibration, in which the predictions of an already trained classifier are re-calibrated (Fig. 1). Typically, another model, the calibrator, is trained on the outputs of the classifier to be calibrated. This approach is very convenient since one can take advantage of the existing literature and use off-the-shelf ML systems that present all the desirable properties that make them so popular. Deep Learning models usually offer a good solution for a wide range of Machine Learning tasks and, for this reason, DNNs have become standard models that are easy to apply via public frameworks like Pytorch [17] and Tensorflow [18]. Post-hoc calibration allows us to use the existing stack to solve high-risk tasks and address over-confidence issues with little overhead.

Fig. 1 Post-hoc calibration

Probably the most popular post-hoc calibration method for DNNs is Temperature Scaling (TS), proposed by [6]. It is a single-parameter model that re-scales the confidence predictions by a temperature factor. The simplicity of this method, and the fact that it seems to perform even better than more complex ones, led the authors to believe that the problem of re-calibration is inherently simple. However, recent alternatives based on expressive models like Bayesian Neural Networks [19] and Gaussian Processes [20] outperform TS, suggesting that re-calibration might be a more complex problem than previously assumed. On the other hand, expressive models can be more data-hungry and may require careful tuning when the amount of data is limited. Hence, TS still represents the standard calibration method, as it yields a good trade-off between simplicity and performance, especially when recalibration data is limited.

Based on the observation that miscalibration on modern DNNs is often caused by over-confidence [6, 21], recent work proposes to learn more complex calibration functions than TS but from a constrained space by imposing some restrictions, like being accuracy-preserving [22] and order-invariant [23], inducing a bias toward the desired calibration functions. This approach shows promising results, but it may still fail in low-data scenarios, especially when using over-parameterized models. This can be a huge limitation in tasks where data for calibration is usually scarce, like certain language recognition tasks where some languages can be underrepresented [24, 25], or in the medical diagnosis of very rare diseases [26]. Thus, there is a need for calibration methods with low data requirements in many real applications.

We propose to use a simple model with a suitable inductive bias for the following reasons: First, the set of possible calibration functions that the model can learn---i.e., the hypothesis space---is reduced. This translates into an easier training objective requiring less tuning. Moreover, if the bias is well-specified, the learned calibration function will be more robust against a lack of training data and will generalize better to other data [27]. The quality of the inductive bias depends on the knowledge we have of the task at hand. For example, the specific architecture of Convolutional Neural Networks (CNNs), based on convolution filters, explains their success on visual recognition tasks [28], even though weight sharing reduces the total number of parameters and constrains the learning space.

1.2 Contributions

To gain knowledge about the calibration of modern DNNs, we provide a study of post-hoc calibration methods. We analyze several state-of-the-art calibrators of varying degrees of expressiveness and robustness to help design models more resilient to data scarcity. We focus on the problem of confidence scaling as the bad calibration properties of DNNs are mainly attributed to over-confidence [6].

To perform this study we focus on Adaptive Temperature Scaling (ATS) methods, a family of calibration maps that generalizes TS by making the temperature factor input-dependent. This idea has already been proposed in the literature [29], but ATS methods differ from previous ones in the specific form of the input dependency. Previous methods propose to estimate the temperature factor as a function of the classifier input. ATS models, on the other hand, compute temperature factors directly from the output of the classifier via a temperature function. Within this family, we can compare several calibration methods which extend the expressiveness of TS in different ways.

We analyze and benchmark several calibration models, focusing on the shape of those temperature functions that lead to better calibration. Results show that highly parameterized methods achieve high performance when there is plenty of data, but also that they are doomed to fail in low-data scenarios. By examining the behavior of expressive methods under ideal conditions, we notice that their temperature function shows a dependency between the entropy of a prediction and its degree of overconfidence. Based on this gained knowledge about the post-hoc calibration task, we develop Entropy-based Temperature Scaling (HTS), a method with a strong inductive bias that is robust to the size of the dataset and provides performance comparable to other state-of-the-art methods. We provide an interpretation of the method based on the use of entropy as a measure of uncertainty. This interpretation helps to understand the relation between overconfidence and uncertainty estimation.

The rest of the paper is organized as follows. First, we introduce some theoretical background of the calibration task. In Sect. 3 we propose some post-hoc calibration methods motivating their design, and also describe other existing techniques to which we compare our methods. Next, in Sect. 4 we describe the performed experiments, point out some observations, and show the results. Finally, in the last section, we give our conclusions and comment on possible future work.

2 Background

In this work we focus on the calibration of classifiers, so before delving into the concept of calibration we introduce the following notation to describe the multi-class classification task. Let \(x \sim X \in \mathcal {X}\) be the input random variable with associated target \(y \sim Y \in \mathcal {Y}\), where \(y = [y_1, y_2,..., y_K]\) is a one-hot encoded label. The goal is to obtain a probabilistic model f for the conditional distribution \(P(Y|X=x)\). Notice that, since Y is a categorical vector encoding the true class, any distribution P(Y) on Y follows a categorical distribution. The model defines the function \(f(x) = z, \, x \in \mathcal {X}, \, z \in \mathbb {R}^K\). The outputs z of the model are known as logits since they are later mapped to probability vectors via the softmax function:

$$\begin{aligned} q = \sigma _\mathrm{{{SM}}}(z) = \frac{\exp z}{\sum _{k=1}^K \exp z_k}, \end{aligned}$$
(1)

where the exponential in the numerator is applied element-wise. The output \(q \in \mathbb {S}^K\) is the corresponding probability vector that lies in the probability simplex in K classes \(\mathbb {S}^K\) and \(q_k\) is the model predicted value for \(P(y_k | x)\).

In practice there is no distribution P(X, Y) (or we do not have access to it). Instead, we have a labeled data set \(\mathcal {D}\) of N pair-realizations \(\mathcal {D} = \{x^{(i)}, y^{(i)}\}_{i=1}^N\) that is used to approximate it. For example, DNNs are normally trained by minimizing the expected value of some loss function \(\mathcal {L}(f(x), y)\) over the empirical distribution induced by placing a Dirac’s delta at each point of \(\mathcal {D}\):

$$\begin{aligned} \sum _{(x, y) \in \mathcal {D}}\mathcal {L}(f(x), y). \end{aligned}$$

2.1 Calibration

A probabilistic classifier is said to be well-calibrated whenever its confidence predictions for a given class match the chances of that class being the correct one [4, 5]. We can express this property as an equation in terms of the probability distributions introduced earlier:

$$\begin{aligned} P(y\,|\,q) = q, \,\, \forall q \in \mathbb {S}^K, \end{aligned}$$
(2)

where \(P(y\,|\,q)\) represents the relative class frequency ---i.e., the proportion of each class on the set of all possible realizations of x for which the classifier predicts q.

From this expression, it is easy to derive a measure of miscalibration or Calibration Error (CE):

$$\begin{aligned} \mathrm{{CE}} = \mathbb {E}_{P(X, Y)}\big [ \left\Vert P(y\,|\,q) - q\right\Vert _d\big ]. \end{aligned}$$
(3)

That is, the expected value of the d-norm of the difference between prediction vectors and the relative class proportions.

While this equation might be useful to illustrate the concept of miscalibration, it does not provide a feasible way to measure it. First, we cannot compute the expected value w.r.t. the non-existent distribution P(X, Y), and, more importantly, there is no simple way of evaluating \(P(y\,|\,q)\). The former can be readily solved by using the labeled dataset \(\mathcal {D}\) as MC samples, yet the latter remains the main limitation. Thus, further approximations are required to estimate the miscalibration of a classifier.

2.1.1 ECE

The most popular metric used to estimate the Calibration Error is the Expected Calibration Error (ECE) [6, 30]. This metric uses a histogram approach to model \(P(y\,|\,q)\) and considers only top-label predictions---i.e., \(\max (q)\). The samples of a given evaluation set \(\mathcal {D}_\mathrm{{{test}}}\) are partitioned into M bins \(\{B_1, B_2,..., B_M\}\) according to the confidence of their top prediction:

$$\begin{aligned} B_i {:}{=} \Big \{(x, y) \in \mathcal {D}_\mathrm{{{test}}} :\; \frac{i-1}{M} < \max (q) \le \frac{i}{M} \Big \}. \end{aligned}$$

Then the ECE is computed as:

$$\begin{aligned} \mathrm{{ECE}} = \sum _{i=1}^M \frac{|B_i|}{|\mathcal {D}_\mathrm{{{test}}}|}\,|\mathrm{{acc}}(B_i) - \mathrm{{cof}}(B_i)|, \end{aligned}$$
(4)

where \(|\cdot |\) denotes the number of samples in a set, \(\mathrm{{acc}}(B_i)\) is the accuracy of the classifier evaluated only on \(B_i\), and \(\mathrm{{cof}}(B_i)\) is the mean confidence of the top-label predictions in \(B_i\).
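
As an illustration, the ECE of Eq. 4 can be computed with a few lines of NumPy. The following is a minimal sketch of the standard equal-width binning; the function and variable names are our own.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE of Eq. 4 with M equal-width bins.
    probs: (N, K) array of predicted probabilities; labels: (N,) true class indices."""
    conf = probs.max(axis=1)                       # top-label confidence max(q)
    pred = probs.argmax(axis=1)                    # predicted class
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)        # bin B_i: lo < max(q) <= hi
        if in_bin.any():
            acc = correct[in_bin].mean()           # acc(B_i)
            cof = conf[in_bin].mean()              # cof(B_i)
            ece += in_bin.mean() * abs(acc - cof)  # |B_i| / |D_test| * gap
    return ece
```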

Despite its popularity, this estimator provides unreliable results, as it is biased and noisy [31,32,33]. Many improvements over the ECE have been proposed to mitigate these problems, such as class-wise ECE and variable-width confidence intervals [33]. However, no binning scheme is consistently reliable [34]. Nevertheless, ECE remains the most popular metric used by the community to measure miscalibration, and we use it in our experiments to report results for the sake of comparison.

2.1.2 Proper scoring rules

One way to implicitly measure calibration is to use Proper Scoring Rules (PSRs). Any PSR can be decomposed into the sum of two terms [35]: a refinement term and the so-called reliability or calibration term. Thus, when evaluating the goodness of a classifier with a Proper Scoring Rule, one is also indirectly measuring calibration. The fact that the calibration component cannot be evaluated in isolation is what drives the community to use approximate metrics like ECE. Moreover, different PSRs may rank the same set of systems evaluated on the same data differently. Nevertheless, PSRs provide a theoretically grounded way of measuring the goodness of a classifier. Throughout this work, we use two well-known PSRs to evaluate models: the log-score or Negative Log-Likelihood (NLL) and the Brier score [36].

2.1.3 Entropy

Because the output of a probabilistic classifier represents a categorical distribution, the entropy of a prediction vector q is defined as:

$$\begin{aligned} H(q) = -\sum _{k=1}^{K} q_k \log q_k. \end{aligned}$$
(5)

It is easy to check that entropy reaches its maximum of \(H(q) = - \log 1/K = \log K\) at \(q = [1/K, 1/K,..., 1/K]\). Likewise, it is minimized at the vertices of the probability simplex---i.e., \(q_k = 1\) for some class k---where it takes the value \(H(q) = 0\). From this behavior, the interpretation of the entropy of the predictive class distribution as a measure for uncertainty quantification follows naturally [37, 38].

The ECE metric considers only the confidence value assigned to the top-rated class. This value represents the class probability estimated by the classifier. While it is a confidence value, it does not represent the ‘confidence’ of the classifier in the whole prediction; it concerns only the predicted class. Conversely, the entropy of the prediction vector is a measure of uncertainty of the whole prediction---i.e., an alternative and more comprehensive way of assessing the classifier’s confidence in a prediction.

For instance, we may have two predictions \(q^{(i)} = [0.6, 0.2, 0.2]\) and \(q^{(j)} = [0.6, 0.4, 0.0]\) in a 3-class problem. Both assign the same confidence 0.6 to class 1, but it is clear that \(q^{(i)}\) is a higher-entropy prediction than \(q^{(j)}\)---i.e., it is a more uncertain prediction.
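
A quick numerical check of this example, as a self-contained NumPy snippet; the entropy helper is our own and simply implements Eq. 5 with the convention \(0 \log 0 = 0\).

```python
import numpy as np

def entropy(q):
    """Shannon entropy of a categorical distribution (Eq. 5), with 0*log(0) = 0."""
    q = np.asarray(q, dtype=float)
    log_q = np.zeros_like(q)
    np.log(q, out=log_q, where=q > 0)   # leave the log(0) entries at 0
    return float(-np.sum(q * log_q))

q_i = [0.6, 0.2, 0.2]
q_j = [0.6, 0.4, 0.0]
print(entropy(q_i))   # ~0.950 nats
print(entropy(q_j))   # ~0.673 nats: same top confidence, lower uncertainty
```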

2.2 Post-hoc calibration

Ideally, a model f trained on some data \(\mathcal {D}\) would generalize and show good calibration properties when evaluated on other data \(\mathcal {D}_\mathrm{{{test}}}\), assuming both sets are reasonably similar. However, many classification systems turn out to be badly calibrated in practice; for instance, Convolutional Neural Networks (CNNs) tend to produce overconfident predictions [6, 21]. Moreover, in some tasks, it cannot be guaranteed that the training data is similar enough to the actual data on which the model will be deployed. For instance, a language recognition system may be trained on broadcast narrowband speech (BNBS) data but applied in a telephone service where the audio characteristics are different. To solve this problem, one common approach is that of post-hoc calibration, in which a function is applied to the outputs of the model. This function can be seen as a decoupled classifier that learns to map uncalibrated outputs to calibrated ones---i.e., \(q \mapsto \hat{q}\). We use the \(\hat{\cdot }\) notation to denote the calibrated prediction. The standard practice is to fit this calibration map, or simply calibrator, in a held-out data set \(\mathcal {D}_{cal}\), also called calibration data, that is supposed to resemble the data on which the model will make predictions.

Many post-hoc calibration methods take prediction logits as input instead of the final probability vectors. Notice that this does not limit their applicability, since the outputs q of a probabilistic model can be mapped to the logit domain through the logarithmic function \(z = \log q + k\), where k is an arbitrary scalar value and the logarithm is applied element-wise. Since post-hoc calibration only requires access to the outputs of the uncalibrated model, it is common practice to work directly with logit-groundtruth pairs (z, y). So, throughout this work, when we talk about the calibration set we may refer to the set defined as:

$$\begin{aligned} \mathcal {D}_\mathrm{{{cal}}}' = \{(f(x), y) | (x, y) \in \mathcal {D}_\mathrm{{{cal}}}\}, \end{aligned}$$

where \(z = f(x)\) are the logit outputs of the uncalibrated model. For the sake of simplicity we will refer to this as just \(\mathcal {D}_\mathrm{{{cal}}}\).

The standard procedure for post-hoc calibration is laid out in Algorithm 1. Note that we can use the same \(\text {trainClassifier}\) function to train the classification model and the calibration model. This is because, as we explained above, post-hoc calibration can be seen as a classification task in which the inputs are the outputs of an uncalibrated model.

Algorithm 1 Procedure to train and calibrate a classifier.
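
The following is a minimal PyTorch sketch of this procedure, not the exact pseudocode of Algorithm 1: the classifier is assumed to be already trained and frozen, the calibrator is any module mapping logits to logits, and the simple full-batch loop is an illustrative choice of ours.

```python
import torch
import torch.nn.functional as F

def fit_calibrator(classifier, calibrator, x_cal, y_cal, epochs=100, lr=1e-4):
    """Fit a post-hoc calibrator on the frozen classifier's outputs.
    classifier: callable mapping inputs to logits (already trained, kept fixed);
    calibrator: torch.nn.Module mapping logits to calibrated logits."""
    with torch.no_grad():
        z_cal = classifier(x_cal)                 # precompute logits: D_cal' = {(z, y)}
    opt = torch.optim.SGD(calibrator.parameters(), lr=lr, momentum=0.9, nesterov=True)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(calibrator(z_cal), y_cal)  # NLL, a proper scoring rule
        loss.backward()
        opt.step()
    return calibrator
```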

2.2.1 Accuracy-preserving calibration

Modern classification systems achieve very low test error rates and their miscalibration is attributed mainly to over-confidence---i.e., predicted confidences that call for higher accuracy rates than those actually obtained. Under this assumption, it is reasonable to constrain the calibration transforms so that the predicted ranking over the classes is maintained. This condition is known as accuracy-preserving [22] because functions that meet it do not change the top-label prediction:

$$\begin{aligned} \mathop {\textrm{arg max}}\nolimits _{k} q_k = \mathop {\textrm{arg max}}\nolimits _{k} \hat{q_k}. \end{aligned}$$

When using expressive, and unconstrained, classification models like DNNs for the task of calibration, it is possible to improve calibration at the cost of losing accuracy [19, 23]. This trade-off is avoided by restricting the calibration functions to be accuracy-preserving so that the class decision, left to the classifier, is decoupled from the confidence estimation of each decision.

In this work, we compare only accuracy-preserving methods and thus avoid altogether the trade-off often encountered in the calibration task: determining which calibrator is better, one that improves calibration more but degrades accuracy, or one that does not degrade accuracy but shows less improvement in calibration. This decision is often application-dependent, but it can be circumvented by using an accuracy-preserving method.

2.2.2 Temperature scaling

Temperature Scaling (TS) is probably the most widely used post-hoc calibration approach in the literature. It belongs to the family of accuracy-preserving methods. It scales the output logits by a temperature factor \(T_0\in \mathbb {R}^+\):

$$\begin{aligned} \hat{z} = \frac{z}{T_0}. \end{aligned}$$
(6)

This factor is obtained by minimizing the NLL on some calibration data consisting of predictions of the uncalibrated classifier. Since the NLL is a Proper Scoring Rule, TS is encouraged to improve calibration. The simplicity of the method presents several advantages: the resulting optimization problem is one-dimensional and cheap to solve. Moreover, it is an interpretable method by design, since its temperature factor conveys information about the level of over-confidence of the classifier. A high temperature \(T_0 > 1\) flattens the logits so the probability vectors approach the uniform distribution \(q = [1/K, 1/K,..., 1/K]\), thus relaxing the confidences and fixing over-confidence. On the other hand, a low temperature \(T_0 < 1\) sharpens the confidence values, moving the top-label predictions toward 1 and the others toward 0, hence fixing under-confidence.
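
As an illustration, TS can be fitted in a few lines of PyTorch. This is a sketch of the NLL objective applied to Eq. 6; parameterizing \(\log T_0\) is our choice to keep the temperature positive, not a detail prescribed by the method.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=100):
    """Minimize the NLL of z / T0 over a single scalar T0 (Eq. 6).
    logits: (N, K) calibration logits; labels: (N,) true class indices."""
    log_t = torch.zeros(1, requires_grad=True)       # T0 = exp(log_t) > 0
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def nll():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(nll)
    return log_t.exp().item()                        # learned temperature T0
```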

3 Methods

In this section, we first describe the Adaptive Temperature Scaling family and illustrate it by proposing some methods of our contribution. Then, we introduce other accuracy-preserving methods, not necessarily of the ATS family, with state-of-the-art performance that we use as benchmarks in the experiments.

3.1 The adaptive temperature scaling family

We use the term ATS family to denote the group of accuracy-preserving maps that generalizes Temperature Scaling and can be expressed as:

$$\begin{aligned} \hat{z} = \frac{z}{T(z)}, \end{aligned}$$
(7)

where \(T: \mathbb {R}^K \mapsto \mathbb {R}^+\) is the temperature function.

This family generalizes Temperature Scaling by making the temperature factor input-dependent. TS is limited to the temperature function \(T(z) = T_0\), where \(T_0\) is the scalar parameter of the model. Hence, TS implicitly assumes that a classifier will generate predictions with the same level of over-confidence independently of the specific sample being classified.

On the other hand, a general ATS method computes a different temperature factor for each prediction via the temperature function T(z). The computed factor for some z estimates the degree of over-confidence of the corresponding prediction \(q = \sigma _\mathrm{{{SM}}}(z)\). Hence, ATS methods acknowledge the possibility that a classifier’s over-confidence may depend on the samples being classified.

The input-dependent property was first exploited by [29] with their Local Temperature Scaling method. However, this approach relies on the classifier input x to estimate a temperature factor \(T_x = T(x)\). An ATS method estimates the factor based on the classifier output instead, \(T_x = T(z)\), thus further separating the calibration step from the original classification task. The former approach tries to learn for which inputs x, e.g., images, the classifier is likely overconfident. ATS, on the other hand, is independent of the classification task, requiring only access to precomputed predictions z. In other words, Local TS must be tailored for each classification task: for instance, if the input is audio one might use an RNN, whereas for images one would choose a CNN. But both will output a logit vector in a classification task; thus, the input space of ATS methods is always the logit domain, so they are more likely to generalize across classification tasks.

We acknowledge that this may reduce the potential expressive power of ATS, since z is a processed version of x. Nevertheless, we believe that such a constraint is not necessarily detrimental since, as we show in our experiments, the logit vector of a prediction already conveys information about its degree of miscalibration. Moreover, one advantage of post-hoc methods is the decoupling of the classification step from the calibration step, which is in some sense lost if the original classifier input is required for calibration.

3.2 Proposed methods

We introduce three different ATS methods based on simple temperature functions. For each method we provide a hypothesis of why it can be used for recalibration and how the corresponding temperature function can be interpreted. Before delving into the methods, we note that, to meet the positivity constraint on the temperature factor, we apply the softplus function to the outputs of ATS calibration models:

$$\begin{aligned} \sigma _{SP}(a) = \ln (1 + e^a). \end{aligned}$$
(8)

3.2.1 Linear temperature scaling

We call this method Linear Temperature Scaling (LTS) since it is based on a linear combination of the logit vector. Its temperature function is given by:

$$\begin{aligned} T_\mathrm{{{LTS}}}(z) = \sigma _{SP}(w^{L} z + b), \end{aligned}$$
(9)

where \(w^{L} \in \mathbb {R}^K\) and \(b \in \mathbb {R}\) are the learnable parameters of the model.

The weight vector \(w^L\) takes into account the score assigned to each class to determine the level of over-confidence. Hence, LTS can predict higher temperature factors for certain predicted classes than for others. The scalar parameter b allows LTS to recover the base TS by zeroing the \(w^L\) parameter.

We motivate this method with the following example: an uncalibrated classifier may make over-confident predictions for only certain classes. Since LTS weights each component of the logit vector to obtain the temperature factor, it should be able to raise (lower) the temperature by increasing (decreasing) the weight component \(w_i^L\) depending on whether the classifier is more (less) likely to make an over-confident prediction when predicting class i.

From this follows the interpretation of the method. After fitting LTS on a calibration set, the weight vector will point toward the direction of the highest degree of overconfidence in the logit space.
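
A sketch of the LTS calibrator as a PyTorch module follows. This is our own illustrative implementation of Eq. 9; the bias initialization, chosen so that the initial temperature is close to 1, is an assumption rather than part of the method definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LTS(nn.Module):
    """Linear Temperature Scaling: T(z) = softplus(w^L . z + b), Eq. 9."""
    def __init__(self, n_classes):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_classes))   # w^L
        self.b = nn.Parameter(torch.tensor(0.5413))     # softplus(0.5413) ~ 1

    def forward(self, z):
        t = F.softplus(z @ self.w + self.b)             # per-sample temperature T(z)
        return z / t.unsqueeze(-1)                      # calibrated logits
```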

3.2.2 Entropy-based temperature scaling

Motivated by the fact that the entropy of the predictive class distribution can be interpreted as the uncertainty of such prediction, we propose HTS. The temperature function of this method is given by:

$$\begin{aligned} T_\mathrm{{{HTS}}}(z) = \sigma _{SP}\left( w^H\log \overline{H}(z) + b \right) , \end{aligned}$$
(10)

where \(\overline{H}(z) = H(\sigma _\mathrm{{{SM}}}(z))/\log K\) is the normalized entropy, and \(w^H \in \mathbb {R}\) and \(b \in \mathbb {R}\) are the learnable parameters of the model. We normalize the entropy so that it is always upper-bounded by 1 irrespective of the number of classes. This allows us to generalize the interpretation of \(w^H\) between tasks with a different number of classes. We apply the logarithm to the entropy because, as we show later in the experiments, the temperature shows a linear trend with the logarithm of the entropy. We give b the same interpretation as in the previous model. The parameter \(w^H\) determines how much the uncertainty of the prediction vectors---i.e., the \(\log \overline{H}(z)\)---influences the temperature factor. The higher the magnitude of \(w^H\) the more variability we can expect in the computed temperature factors. On the other hand, a model with \(w^H \rightarrow 0\) will resemble the base TS.

We motivate the method with a hypothetical example. Suppose that we have a classifier that produces predictions with variable degrees of over-confidence. One way in which a prediction logit can convey information about its level of over-confidence is via its entropy. That is, for two predictions with the same predicted confidence, we may assume that the more uncertain of the two---i.e., the higher-entropy prediction---is more likely to be over-confident, since it reports the same confidence value despite its higher uncertainty.

This model makes a strong assumption about the level of over-confidence in a prediction, namely that it can be expressed as a simple linear function of the log-entropy. The resulting model is easy to train since the set of possible calibration functions, or hypothesis space, is constrained by the number of trainable parameters. However, its performance is completely conditioned on the assumption being met. We provide experiments validating the model and its assumptions in Sect. 4.
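
For concreteness, a corresponding PyTorch sketch of HTS (Eq. 10) is given below; the clamping constants that guard against \(\log 0\) and the bias initialization are implementation details of ours, not part of the method definition.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HTS(nn.Module):
    """Entropy-based Temperature Scaling: T(z) = softplus(w^H log H_bar(z) + b), Eq. 10."""
    def __init__(self):
        super().__init__()
        self.w_h = nn.Parameter(torch.zeros(1))              # w^H
        self.b = nn.Parameter(torch.tensor(0.5413))          # softplus(0.5413) ~ 1

    def forward(self, z):
        q = F.softmax(z, dim=-1)
        h = -(q * torch.log(q.clamp_min(1e-12))).sum(dim=-1)   # entropy H(q)
        h_bar = (h / math.log(z.shape[-1])).clamp_min(1e-12)   # normalized entropy
        t = F.softplus(self.w_h * torch.log(h_bar) + self.b)   # per-sample temperature
        return z / t.unsqueeze(-1)                              # calibrated logits
```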

3.2.3 Combined system

Finally we propose HnLTS, a model that combines the previous two with a single temperature function given by:

$$\begin{aligned} T_{HnLTS}(z) = \sigma _{SP}\left( w^{L} z + w^H\log \overline{H}(z) + b \right) , \end{aligned}$$
(11)

where \(w^{L} \in \mathbb {R}^K\), \(w^H \in \mathbb {R}\), and \(b \in \mathbb {R}\) are the learnable parameters to which we give the same interpretation as above.

The motivation behind this model is to increase the expressiveness of the system in a controlled way, to see how this affects its performance and training procedure compared to the simpler methods. The hypothesis space of this method is a combination of those of LTS and HTS, so it should be able to at least recover the solution of either one and achieve the same or better performance. However, we argue that the increased hypothesis space also makes the model more difficult to train, with higher data requirements, and this may impact its applicability depending on the task at hand, especially in data-scarce scenarios.
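
For completeness, a sketch of HnLTS (Eq. 11) that mirrors the two modules above, with the same illustrative conventions:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HnLTS(nn.Module):
    """Combined model: T(z) = softplus(w^L . z + w^H log H_bar(z) + b), Eq. 11."""
    def __init__(self, n_classes):
        super().__init__()
        self.w_l = nn.Parameter(torch.zeros(n_classes))
        self.w_h = nn.Parameter(torch.zeros(1))
        self.b = nn.Parameter(torch.tensor(0.5413))

    def forward(self, z):
        q = F.softmax(z, dim=-1)
        h = -(q * torch.log(q.clamp_min(1e-12))).sum(dim=-1)
        h_bar = (h / math.log(z.shape[-1])).clamp_min(1e-12)
        t = F.softplus(z @ self.w_l + self.w_h * torch.log(h_bar) + self.b)
        return z / t.unsqueeze(-1)
```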

3.3 Baseline methods

We now describe other accuracy-preserving methods already existing in the literature with state-of-the-art performance. Some of these, but not all of them, belong to the ATS family, as they can be expressed in the general form given by Eq. 7.

3.3.1 Parameterized temperature scaling

Parameterized Temperature Scaling (PTS) [15] is a specific instance of the ATS family in which the temperature function is constrained to be a neural network (NN). The input to the NN is the logit vector sorted in decreasing order of confidence, \(z^s\). Sorting the logit vector makes the model order-invariant [23], simplifying the hypothesis space at the cost of losing the possibility of discriminating between classes---i.e., it cannot consider the predicted ranking over the classes. PTS can be expressed as an ATS method with temperature function:

$$\begin{aligned} T_\mathrm{{{PTS}}}(z) = \textit{NN}(z^s), \end{aligned}$$
(12)

where NN is the function defined by the neural network.

Instead of optimizing the parameters of the NN to minimize some PSR, as other methods do, the authors propose to minimize an ECE-based loss given by:

$$\begin{aligned} L_\mathrm{{ECE}} = \sum _{i=1}^M \frac{|B_i|}{|\mathcal {D}_\mathrm{{{test}}}|}\,\left\Vert \mathrm{{acc}}(B_i) - \mathrm{{cof}}(B_i)\right\Vert _2, \end{aligned}$$
(13)

where \(B_i\), \(\mathrm{{cof}}(B_i)\), and \(\mathrm{{acc}}(B_i)\) are defined as in Sect. 2.1.1. During training, samples are re-partitioned into the bins \(B_i\) at each loss evaluation, since the confidences are re-scaled differently at each step.

In their experiments, the authors always use the same architecture, a Multi-Layer Perceptron (MLP) with two 5-unit hidden layers. They limit the input size of the network to the 10 highest confidence values whenever the number of classes is greater than 10. We use the same architecture in our experiments.
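
Based on this description, the PTS temperature network could be sketched as follows; the ReLU activations and the softplus output transform are assumptions on our part, since only the two 5-unit hidden layers and the sorted top-10 logit input are stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PTS(nn.Module):
    """Parameterized Temperature Scaling: T(z) = NN(z^s), Eq. 12.
    Use n_inputs = min(K, 10) for tasks with fewer than 10 classes."""
    def __init__(self, n_inputs=10, hidden=5):
        super().__init__()
        self.n_inputs = n_inputs
        self.net = nn.Sequential(
            nn.Linear(n_inputs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z):
        z_sorted, _ = torch.sort(z, dim=-1, descending=True)  # order-invariant input z^s
        z_top = z_sorted[..., :self.n_inputs]                  # keep the largest logits
        t = F.softplus(self.net(z_top)).squeeze(-1)            # positive temperature
        return z / t.unsqueeze(-1)
```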

3.3.2 Bin-wise temperature scaling

Bin-Wise Temperature Scaling (BTS) [39] is a histogram-based method that applies a different temperature factor to each bin of the histogram. First, test samples are partitioned into N bins according to their top-label confidence. The authors force a high-confidence bin that ranges from 0.999 to 1. The samples with predicted confidence below 0.999 are partitioned into the other \(N-1\) intervals such that each bin contains the same number of samples.

This method can also be included in the ATS family. The temperature function in this case is just a look-up table that assigns the corresponding temperature factor to the input confidence value.

3.3.3 Ensemble temperature scaling

Ensemble Temperature Scaling (ETS) [22] obtains a new logit vector as a convex combination of the uncalibrated vector, a maximum entropy logit vector, and the temperature-scaled vector:

$$\begin{aligned} \hat{z}&= w_1\frac{z}{T_\mathrm{{{ETS}}}} + w_2 z + w_3 \frac{1}{K}, \\&\text {subject to} \, w_1+w_2+w_3 = 1; w_i \ge 0 \nonumber \end{aligned}$$
(14)

where \(w_1\), \(w_2\), \(w_3\) are the learnable weights of the convex combination and \(T_\mathrm{{{ETS}}}\) is the temperature parameter of the TS component. All the parameters are optimized en bloc to minimize some PSR.

This method is also an extension of the standard TS; however, it does not belong to the ATS family. This can be easily verified by noting that ATS methods compute a single scalar temperature factor for a given logit vector, which applies equally to every entry of that vector. ETS, on the other hand, effectively scales each component of the logit vector by a different factor.

4 Experiments

We present two sets of experiments. First, we report a study of the proposed methods that motivates their design and presents ways in which calibration performance improves with model complexity. With it, we give evidence that the logit vector conveys information about its degree of over-confidence and motivate the design of new calibration methods that take this into account. Then, we compare our methods with other state-of-the-art accuracy-preserving calibration techniques in different dataset-size settings to assess their robustness to data scarcity.

Fig. 2 Mean predicted temperature (blue) against optimum temperature factor (green) for each class on the test set

4.1 Setup

4.1.1 Datasets and tasks

We refer to model-dataset pairs as calibration tasks. So a task is composed of the predictions of a model, for instance a ResNet-101 [40], on a specific dataset, like CIFAR-100 [41]. Every dataset is partitioned into three splits: train, calibration, and test. The model of each task is trained using the train set and then it is used to generate predictions on the calibration and test sets. We evaluate a calibration method on a certain task using the following procedure: First we fit the calibration method using the predictions on the calibration set. Then we apply it to the test set predictions and compute metrics over these.

4.1.2 Training details

We use NLL as the optimization objective to fit calibrators. Additionally, in all tasks, we fit a second version of the PTS method, minimizing the ECE-based loss instead (see Sect. 3.3). All methods except TS and ETS are implemented in Pytorch [17] and optimized using Stochastic Gradient Descent (SGD) with an initial learning rate of \(10^{-4}\), Nesterov momentum [42] of 0.9, and a batch size of 1000. We reduce the learning rate on plateau by a factor of 10 until it reaches \(10^{-7}\), at which point we consider that the algorithm has converged and we stop training. These are the default hyperparameters suggested by the authors and provide good convergence across all calibration tasks in our experiments. The standard TS is optimized with SciPy [43] and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. To calibrate with ETS we use the code released by its authors [22].
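
For reference, this optimizer and scheduler configuration corresponds in PyTorch to something like the following sketch; the patience value and the convergence check are illustrative details of ours, not specified above.

```python
import torch

def make_optimizer(calibrator):
    """SGD with Nesterov momentum and reduce-on-plateau scheduling, as described above.
    Call sched.step(loss) once per epoch with the training loss."""
    opt = torch.optim.SGD(calibrator.parameters(), lr=1e-4,
                          momentum=0.9, nesterov=True)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1, patience=10)
    return opt, sched

def converged(opt, min_lr=1e-7):
    """Stop once every parameter group's learning rate has dropped below 1e-7."""
    return all(group["lr"] < min_lr for group in opt.param_groups)
```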

4.2 Analysis of the ATS methods

For the first round of experiments, we calibrate a ResNet-50 [40] trained on CIFAR-10 [41] with the proposed interpretable ATS methods, Entropy-based TS (HTS) and Linear TS (LTS), and discuss each separately. Then, we present a study of the relation between the number of parameters of a model and its data requirements for training. We calibrate a DenseNet-121 [44] trained on CIFAR-100 [41] with LTS, HTS, and HnLTS using a varying number of calibration samples.

4.2.1 Linear TS: Introducing class dependence

With this experiment, we aim to illustrate the example that we gave to motivate the LTS method: that LTS can adapt to a classifier that makes more or less over-confident predictions depending on which class it predicts as correct.

First, we train a ResNet-50 [40] on the CIFAR-10 train dataset. Then, we use the calibration set to calibrate the model with both TS and the LTS method.

To check our hypothesis, we divide the test set samples according to their true class:

$$\begin{aligned} \mathcal {D}_{k} = \{(x, y) \in \mathcal {D}_\mathrm{{{test}}}| \textrm{arg max}(y) = k\}, \end{aligned}$$

and compute for each subset \(\mathcal {D}_k\) the optimum temperature factor \(T_k\), which is obtained by optimizing TS directly on \(\mathcal {D}_k\).

Then, we use the LTS model optimized on the calibration set to compute a temperature factor for every test prediction, group them according to their true class, and take the average for every group:

$$\begin{aligned} \hat{T}_k = \frac{1}{|\mathcal {D}_k|}\sum _{(x, y) \in \mathcal {D}_k} T_\mathrm{{{LTS}}}(f(x)). \end{aligned}$$

Finally, we plot in Fig. 2 \(\hat{T}_k\) against the optimum temperature \(T_k\). For reference, we include the TS temperature factor learned on the calibration set (dashed orange line).

From Fig. 2 we notice that the classifier does produce more over-confident predictions for some classes than for others, even in a curated and well-balanced dataset such as CIFAR-10. We can expect this effect to be even more present in real-life applications in which the prevalence of classes may vary and some distribution mismatch between development and production data can be expected. LTS exploits this difference between classes and manages to adapt the temperature factor in each subset closely matching the optimum.

4.2.2 Entropy-based TS: Leveraging uncertainty of predictions

Our motivation for the HTS method is that the level of over-confidence in a prediction is related to the entropy of that prediction. If our hypothesis is correct, we can expect, for the same value of confidence in the predicted class, higher-entropy predictions to be more over-confident on average. So, we might expect higher temperature factors for higher-entropy predictions. The relation between overconfidence and entropy has already been explored and exploited in [38]. However, in that case, the authors penalize low-entropy predictions as over-confident. Our hypothesis does not necessarily oppose theirs; we suggest that the inverse relation exists for fixed confidence values, while theirs accounts for a general relationship across confidence values. The latter relation is driven by the fact that high values of confidence for a given class imply low entropy in the prediction.

Fig. 3 Temperature function of HTS fitted using 200 (light blue) and 10000 (dark blue) calibration samples, and the optimal temperature on the test set (green)

Fig. 4 Temperature factors of PTS for test samples, fitted using 200 (light orange) and 10000 (dark orange) calibration samples, and the optimal temperature on the test set (green)

In Fig. 3 we depict the temperature function learned by HTS on the calibration set. We train two models, one with the full calibration set, plotted in a darker shade, and the other using a random subset of 200 samples. We also plot the optimum temperature factor estimated on the test set for different ranges of normalized entropy. We partition the log-domain of the normalized entropy into 15 equally spaced bins and divide the test samples according to this binning scheme, \(\mathcal {D}_\mathrm{{{test}}} = \{\mathcal {D}_1,..., \mathcal {D}_{15}\}\). For each set \(\mathcal {D}_i\), we estimate the optimum temperature factor \(T_i\) given by TS in the same way as in the first experiment:

$$\begin{aligned} {T}_i = TS(\mathcal {D}_i). \end{aligned}$$

Note how for values of \(\overline{H} \ge 10^{-5}\) the optimum temperature increases linearly with the log-entropy, and that this behavior is captured by the temperature function learnt by HTS. We also notice that when there is plenty of calibration data, the model captures the linear trend better, but when data is limited the slope flattens---i.e., \(w^{H} \to 0\)---approaching the standard TS.

We also fit the PTS method on the same subsets as HTS, resulting in two calibration methods: PTS trained with 200 samples and PTS trained with the full 10000 calibration samples. Then we plot in Fig. 4 the temperature factor that each PTS assigns to each prediction on the test set. With this plot, we aim to see whether a very expressive method like PTS learns any relation between the entropy of a prediction and its temperature factor. Note how the model trained with full data produces temperature factors that resemble the linear trend shown by HTS, but when data is scarce PTS fails to capture any relation and produces highly variable factors.

We find that, at least in this particular task, there exists some positive relation between the entropy of the prediction and its level of over-confidence. Figure 3 shows that a linear function is a fair approximation to the relation between entropy and temperature and that HTS manages to capture it even in the face of low data. We can check in Table 1 that PTS and HTS are highly correlated when both methods are trained using the whole calibration set. Moreover, if HTS is trained using only a small subset of the data (\(N=200\)), this method still produces temperature factors highly correlated with those issued by PTS on a large set. On the other hand, the temperature factors given by PTS trained with low data (\(N=200\)) are much less correlated with those given by the full-data-trained PTS. This suggests that the behavior that PTS has to learn from abundant data is already ingrained in the inductive bias of HTS.

Fig. 5 Temperature factor computed by PTS against temperature factor computed by HTS for test set predictions. The black dotted line represents a perfect one-to-one relation

Table 1 Sample Pearson correlation coefficient between the temperatures given by HTS and PTS temperature functions on the test set.

In Fig. 4 we show that a much more expressive method like PTS also captures this linear relationship when given enough data. However, in the face of limited data, it fails to do so. Moreover, in Fig. 5 we plot, for all samples in the test set, the temperature factors given by PTS against those given by HTS, both methods fitted using all samples. The plot shows that when data is plentiful the function learnt by both methods is reasonably similar, suggesting that the function space of HTS contains well-performing solutions similar to those learnt by PTS despite being much more constrained.

4.2.3 Performance degradation with model complexity

In order to validate our hypothesis that more complex methods require more data and are inherently more difficult to fit, we conduct the following experiment. We use a DenseNet-121 [44] trained on the CIFAR-100 train dataset to generate predictions on a calibration set of 10000 samples, disjoint from the training set. From it we sample two additional subsets with 200 and 1000 samples respectively, giving in total three calibration sets of increasing size. We use each calibration set to fit the TS, LTS, HTS and HnLTS methods, giving a total of 12 different combinations.

We compare LTS, a highly parameterized and hence more expressive model, with HTS, a 2-parameter model with a strong inductive bias. We also include HnLTS, which acts as a control in the following sense: if there is a difference in performance between LTS and HTS in low-data scenarios, one could argue that LTS is not expressive enough and that its complexity does not hinder the training. Were that the case, HnLTS should be able to match the HTS performance, since its function space includes all solutions of both models, HTS and LTS. So we expect HnLTS to fail in the same low-data scenarios as LTS because it is highly parameterized. The TS method is included as a baseline.

We evaluate all of them on the CIFAR-100 test dataset and report performance in terms of NLL and ECE using \(M=50\) bins in Tables 2 and 3, respectively. For each scenario, we mark the score (either NLL or ECE) of the best-performing model in bold.

Table 2 NLL for different calibration models
Table 3 ECE (M=50) for different calibration models

We first note that when data is plentiful (\(N=10000\)) all calibration models achieve similar performance, both in terms of NLL and ECE. But as we reduce the number of calibration samples (\(N=1000\) and \(N=200\)), the performance of both LTS and HnLTS is greatly impacted. On the other hand, it is remarkable how, with only \(N=200\) calibration samples, both simple models maintain a similar performance.

4.3 Benchmarking

In this section, we compare the performance of the proposed ATS methods (LTS, HTS, and HnLTS) with state-of-the-art accuracy-preserving methods (TS, ETS, BTS, and PTS). We fit two versions of PTS: one trained to minimize the NLL, the calibration objective we use to train every method, and a second version optimizing the ECE-based loss instead, as reported in [15] (see Sect. 3.3). We refer to the former as PTS and the latter as PTSe, where the ‘e’ stands for the ECE-based objective.

All experiments are run 50 times with different random initializations and the results are averaged across runs. For experiments in which we subsample the calibration set to simulate data-scarcity scenarios, we sample one subset at each run and use that subset to train all calibration methods. So, in each of the 50 runs, all the calibration methods see the same N-sized calibration set, but this subset differs across runs. The training procedure is laid out in Algorithm 2.

Algorithm 2 Procedure to evaluate calibration models on a model-dataset task.

4.3.1 Results

Fig. 6 Average results for CIFAR-10 (left) and CIFAR-100 (right) tasks of all calibration methods in terms of ECE (top) and NLL (bottom), normalized by the performance of TS, namely \(\overline{ECE}\) and \(\overline{NLL}\)

For the sake of space and simplicity, we depict results for each dataset averaged across models---e.g., the average ECE of HTS on all CIFAR-10 tasks. We defer detailed results to Appendix 1. Results are shown in Fig. 6. We normalize each metric by the performance of TS, as we consider it the main benchmark. The performance of the base TS is computed in the same conditions as the evaluated calibrator---i.e., trained and evaluated using the same calibration and test sets. We report performance in terms of normalized ECE and normalized NLL, namely \(\overline{ECE}\) and \(\overline{NLL}\). For each method, we plot five markers, whose size increases with the size of the calibration set. From smallest to biggest these are \(N = (200, 500, 1000, 5000, 10000)\). The y-axis position of each marker indicates the mean value across tasks, where each task is a different calibrated NN architecture.

We first point out that almost all models outperform the simple TS when there is enough data (big markers), although, on average, there are no big differences between models. However, when data is scarce, all the highly parameterized models show severe performance degradation, and only ETS and HTS provide consistent performance. Moreover, HTS provides better results in most of the individual tasks, while ETS barely outperforms the baseline TS.

Also, it is worth noting the difference between datasets. In CIFAR-100, with its larger number of classes, we can see a greater advantage in using calibration methods more complex than TS. On the other hand, the best methods barely outperform TS in the CIFAR-10 tasks. This suggests that the problem of calibration may grow more complex with the number of classes, although the number of datasets included in our experiments is limited and more experiments are required to validate this observation.

Interestingly, HnLTS fails in low-data scenarios, even though it could, in theory, recover the HTS solution by zeroing the \(w^L\) parameter. This suggests that increasing expressiveness can do more harm than good by complicating the training objective.

5 Conclusions

We have shown that post-hoc calibration of DNNs can benefit from more expressive models than the widely used Temperature Scaling, especially in tasks with a high number of classes. For instance, simply adjusting the temperature factor of TS with a linear combination of the logit prediction improves calibration by taking into account the score assigned to each class.

However, more complex models require higher amounts of data to find a good-performing solution. This poses a trade-off between the complexity of the calibration model and the available data to train the model. There are many real-world tasks where data for re-calibration is limited and hinders the calibration with a complex model.

By analyzing the calibration functions learned by expressive models on plenty of data, we can design simpler models with a strong inductive bias toward similar calibration functions. In this work, we have introduced HTS, a 2-parameter model that scales predictions according to their entropy. The temperature factors estimated by PTS, a much more expressive model, follow the same linear relation with the predictive entropy that HTS implicitly assumes. HTS shows calibration performance comparable to that of more expressive methods under ideal data conditions. However, the application of those other methods is limited to conditions where data is plentiful, and one has to resort to simpler methods like TS when data is limited. HTS, on the other hand, maintains good performance across all ranges of data availability, making it a good default for calibration. Moreover, an important feature of the model is that it is interpretable, characterizing the link between a prediction’s uncertainty and its over-confidence.

With this work, we motivate the study of expressive methods as a way to design practical models with a suitable inductive bias. As a first approach, we propose to use a hand-designed low-parameter model to achieve this bias. This model achieves comparable performance to other state-of-the-art methods while being robust to low data scenarios. In future work, we plan to try other forms of inducing the desired bias, for instance, via the prior specification in a Bayesian inference setting. This option may allow training higher capacity models while still being robust to data scarcity.