Uncertainty quantification in neural network classifiers—A local linear approach ✩

Classifiers based on neural networks (nn) often lack a measure of uncertainty in the predicted class. We propose a method to estimate the probability mass function (pmf) of the different classes, as well as the covariance of the estimated pmf. First, a local linear approach is used during the training phase to recursively compute the covariance of the parameters in the nn. Second, in the classification phase, another local linear approach is used to propagate the covariance of the learned nn parameters to the uncertainty in the output of the last layer of the nn. This allows for an efficient Monte Carlo (mc) approach for: (i) estimating the pmf; (ii) calculating the covariance of the estimated pmf; and (iii) proper risk assessment and fusion of multiple classifiers. Two classical image classification tasks, i.e., mnist and cifar10, are used to demonstrate the efficiency of the proposed method.


Introduction
In this paper, the problem of quantifying the uncertainty in the predictions from a neural network (nn) is studied. The uncertainty in the prediction stems from three different sources: errors caused by the optimization algorithm that is used to train the nn, errors in the data (aleatoric uncertainty), and errors in the model (epistemic uncertainty). In this paper, the focus is on uncertainty from the two latter sources.
In numerous applications, e.g., image recognition (Krizhevsky, Sutskever, & Hinton, 2012) and various control tasks (Karlsson & Hendeby, 2021; Li et al., 2017), nns have shown high performance. Despite their high performance, the use of nns in safety-critical applications is limited (Grigorescu, Trasnea, Cocias, & Macesanu, 2020; Paleyes, Urma, & Lawrence, 2020). This is partly a consequence of the fact that their predictions usually do not come with any measure of certainty, which is crucial to have in a decision-making process to know to what degree the prediction can be trusted. In particular, this need was highlighted in the fatal Uber accident in 2018, where the lack of reliable classifications of surrounding objects played a role in the development of events that eventually led to the accident (NTSB, 2018). Moreover, the quantified measure of uncertainty can be used to detect and remove outliers in the data, and without knowledge about the uncertainty it is not possible to fuse the prediction from the nn with information from other sensors.

✩ This work is supported by Sweden's innovation agency, Vinnova, through project iQDeep (project number 2018-02700). Funding was also provided by the Swedish Research Council for the project Scalable Kalman Filters. The material in this paper was presented at the 25th IEEE International Conference on Information Fusion (FUSION), July 4-7, 2022, Linköping, Sweden. This paper was recommended for publication in revised form by Associate Editor Dario Piga under the direction of Editor Alessandro Chiuso.
The problem of quantifying the uncertainty in the prediction of nns has lately gained increasing attention, and numerous methods to calculate the uncertainty have been suggested (D'Amour et al., 2020; Ghahramani, 2015; Lin, Clark, Trigoni, & Roberts, 2022; Ovadia et al., 2019; Patel & Waslander, 2022). For a survey of methods, see Gawlikowski et al. (2023). The methods suggested in the literature can broadly be divided into two categories. One category is based on creating an ensemble of predictions from which the uncertainty in the prediction is computed (Ayhan & Berens, 2018; Gal & Ghahramani, 2016; Ilg et al., 2018; Lakshminarayanan, Pritzel, & Blundell, 2017; Maddox, Izmailov, Garipov, Vetrov, & Wilson, 2019; Osawa et al., 2019; Teye, Azizpour, & Smith, 2018). In the other category, the nn structure is extended and the nn is trained to learn its own uncertainty (Blundell, Cornebise, Kavukcuoglu, & Wierstra, 2015; Charpentier, Zügner, & Günnemann, 2020; Gustafsson, Danelljan, Bhat, & Schön, 2020; Izmailov, Nicholson, Lotfi, & Wilson, 2021; Kendall & Gal, 2017). Concerning the first category, it has for example been suggested by Lakshminarayanan et al. (2017) to create an ensemble by training multiple nns, from whose predictions the uncertainty is computed. Since training a single nn is often computationally expensive, this method has high computational complexity. In practice, it is only feasible from a computational perspective to create small ensembles. To decrease the computational complexity, it was suggested in Gal and Ghahramani (2016) and Teye et al. (2018) to use already existing regularization techniques to sample values of the parameters of the nn to create these ensembles. Another method to create ensembles is to sample values of the parameters during the last part of the training phase (Maddox et al., 2019; Osawa et al., 2019). So-called test-time data augmentation methods have also been suggested, where the test data is perturbed to create an ensemble of predictions (Ayhan & Berens, 2018). Even though the methods in Ayhan and Berens (2018), Gal and Ghahramani (2016), Maddox et al. (2019), Osawa et al. (2019) and Teye et al. (2018) do not need multiple models to be trained, they require multiple forward passes. Furthermore, they require specially tailored training algorithms and carefully constructed structures of the nn.
Another limitation of methods relying on creating ensembles is that they have trouble representing the uncertainty caused by the bias in the prediction from a model mismatch. The bias can be caused by an insufficiently flexible model, which could be a result of too high regularization or too low model order. The problem can be solved by nns from the second category, i.e., where the structure of the nn is extended such that it learns its own uncertainty in the prediction. However, this requires a more intricate nn structure with tailored loss functions (Blundell et al., 2015; Charpentier et al., 2020; Gustafsson et al., 2020; Kendall & Gal, 2017). As a consequence, the training becomes more complex and computationally expensive. It also makes the methods sensitive to errors caused by the training algorithm, which are not possible to learn. Furthermore, there is also a need for more data to train complex model structures.
In this paper, we address the two limitations of the aforementioned methods using classical local approximations from the area of system identification (Ljung, 1999), based on a linearization of the model. The system identification methodology works well for static systems, of which nns are an example (Avant & Morgansen, 2023; Shen, Varshney, & Zhu, 2014). We will refer to this as the delta method (Liero & Zwanzig, 2011; Malmström, 2021; Malmström, Skog, Axehill, & Gustafsson, 2022). For regression tasks, the delta method has previously been used to quantify the uncertainty in the prediction of nns, see e.g., Deng, Zhou, and Zhu (2022), Hwang and Ding (1997), Immer, Korzepa, and Bauer (2021), Liero and Zwanzig (2011) and Malmström (2021), and it was extended to classification tasks in Malmström et al. (2022).

Problem formulation and contributions
Consider the problem of learning a classifier from the training data set

$\mathcal{T} = \{(x_n, y_n)\}_{n=1}^{N}.$ (1)

Here $y_n \in \{1, \ldots, M\}$ is the class label and $x_n \in \mathbb{R}^{n_x}$ is the input data of size $n_x$, e.g., pixels in an image. From a statistical point of view, the learning of the classifier can be seen as a system identification problem where a model $f(x; \theta)$ that predicts the conditional probability mass function (pmf) $p(y|x)$ of a categorical distribution is to be identified. That is, the probability for $y = m$ given the input $x$ is modeled as

$p(y = m|x; \theta) = f_m(x; \theta).$ (2)

Here $\theta \in \mathbb{R}^{n_\theta}$ denotes the $n_\theta$-dimensional parameter vector that parameterizes the model. Further, the subscript $m$ denotes the $m$'th element of the vector-valued output of the function. To ensure that the model $f(x; \theta)$ fulfills the properties associated with a pmf, i.e., $f_m(x; \theta) \geq 0 \ \forall m$ and $\sum_m f_m(x; \theta) = 1$, it is typically structured as

$f(x; \theta) = \mathrm{softmax}(g(x; \theta)),$ (3a)
$\mathrm{softmax}_m(z) = \frac{e^{z_m}}{\sum_{l=1}^{M} e^{z_l}},$ (3b)

where $g(x; \theta)$ describes the underlying model of the classifier. For example, if $g(x; \theta)$ is given by an nn, the parameters $\theta$ consist of all the weights and biases in the nn. In the case $g_m(x; \theta) = \theta^\top \varphi_m(x)$, where $\varphi_m(x)$ denotes a, possibly nonlinear, transformation of the input $x$, the model in (3a) becomes a standard multinomial logistic regression model (Baggio, Carè, Scampicchio, & Pillonetto, 2022; Lindholm, Wahlström, Lindsten, & Schön, 2022). Furthermore, if the transformation $\varphi_m(x)$ is chosen randomly, the model becomes similar to the one used in extreme learning machine classifiers (Huang, Zhu, & Siew, 2006).
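As a concrete illustration, the multinomial logistic regression special case of (3) can be sketched in a few lines. This is a minimal sketch: the dimensions and feature map are invented for illustration, and the per-class weight parameterization (a weight matrix acting on a shared feature vector) is an equivalent, common arrangement of the single parameter vector θ.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: maps logits z to a valid pmf, cf. (3b)."""
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def f(x, theta, phi):
    """Model structure in (3a): logits g(x; theta) followed by softmax.
    Here theta is an (M, d) matrix of class weights and phi a feature map."""
    g = theta @ phi(x)
    return softmax(g)

# Toy example: M = 3 classes, trivial features phi(x) = x.
rng = np.random.default_rng(0)
theta = rng.standard_normal((3, 4))
p = f(rng.standard_normal(4), theta, lambda x: x)
# The output satisfies the pmf properties: nonnegative and sums to one.
assert p.shape == (3,) and abs(p.sum() - 1.0) < 1e-12 and (p >= 0).all()
```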

Parameter estimation
For most nns, the number of model parameters $n_\theta > N$, and the model parameters $\theta$ cannot be uniquely identified from the training data $\mathcal{T}$ without some regularization or prior information regarding the parameters. Let $p(\theta)$ denote the prior for the model parameters. The maximum a posteriori estimate of the model parameters is then given by

$\hat{\theta}_N = \arg\max_\theta \, p(\theta|\mathcal{T}) = \arg\max_\theta \, \big(\log p(\mathcal{T}|\theta) + \log p(\theta)\big),$ (4)

where $p(\theta|\mathcal{T})$ denotes the a posteriori distribution of the parameters and

$\log p(\mathcal{T}|\theta) = \sum_{n=1}^{N} \log f_{y_n}(x_n; \theta)$ (5)

denotes the cross-entropy likelihood function (Lindholm et al., 2022). Here $y_n$ is used as an index operator for the subscript $m$ of $f_m(x; \theta)$.
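The map objective in (4)-(5) is the cross-entropy likelihood plus a Gaussian-prior (l2) penalty. A minimal sketch, where the toy linear model and dimensions are invented for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def neg_log_posterior(theta, X, y, P0_inv, g):
    """Negative of the map objective in (4)-(5): cross-entropy
    log-likelihood plus the log of a zero-mean Gaussian prior."""
    ll = sum(np.log(softmax(g(x, theta))[yi]) for x, yi in zip(X, y))
    log_prior = -0.5 * theta @ P0_inv @ theta   # up to a constant
    return -(ll + log_prior)

# Toy linear model g(x; theta) with theta = vec of an (M, d) weight matrix.
rng = np.random.default_rng(0)
M, d = 3, 2
g = lambda x, th: th.reshape(M, d) @ x
X = rng.standard_normal((5, d))
y = rng.integers(0, M, 5)
J = neg_log_posterior(np.zeros(M * d), X, y, np.eye(M * d), g)
# With theta = 0 the pmf is uniform, so the loss equals 5 * log(3).
assert abs(J - 5 * np.log(3)) < 1e-12
```

In practice this objective would be minimized by a stochastic gradient method rather than evaluated directly, but the sketch shows the two terms that (4) balances.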

Prediction and classification
Once the classifier has been learned, i.e., a parameter estimate $\hat{\theta}_N$ has been computed, then for a new input data point $x^\star$ the pmf can be predicted as

$p(y^\star = m|x^\star; \hat{\theta}_N) = f_m(x^\star; \hat{\theta}_N),$ (6)

and the most likely class can be found as $\hat{y}^\star = \arg\max_m f_m(x^\star; \hat{\theta}_N)$. Note that the full pmf estimate $f(x; \hat{\theta}_N)$ is needed both for temporal fusion using several inputs from the same class and for fusion over different classifiers. Furthermore, even small probabilities can pose a large risk, e.g., there might be a pedestrian in front of a car even if another, harmless object is more likely according to the classifier. Hence, it is important that the prediction $p(y^\star = m|x^\star; \hat{\theta}_N)$ is accurate. However, it is well known that, due to, among other things, uncertainties in the parameter estimate $\hat{\theta}_N$, the disagreement between the true and estimated pmf may be significant. Therefore, methods to calibrate the prediction $p(y^\star|x^\star; \hat{\theta}_N)$ such that it better matches $p(y^\star|x^\star)$ have been developed.

Temperature scaling
One of the most commonly used methods to calibrate the predicted pmf is called temperature scaling (Guo, Pleiss, Sun, & Weinberger, 2017). In temperature scaling, $g(x; \theta)$ is scaled by a scalar quantity $T$ before the normalization by the softmax operator. With a slight abuse of notation, introduce

$f_m(x; \theta, T) = \frac{e^{g_m(x;\theta)/T}}{\sum_{l=1}^{M} e^{g_l(x;\theta)/T}}.$ (7)

Using the temperature scaling parameter $T$, the variations between the components (classes) in the predicted pmf can be enhanced or reduced. When $T \to 0$, then $f(x^\star; \hat{\theta}_N, T) \to \vec{e}_i$, where $\vec{e}_i$ denotes the $i$'th standard basis vector, thereby indicating that the input $x^\star$ with total certainty belongs to class $i$. Similarly, when $T \to \infty$, then $f_m(x^\star; \hat{\theta}_N, T) \to 1/M \ \forall m$, thereby indicating that the input $x^\star$ is equally probable to belong to any of the classes. Noteworthy is that the temperature scaling is typically done after the parameters $\theta$ have been estimated, with $T$ estimated using validation data. For notational brevity, the dependency on the temperature scaling parameter $T$ will only be explicitly stated when temperature scaling is considered.
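Temperature scaling amounts to dividing the logits by T before the softmax, as in (7). A small sketch of the two limiting behaviors described above, with made-up logits:

```python
import numpy as np

def softmax_T(g, T=1.0):
    """Temperature-scaled softmax, cf. (7): divide the logits by T
    before normalization."""
    z = g / T
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

g = np.array([2.0, 1.0, 0.0])          # hypothetical logits
p_sharp = softmax_T(g, T=0.1)          # T -> 0: approaches a one-hot vector
p_flat = softmax_T(g, T=100.0)         # T -> inf: approaches uniform 1/M
assert p_sharp.argmax() == 0 and p_sharp.max() > 0.99
assert np.allclose(p_flat, 1 / 3, atol=0.01)
```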

Marginalization of parameter uncertainties
A more theoretically sound approach to take the uncertainties in the parameter estimate into account is via marginalization of the pmf with respect to the parameter distribution. That is, an estimate of the pmf and its covariance are calculated as

$\hat{p}(y^\star = m|x^\star) = \int f_m(x^\star; \theta)\, p(\theta|\mathcal{T})\, d\theta,$ (8a)
$P^f = \int \big(f(x^\star; \theta) - \hat{p}(y^\star|x^\star)\big)(\cdot)^\top p(\theta|\mathcal{T})\, d\theta.$ (8b)

From hereon, $(x)(\cdot)^\top$ is used as a shorthand notation for $xx^\top$. The integral in (8a) is generally intractable, but can be approximated by Monte Carlo (mc) sampling as

$\hat{p}(y^\star = m|x^\star) \approx \frac{1}{K}\sum_{k=1}^{K} f_m(x^\star; \theta^{(k)}), \quad \theta^{(k)} \sim p(\theta|\mathcal{T}),$ (9a)
$P^f \approx \frac{1}{K}\sum_{k=1}^{K} \big(f(x^\star; \theta^{(k)}) - \hat{p}(y^\star|x^\star)\big)(\cdot)^\top.$ (9b)

Here $K$ denotes the number of samples used in the mc sampling.

Challenges and contributions
To realize the mc scheme in (9), the posterior parameter distribution $p(\theta|\mathcal{T})$ must be computed and samples drawn from this high-dimensional distribution. Our contributions are: (i) a local linearization approach that leads to a recursive algorithm of low complexity to compute an approximation of $p(\theta|\mathcal{T})$ during the training phase; (ii) a second local linearization approach to reduce the sampling space from an $n_\theta$-dimensional to an $M$-dimensional space in the prediction phase; and, as a by-product, (iii) a low-complexity method for risk assessment and information fusion.

Posterior parameter distribution
Next, a local linear approach to compute an approximation of $p(\theta|\mathcal{T})$ during the training phase is presented.

Laplace approximation
Assume the prior distribution for the model parameters to be normally distributed, $p(\theta) = \mathcal{N}(\theta; 0, P_0)$, i.e., l2 regularization is used. Then a Laplace approximation of the posterior distribution $p(\theta|\mathcal{T})$ yields (Bishop, 2006)

$p(\theta|\mathcal{T}) \approx \mathcal{N}(\theta; \hat{\theta}_N, P^\theta_N),$ (10a)
$(P^\theta_N)^{-1} = -\nabla^2_\theta \log p(\theta|\mathcal{T})\big|_{\theta = \hat{\theta}_N}.$ (10b)

That is, the posterior distribution is approximated by a normal distribution with a mean located at the maximum a posteriori estimate and a covariance dependent upon the shape of the likelihood function in the vicinity of the estimate. The accuracy of the approximation will depend upon the amount of information in the training data $\mathcal{T}$. A weakness of the Laplace approximation is that it is local: it assumes $\hat{\theta}_N$ to be a local maximum of the posterior and can hence miss global trends in the posterior distribution (Bishop, 2006).

Asymptotic distribution
According to the Bernstein-von Mises theorem (Johnstone, 2010), if the true model belongs to the considered model set, the maximum a posteriori estimate $\hat{\theta}_N$ converges in distribution to

$\hat{\theta}_N \sim \mathcal{N}\big(\theta^*, \mathcal{I}_\theta^{-1}(\theta^*)\big)$ (11)

when the information in the training data $\mathcal{T}$ tends to infinity. Here, $\theta^*$ denotes the true parameters and

$\mathcal{I}_\theta(\theta) = \mathbb{E}\big[\nabla_\theta \log p(y|x;\theta)\, \big(\nabla_\theta \log p(y|x;\theta)\big)^\top\big]$ (12)

is the Fisher information matrix. Given the likelihood function in (5), the Fisher matrix becomes

$\mathcal{I}_\theta(\theta) = \sum_{n=1}^{N} \nabla_\theta g(x_n;\theta)\big(\mathrm{diag}(f(x_n;\theta)) - f(x_n;\theta) f(x_n;\theta)^\top\big)\big(\nabla_\theta g(x_n;\theta)\big)^\top.$ (13)

See derivations in the Appendix.

Recursive computation of covariance
To compute the parameter covariance $P^\theta_N$ defined by (10b), the Hessian matrix of the log-likelihood (ll) must be calculated and then inverted. This has a complexity of $\mathcal{O}(NMn_\theta^2 + n_\theta^3)$, which for large $n_\theta$ in combination with large $N$ can become intractable. However, by approximating the Hessian matrix of the ll with the Fisher information matrix using (13), the computation can be done recursively and with a complexity of $\mathcal{O}(NMn_\theta^2 + NM^3)$ (Malmström et al., 2022). Note that $\mathcal{I}_\theta$ in (13) can be written in a quadratic form $\mathcal{I}_\theta = \sum_{n=1}^N \sum_{m=1}^M u_{mn} u_{mn}^\top$. As a result, the recursive update can be given as

$P^\theta_n = P^\theta_{n-1} - P^\theta_{n-1} U_n \big(I_M + U_n^\top P^\theta_{n-1} U_n\big)^{-1} U_n^\top P^\theta_{n-1}, \quad U_n = [u_{1n}, \ldots, u_{Mn}],$ (14)

where $I_r$ denotes the identity matrix of size $r$. Here $P^\theta_n$ is the parameter covariance for $n$ measurements, which is initialized as $P^\theta_0 = P_0$. To compute $u_{mn} \in \mathbb{R}^{n_\theta}$, only the gradient of the ll in (5) is required, which is needed for the estimation of $\theta$ anyway.
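The update in (14) can be sketched with plain linear algebra. The sanity check below verifies the matrix-inversion-lemma form against the direct batch inverse; the dimensions are toy values and the u-vectors are random placeholders rather than actual ll gradients.

```python
import numpy as np

def recursive_cov_update(P, U):
    """One step of the recursive covariance update, cf. (14): given
    P_{n-1} (n_theta x n_theta) and U_n = [u_1n, ..., u_Mn]
    (n_theta x M), only an M x M matrix is inverted instead of the
    full n_theta x n_theta Hessian."""
    M = U.shape[1]
    S = np.eye(M) + U.T @ P @ U            # small M x M matrix
    return P - P @ U @ np.linalg.solve(S, U.T @ P)

# Sanity check on a toy problem: the recursion must agree with the
# direct (batch) inverse (P_0^{-1} + sum_n U_n U_n^T)^{-1}.
rng = np.random.default_rng(1)
n_theta, M, N = 6, 3, 5
P0 = 2.0 * np.eye(n_theta)
P = P0.copy()
Us = [rng.standard_normal((n_theta, M)) for _ in range(N)]
for U in Us:
    P = recursive_cov_update(P, U)
info = np.linalg.inv(P0) + sum(U @ U.T for U in Us)
assert np.allclose(P, np.linalg.inv(info), atol=1e-8)
```

The per-step cost is dominated by the rank-M correction, which is what gives the overall complexity quoted above.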

Approximating the covariance
An nn often has millions of parameters, which might result in the amount of data needed to store $P^\theta_N$ being larger than the available memory capacity. A common approach to handle this is to approximate $P^\theta_N$ as a block-diagonal matrix (Martens & Grosse, 2015). Another common approach is to use the approximation

$P^\theta_N \approx \begin{bmatrix} 0 & 0 \\ 0 & P^{\theta_r}_N \end{bmatrix},$ (15)

where $P^{\theta_r}_N$ denotes the covariance of the estimated parameters $\theta_r$ corresponding to the weights and biases of the $r$ last layers in the nn (Kristiadi, Hein, & Hennig, 2020; Malmström et al., 2022).
Depending on the number of included layers, this approximation might be more or less accurate.To compensate for the approximation error when doing the marginalization in (8), a scaling of P θr N with factor T c ≥ 1 can be introduced.The scaling can be estimated from validation data in a similar manner to the temperature scaling T in Section 2.3.

Efficient MC sampling
With access to the parameter covariance, one can propagate the uncertainty in the parameters to uncertainty in the prediction with the delta method using the principle of marginalization. Plugging the approximate Gaussian distribution (11) into (8a), the mc approximation (9a) becomes

$\hat{p}(y^\star = m|x^\star) \approx \frac{1}{K}\sum_{k=1}^{K} f_m(x^\star; \theta^{(k)}), \quad \theta^{(k)} \sim \mathcal{N}(\hat{\theta}_N, P^\theta_N).$ (16)

This is a feasible solution to the problem, but it comes with a high computational cost, since it requires drawing mc samples from a high-dimensional Gaussian distribution and evaluating the whole network for each sample.

Marginalization using the delta method
The delta method, see e.g., Liero and Zwanzig (2011) and Malmström (2021), relies on linearization of the nonlinear model $g(x; \theta)$ and provides a remedy to the problem of sampling from the high-dimensional Gaussian distribution. The idea is to project the uncertainty in the parameters to uncertainty in the prediction before the softmax normalization (3b), thereby drastically reducing the dimension of the distribution that must be sampled. Using the delta method, the uncertainty in the parameters can be propagated to the prediction before the softmax normalization as

$g(x^\star; \theta) \sim \mathcal{N}(g; \hat{g}_N, P^g_N),$ (17a)
$\hat{g}_N = g(x^\star; \hat{\theta}_N),$ (17b)
$P^g_N = J P^\theta_N J^\top, \quad J = \nabla_\theta\, g(x^\star; \theta)\big|_{\theta = \hat{\theta}_N}.$ (17c)

Using this Gaussian approximation of the parameter distribution, the mc approximation of the marginalization integral becomes

$\hat{p}(y^\star = m|x^\star) \approx \frac{1}{K}\sum_{k=1}^{K} \mathrm{softmax}_m\big(g^{(k)}\big), \quad g^{(k)} \sim \mathcal{N}(\hat{g}_N, P^g_N).$ (18)

To summarize, the main idea of the delta method is linearization performed in two steps. First, the parameter uncertainty is computed using (11), and second, the uncertainty is propagated to the output of the model by (17). Hence, the delta method is a local linear approach that gives a linear approximation of a nonlinear model.
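A sketch of the resulting low-dimensional mc scheme in (17)-(18), with made-up logits and logit covariance standing in for the quantities a trained nn would produce:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def delta_method_pmf(g_hat, P_g, K=10000, rng=None):
    """MC estimate of the marginalized pmf, cf. (18): draw logit
    samples from the low (M)-dimensional Gaussian N(g_hat, P_g),
    push each through the softmax, and average. Also returns the
    sample covariance of the pmf samples, cf. (9b)."""
    rng = rng or np.random.default_rng(0)
    samples = rng.multivariate_normal(g_hat, P_g, size=K)
    pmfs = np.apply_along_axis(softmax, 1, samples)     # K x M
    return pmfs.mean(axis=0), np.cov(pmfs.T)

g_hat = np.array([2.0, 0.5, -1.0])   # hypothetical logits g(x*; theta_hat)
P_g = 0.5 * np.eye(3)                # hypothetical logit covariance P_N^g
p_hat, P_f = delta_method_pmf(g_hat, P_g)
assert abs(p_hat.sum() - 1.0) < 1e-12 and p_hat.argmax() == 0
```

Note that only an M-dimensional Gaussian is sampled, and no forward pass through the network is needed per sample, which is the computational point of the delta method.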

Fusion
Suppose there is a set of independent classifiers, each one providing a marginal distribution $\mathcal{N}(g_{N,c}; \hat{g}_{N,c}, P^g_{N,c})$, $c = 1, \ldots, C$. These can be fused in the information form as

$\hat{g}_N = P^g_N \sum_{c=1}^{C} \big(P^g_{N,c}\big)^{-1} \hat{g}_{N,c},$ (19a)
$P^g_N = \Big(\sum_{c=1}^{C} \big(P^g_{N,c}\big)^{-1}\Big)^{-1}.$ (19b)

If a single classifier is used to classify a set of inputs $x^\star_c$, $c = 1, \ldots, C$, known to belong to the same class $y^\star$, then these predictions can be fused as

$\hat{g}_N = P^g_N \sum_{c=1}^{C} \big(P^g_{N,c}\big)^{-1} \hat{g}_{N,c},$ (20a)
$P^g_N = \Big(\sum_{c=1}^{C} \big(P^g_{N,c}\big)^{-1}\Big)^{-1},$ (20b)

where $\hat{g}_{N,c} = g(x^\star_c; \hat{\theta}_N)$ and $P^g_{N,c}$ is given by (17c) evaluated at $x^\star_c$. After fusion, the mc sampling in (18) can be applied as before to compute the pmf estimate.
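The information-form fusion above can be sketched as follows; the two-class logit estimates are invented for illustration.

```python
import numpy as np

def fuse_gaussians(g_hats, P_gs):
    """Information-form fusion of C independent Gaussian logit
    estimates N(g_c, P_c): precisions (inverse covariances) add, and
    the fused mean is the precision-weighted average."""
    infos = [np.linalg.inv(P) for P in P_gs]
    P_fused = np.linalg.inv(sum(infos))
    g_fused = P_fused @ sum(I @ g for I, g in zip(infos, g_hats))
    return g_fused, P_fused

# Two classifiers with equal covariances: the fused mean is the
# plain average and the fused covariance is halved.
g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
P = np.eye(2)
g_f, P_f = fuse_gaussians([g1, g2], [P, P])
assert np.allclose(g_f, [0.5, 0.5]) and np.allclose(P_f, 0.5 * np.eye(2))
```

The fused pair (g_f, P_f) can then be fed to the same mc sampling step as a single-classifier prediction.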

Risk assessment
Risk assessment can be defined as computing the probability $r_m$ that the class probability $p(y^\star = m|x^\star)$ exceeds a threshold $\gamma_m$. Using the samples from (18), it can be estimated as

$r_m \approx \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\big(\mathrm{softmax}_m(g^{(k)}) > \gamma_m\big).$ (21)

Here $\mathbb{1}(a > b)$ denotes the indicator function, which is one if $a > b$ and zero otherwise. Note that the parameter $\gamma_m$ is free to choose and corresponds to the accepted failure rate.
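Given logit samples of the kind already drawn for the pmf estimate, the risk estimate is a one-line indicator average. A sketch with hypothetical logits, covariance, and threshold:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def risk(g_hat, P_g, m, gamma, K=20000, rng=None):
    """MC estimate of the risk r_m: the probability that class m's
    predicted probability exceeds the threshold gamma."""
    rng = rng or np.random.default_rng(0)
    samples = rng.multivariate_normal(g_hat, P_g, size=K)
    pm = np.array([softmax(s)[m] for s in samples])
    return np.mean(pm > gamma)          # indicator-function average

# Hypothetical two-class example where class 1 dominates: the risk
# that its probability exceeds 0.5 should be essentially one.
r = risk(np.array([0.0, 3.0]), 0.1 * np.eye(2), m=1, gamma=0.5)
assert 0.99 <= r <= 1.0
```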

Evaluation metrics

Suppose now we have a validation data set $\{(x^\bullet_n, y^\bullet_n)\}_{n=1}^{N_v}$. In this section, we will investigate metrics for validating the estimated pmf $f(x^\bullet_n|\mathcal{T})$ obtained from (18). The inherent difficulty is that the validation data, just as the training data, consists of inputs and class labels, not the actual pmf. Indeed, there is a lack of unified qualitative evaluation metrics (Gawlikowski et al., 2023). That being said, some of the most commonly used metrics are classification accuracy, ll, Brier score, and expected calibration error (ece) (Wójcik, Grela, Śmieja, Misztal, & Tabor, 2022).
Both the negative ll and the Brier score are proper scoring rules, meaning that they emphasize careful and honest assessment of the uncertainty and are minimized for the true probability vector (Gneiting & Raftery, 2007). However, neither of them is a measure of calibration, i.e., the reliability of the estimated pmf.
Out of these metrics, only the ece considers calibration. Hence, here the ece is the most important metric when evaluating a method used to measure the uncertainty (Guo et al., 2017; Vaicenavicius et al., 2019). To evaluate a method that quantifies uncertainty in the prediction, one can also measure its capability to distinguish between in- and out-of-distribution (id and ood) inputs, i.e., to find outliers.

Brier score, accuracy, and reliability diagram
The Brier score (Gneiting & Raftery, 2007) corresponds to the least squares fit

$\mathrm{BS} = \frac{1}{N_v} \sum_{n=1}^{N_v} \sum_{m=1}^{M} \big(\delta_{m, y^\bullet_n} - p(y^\bullet_n = m|x^\bullet_n)\big)^2,$ (22)

where $\delta_{i,j}$ denotes the Kronecker delta function. Furthermore, $p(y^\bullet_n = m|x^\bullet_n)$ denotes a generic pmf estimate.

Accuracy and reliability diagrams are calculated as follows. Calculate the $J$-bin histogram defined by the sets

$B_j = \Big\{ n : \max_m p(y^\bullet_n = m|x^\bullet_n) \in \big(\tfrac{j-1}{J}, \tfrac{j}{J}\big] \Big\}$ (23)

from the validation data. For a perfect classifier, $B_j = \emptyset$ for $j < J$. For a classifier that is just guessing, all sets are of equal size, i.e., $|B_j| = |B_i| \ \forall i, j$. Note that $\max_m p(y^\bullet_n = m|x^\bullet_n) \geq 1/M$, so the first bins will be empty if $J > M$. The accuracy of the classifier is calculated by comparing the size of each set with the actual classification performance within the set. That is,

$\mathrm{acc}(B_j) = \frac{1}{|B_j|} \sum_{n \in B_j} \mathbb{1}(\hat{y}^\bullet_n = y^\bullet_n),$ (24)

where $\hat{y}^\bullet_n = \arg\max_m p(y^\bullet_n = m|x^\bullet_n)$. A reliability diagram is a plot of the accuracy versus the confidence, i.e., the predicted probability frequency. A classifier is said to be calibrated if the slope of the bins is close to one, i.e., when $\mathrm{acc}(B_j) = (j - 0.5)/J$ (Guo et al., 2017).
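A sketch of the Brier score computation; the toy predictions are invented for illustration.

```python
import numpy as np

def brier_score(probs, labels):
    """Brier score: mean squared distance between each predicted pmf
    (rows of probs) and the one-hot encoding of the true label."""
    N, M = probs.shape
    onehot = np.eye(M)[labels]          # Kronecker delta delta_{m, y_n}
    return np.mean(np.sum((onehot - probs) ** 2, axis=1))

# A perfect one-hot prediction of the correct class scores 0, while
# predicting the wrong class with full confidence scores 2.
probs = np.array([[1.0, 0.0], [0.0, 1.0]])
assert brier_score(probs, np.array([0, 1])) == 0.0
assert brier_score(probs, np.array([1, 0])) == 2.0
```

Being a proper scoring rule, the score is minimized in expectation by the true pmf, which is what makes it useful for comparing uncertainty estimates.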

Confidence and expected calibration error
Instead of certainty, from hereon the standard, and equivalent, notion of confidence will be used (Guo et al., 2017; Vaicenavicius et al., 2019). The mean confidence in a set is denoted $\mathrm{conf}(B_j)$ and is defined as

$\mathrm{conf}(B_j) = \frac{1}{|B_j|} \sum_{n \in B_j} \max_m p(y^\bullet_n = m|x^\bullet_n).$ (25)

This is a measure of how much the classifier trusts its estimated class labels. In contrast to the accuracy, it does not depend on the annotated class labels $y^\bullet_n$. Comparing accuracy to confidence gives the ece, defined as

$\mathrm{ece} = \sum_{j=1}^{J} \frac{|B_j|}{N_v} \big|\mathrm{acc}(B_j) - \mathrm{conf}(B_j)\big|.$ (26)

A small value indicates that the confidence is a good measure of the actual performance.
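The binning, accuracy, and confidence comparison above can be sketched as a single ece routine; the bin edges and toy data below are illustrative choices.

```python
import numpy as np

def ece(probs, labels, J=10):
    """Expected calibration error: bin predictions by confidence
    max_m p_m, compare per-bin accuracy with per-bin mean confidence,
    and weight each bin by its relative size."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    N = len(labels)
    total = 0.0
    for j in range(J):
        lo, hi = j / J, (j + 1) / J
        # Bin B_j: confidences in (lo, hi]; the first bin also takes 0.
        in_bin = (conf > lo) & (conf <= hi) if j > 0 else (conf <= hi)
        if in_bin.any():
            acc = np.mean(pred[in_bin] == labels[in_bin])
            avg_conf = conf[in_bin].mean()
            total += (in_bin.sum() / N) * abs(acc - avg_conf)
    return total

# A perfectly confident, perfectly correct classifier has ece = 0.
probs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
labels = np.array([0, 1, 0])
assert ece(probs, labels) == 0.0
```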

Out-of-distribution detection
To evaluate the capability of a method to find ood samples, the so-called area under the receiver operating characteristic curve (auroc) and the area under the precision-recall curve (aupr) are often used (Goodfellow, Bengio, & Courville, 2016). Both the auroc and the aupr are between zero and one, and for a perfect detector the value is one. Intuitively, the auroc is a measure of the proportion of detections, while the aupr measures the proportion of detections that are correct. Another approach to evaluate the capability of ood detection is to study the difference in entropy between id and ood inputs, $\Delta_s = S_{\mathrm{id}} - S_{\mathrm{ood}}$, where the entropy $S$ is defined as

$S = -\sum_{m=1}^{M} p(y^\star = m|x^\star) \log p(y^\star = m|x^\star).$ (27)
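The entropy in (27) is straightforward to compute; a small sketch, where the eps clipping is an implementation choice to avoid taking the logarithm of zero.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a pmf, cf. (27): S = -sum_m p_m log p_m."""
    p = np.clip(p, eps, 1.0)    # guard against log(0)
    return -np.sum(p * np.log(p))

# A confident (peaked) pmf has low entropy; the uniform pmf over M
# classes attains the maximum, log(M).
assert entropy(np.array([1.0, 0.0, 0.0])) < 1e-10
assert abs(entropy(np.ones(10) / 10) - np.log(10)) < 1e-12
```

Comparing the average entropy over id inputs with that over ood inputs then gives the detection statistic described above.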

Experiment study
To illustrate the application of the proposed method to quantify uncertainty in the prediction, two datasets were used. First, an nn was trained using the mnist dataset (LeCun, Cortes, & Burges, 1998) to classify images of handwritten digits. Second, an nn was trained on the cifar10 dataset (Krizhevsky, 2009) to classify images of ten different objects, including e.g., cars, cats, and aircraft. For ood detection, mnist was used as id data and Fashion mnist (fmnist) (Xiao, Rasul, & Vollgraf, 2017) was used as ood data.

Classification setup
For the mnist dataset, a five-layer nn with fully connected nodes was used. For the cifar10 dataset, a LeNet5-inspired structure was used with six convolutional layers followed by four fully connected layers. However, for both datasets, the three last layers were chosen to have the same structure, fully connected with 100, 40, and 10 hidden nodes. To decrease the size of the parameter covariance matrix used by the delta method, as described in Section 3.4, the first part of the nn was assumed fixed and used to create high-level features. Since the structure of the later layers was chosen identically, both models had $n_\theta = 4450$ parameters. To estimate the model parameters $\theta$ of the nn, the adam optimizer (Kingma & Ba, 2015) was used with standard settings. Three and ten epochs were used to train on the mnist dataset and the cifar10 dataset, respectively. Here, l2 regularization of $10^{-4}$ was used, hence the prior $P_0 = 10^{-4} I_{n_\theta}$. When computing $P^\theta_N$, the choice of $P_0$ can be neglected since it is dominated by the Fisher information matrix $\mathcal{I}_\theta$.

Illustration of the uncertainties in the predictions
The low-dimensional space of the output from $g(x^\bullet_n; \hat{\theta}_N)$ is particularly interesting to study when trying to understand how the uncertainty in the parameter estimate $\hat{\theta}_N$ affects the classification. Even if the parameter covariance $P^\theta_N$ is constant and only depends on the training data, the covariance $P^g_N$ depends on the input $x^\bullet_n$. Fig. 1 illustrates this via an example where we concentrate our study on the decision between just a subset of the classes in the mnist dataset, even though the final decision is over all classes. More generally, for an input $x^\bullet_n$ that is located in a dense region in the space of the training data, the covariance $P^g_N$ is small, but for an input $x^\bullet_n$ that is very far from the training data in some norm, the covariance $P^g_N$ can be quite large. This indicates that the parameter estimate is quite sensitive in some directions. Even though the estimates of the pmf look similar, by studying the unnormalized prediction $g(x^\bullet_n; \hat{\theta}_N)$ it is clear that the prediction at the top has a higher uncertainty compared to the one at the bottom.

Results on quantifying the uncertainty
Six different methods to quantify the uncertainty in the classification, i.e., to estimate $p(y^\bullet_n = m|x^\bullet_n)$, were evaluated. These are: (i) the standard method, i.e., the softmax output $p(y^\bullet_n = m|x^\bullet_n; \hat{\theta}_N)$; (ii) temperature scaling as in (7) (Guo et al., 2017); (iii) the deep-ensemble method (Lakshminarayanan et al., 2017), where an ensemble of 50 trained models is used; (iv) the mc-dropout method in Gal and Ghahramani (2016), where 50 samples of the parameters are used to create the ensemble; (v) the proposed method, i.e., (18); and (vi) the proposed method, i.e., (18), but with the covariance $P^g_N$ in (17) scaled with a factor $T_c$.
The parameters $T$ and $T_c$ are estimated such that the ece is minimized on validation data. In Table 1, the accuracy, ll, Brier score, and ece are shown for the six methods, evaluated on both the mnist and cifar10 datasets. Table 1 shows that the proposed method attains the lowest ece for both datasets, while still having reasonably good performance in terms of accuracy, ll, and Brier score. Neither computing the uncertainty in the prediction using the softmax (i), deep ensembles (iii), mc-dropout (iv), nor the proposed method without scaled covariance (v) gives calibrated estimates of the uncertainty. To get well-calibrated estimates of the uncertainty, either the proposed method with scaled covariance (vi) or temperature scaling (ii) should be used. However, increasing the scaling factor decreases the ll. Hence, there is a trade-off between high ll and low ece. In Fig. 2, the reliability diagram for the calibrated and non-calibrated versions of the proposed method is shown. Here one can see that the parameter $T_c$ is necessary in order to obtain a calibrated uncertainty. For ood detection, Table 2 and Fig. 3 show that the proposed method is superior to temperature scaling (which is the only other calibrated method), while having performance similar to that of a deep ensemble, whose performance is the best. However, recall that the deep ensemble requires training of 50 models and hence has significantly higher computational complexity than the proposed method.
On a standard laptop (Intel Core i7), computing (17c) takes 930 s, whereas the sampling in (18) is negligible in comparison (100 samples in 5 s). This is compared to sampling the parameters and then performing a forward pass per sample (as in mc-dropout and deep ensembles), where the forward passes alone take 50 s. This is without considering that methods such as deep ensembles require training multiple nns.

Summary and conclusion
A method to estimate the uncertainty in classifications performed by an nn has been proposed. The method also enables information fusion in applications where predictions from nns are used, as well as statistical risk assessment. The proposed method is based on a local linear approach and consists of two steps. In the first step, an approximation of the posterior distribution of the estimated nn parameters is calculated. This is done using a Laplace approximation, where the covariance of the parameters is calculated recursively using the structure of the Fisher information matrix. In the second step, an estimate of the pmf is calculated where the effect of the uncertainty in the estimated parameters is considered using marginalization over the posterior distribution of the parameter estimate. This is done by mc sampling in the low-dimensional space of the output of the last layer of the nn, which keeps the computational cost low.

Table 1: Computed performance measures for the two datasets. The arrows indicate whether a high or low value is preferable.

Fig. 2: Reliability diagrams for prediction on the mnist dataset using the proposed method in (18) with and without the scaling parameter $T_c$. In black is a calibration line.