Epistemic Uncertainty Quantiﬁcation in Deep Learning Classiﬁcation by the Delta Method

The Delta method is a classical procedure for quantifying epistemic uncertainty in statistical models, but its direct application to deep neural networks is prevented by the large number of parameters P . We propose a low cost variant of the Delta method applicable to L 2 -regularized deep neural networks based on the top K eigenpairs of the Fisher information matrix. We address eﬃcient computation of full-rank approximate eigendecompositions in terms of either the exact inverse Hessian, the inverse outer-products of gradients approximation or the so-called Sand-wich estimator. Moreover, we provide a bound on the approximation error for the uncertainty of the predictive class probabilities. We observe that when the smallest eigenvalue of the Fisher information matrix is near the L 2 -regularization rate, the approximation error is close to zero even when K (cid:28) P . A demonstration of the methodology is presented using a TensorFlow implementation, and we show that meaningful rankings of images based on predictive uncertainty can be obtained for two LeNet-based neural networks using the MNIST and CIFAR-10 datasets. Further, we observe that false positives have on average a higher predictive epistemic uncertainty than true positives. This suggests that there is supplementing information in the uncertainty measure not captured by the classiﬁcation alone.


Introduction
The predictive probabilities at the output layer of neural network classifiers are often misinterpreted as model (epistemic) uncertainty [6]. Bayesian statistics provides a coherent framework for representing uncertainty in neural networks [20,8], but has not so far gained widespread use in deep learning -presumably due to the high computational cost that traditionally comes with second-order methods. Recently, [6] developed a theoretical framework which casts dropout at test time in deep neural networks as approximate Bayesian inference. Due to its mathematical elegance and negligible computational cost, this work has caught great interest in a variety of different fields [1,19,38], but has also generated questions as to what types of uncertainty these approximations ac-tually lead [26,27] and what types are relevant [13]. For a general treatment of uncertainty in machine learning, we refer to [12].
Epistemic uncertainty is commonly understood as the reducible component of uncertainty -the uncertainty of the model itself, or its parameters. In our context this amounts to the uncertainty in the estimated class probabilities due to limited amount of training data. While the epistemic uncertainty can be reduced by increasing the amount of training data, the other component of uncertainty known as aleatoric uncertainty, is irreducible and stems from the uncertainty in the label assignment process [32]. However, in this paper we only address the epistemic part, and treat the labels as constant when estimating uncertainty.
Our approach goes back to the work of [20], and we show that the above reasoning leads to the method known as the Delta method 1 [11,24,14] in statistics. However, as the Delta method depends on the empirical Fisher information matrix which grows quadratically with the number of neural network parameters P -its direct application in modern deep learning is prohibitively expensive. We therefore propose a low cost variant of the Delta method applicable to L 2 -regularized deep neural networks based on the top K eigenpairs of the Fisher information matrix. We address efficient computation of full-rank approximate eigendecompositions in terms of either the exact inverse Hessian, the inverse outer-products of gradients (OPG) approximation or the so-called Sandwich estimator. Further, we exhibit the fact that deep learning classifiers tend to be heavily over-parameterized. This leads to flat Fisher information eigenvalue spectra which we show can be exploited in terms of a simple linearization.
The theoretical Fisher information matrix is always positive (semi)-definite, and we constrain our empirical counterpart to be the same. Recent research [29,30,2,7], consistent with our own observations, show that the exact Hessian after training is rarely positive definite in deep learning. To mitigate this, we propose a simple correction of the right tail of the Hessian eigenvalue spectrum to achieve positive definiteness. We corroborate our choice with two observations: a) negative eigenvalues of the Hessian matrix are highly stochastic across different weight initialization values, and b) correcting the eigenvalue spectrum to achieve positive definiteness yields stable predictive epistemic uncertainty estimates which are perfectly correlated with the estimates based on the OPG approximation -which by construction is always positive (semi)-definite [21].
As the computational cost of the exact inverse Hessian matrix or its full eigendecomposition is prohibitively expensive in deep learning, we propose to use the Lanczos iteration [36] in combination with Pearlmutter's technique [28] to compute the needed eigenpairs. Consequently, the matrix inversion will be straightforward, and the net computational complexity will be O(SP N ) time and O(KP ) space, where N is the number of training examples and S is the number of Lanczos-Pearlmutter steps required to compute K eigenpairs.
Also the inverse OPG approximation or its full eigendecomposition is prohibitively costly in deep learning. Even if we disregard the inversion and the quadratic space complexity, one is first left to compute and store the N × Pdimensional Jacobian matrix. In deep learning software provisions based on backward-mode automatic differentiation, only the sum of mini-batch gradients can be computed efficiently. We therefore propose to compute mini-batches of the Jacobian using efficient per-example gradients [25] in combination with incremental singular value decompositions [18]. Since the OPG approximation can be written as a Jacobian matrix product, its eigenvectors will be the right singular vectors of the Jacobian, and its eigenvalues the squared singular values. This leads to a computational complexity of O(KP N ) time and O(KP ) space, also accounting for the inversion. The Sandwich estimator requires both the inverse Hessian and the OPG approximation, and is thus O(max{K, S}P N ) time and O(KP ) space.
This work is a continuation of [25], and we here introduce the fully deterministic [23] open sourced TensorFlow module pydeepdelta [33], and illustrate the methodology on two LeNet-based convolutional neural network classifiers using the MNIST and CIFAR-10 datasets.
The paper is organized as follows: In Section 2 we give definitions which will be used throughout the paper. In Section 3 we review the Delta method in a deep learning classification context, and in Section 4 we outline the details of the proposed methodology. In Section 5 we demonstrate the method, and finally, in Section 6 we summarize the paper and give some concluding remarks and ideas of future work.

Deep Neural Networks
We use a feed-forward neural network architecture with dense layers to introduce terminology and symbols, but emphasize that the theory presented in the paper is directly applicable to any L 2 -regularized architecture.

Architectural
A feed-forward neural network is shown in Figure 1. There are L layers l = 1, 2, ..., L with T l neurons in each layer. The input layer l = 1, is represented by the input vector x n = x n,1 x n,2 . . . x n,T1 T where n = 1, 2, ..., N is the input index. Furthermore, there are L−2 dense hidden layers, l = 2, 3, ..., L−1, and a dense output layer l = L, each represented by weight matrices W (l−1) ∈ R T l ×T l−1 , bias vectors b (l) ∈ R T l and vectorized activation functions a (l) .

Parameter Vectors
The total number of parameters in the model shown in Figure 1 can be written, where P (l) denotes the number of parameters in layer l. By definition, P (1) = 0 since the input layer contains no weights or biases. Furthermore, we define parameter vectors representing the layer-wise weights and biases as follows, for l = 2, 3, . . . , L, with components ω (l) i , i = P (l−1) + 1, P (l−1) + 2, . . . , P (l) . The notation vec(W ) denotes a row-wise vectorization 2 of the matrix W A×B into a column vector of dimension R AB . In the rest of the paper, we consider the full model and define the parameter vector, . . .

Training, Model and Cost Function
The model function f : R T1×P → R T L associated to the architecture shown in Figure 1 is defined as We use a softmax cross-entropy cost function C : R P → R and require L 2regularization with a rate factor λ > 0, where y n represents the target vector for the nth example (N examples), and whereŷ n = f (x n , ω) represents the corresponding prediction vector obtained by evaluating the model function (4) using the input vector x n and the parameter vector (3). The activation function a (L) : R T L → R T L in the output layer is the vectorized softmax function defined as where exp(·) denotes the vectorized exponential function. Training of the neural network can be defined as finding an 'optimal' parameter vectorω by minimizing the cost function (5),ω = arg min C(ω) 3 The Delta Method The Delta method [11] views a modern deep neural network as a (huge) nonlinear regression. In our classification setting, we regard the labels as constant, and thus the epistemic component of the uncertainty associated with predictions of an arbitrary input example x 0 reduces to the evaluation of the covariance matrix of the network outputs [14]. By a first-order Taylor expansion [10], it can be shown that the covariance matrix of the network outputsŷ 0 , i.e. the model function (4), can be approximated by where is the Jacobian matrix of the model function, and where Σ is the covariance matrix of the model parameter estimateω. For a given x 0 , an approximate standard deviation ofŷ 0 is provided by the formula Equation (10) means that when the neural network predicts for an input x 0 , the associated epistemic uncertainty per class output is determined by a linear combination of parameter sensitivity (e.g. F ) and parameter uncertainty (e.g. Σ). Parameter sensitivity (F ) prescribes the amount of change in the neural network output for an infinitesimal change in the parameter estimates, whereas the parameter uncertainty (Σ) prescribes the amount of uncertainty in the parameter estimates themselves. We apply and compare three different approximations to Σ. The first one is called the Hessian estimator, and is defined by where H is the empirical Hessian matrix of the cost function evaluated atω. The second estimator is called the Outer-Products of Gradients (OPG) estimator and is defined by where the summation part of G corresponds to the empirical covariance of the gradients of the cost function evaluated atω. Finally, the third estimator is known as the Sandwich estimator [5,31] and is defined by Across various fields and contexts, the two famous equations (11) and (12) are often presented and interpreted differently, and the inconsistency in the vast literature is nothing but intriguing. We therefore feel that their appearance in this paper requires some elaboration. Firstly, for the Hessian estimator (11), we note that the differentials act only on the data dependent part of the cost function (5), C n , so the second term, λI, here comes from the second-order derivatives of the L 2 -regularization term. Secondly, for the OPG estimator (12), also here the differentials act on the data dependent part of the cost function, but the crucial detail often confused or let out in the literature comes with the second term, λI: under L 2 -regularization it must be added explicitly in order for G to be asymptotically equal to H (See the Appendix 8 for a proof) -as is the primary motivation of the OPG estimator as a plug-in replacement of the Hessian estimator in the first place. If let out, G will almost always be singular [37,22], and thus cannot be used in (12).
At this point, we can see that two fundamental difficulties arise when applying the Delta method in deep learning: a) the sheer size of the covariance matrix grows quadratically with P , and 2) the covariance matrix must be positive definite. In other words, we are virtually forced to compute and store the full covariance matrix, and are in terms of the Hessian estimator dependent on that the optimizer can find a true local (or global) minimum of the cost function. Nevertheless, with the OPG and the Sandwich estimators, the second obstacle is virtually inapplicable since they by definition always will be positive definite when λ > 0.
In the next section we present methodology that addresses both these aspects. We present an indirect correction leaving the Hessian estimator positive definite, and introduce methodology with computational time and space complexity which is linear in P .

The Delta Method in Deep Learning
We present our approach to the Delta method in deep learning as a procedure carried out in two phases after the neural network has been trained. See Figure  2. The first phase -the 'initial phase' -is carried out only once, with the scope of indirectly computing full-rank, positive definite approximations of the covariance matrices (11), (12) or (13) based on approximate eigendecompositions of H and G. The second phase -the 'prediction phase' -is carried out hand in hand with the regular neural network prediction process (4), and is used to approximate the epistemic component of the predictive uncertainty governed by (10) using the indirect covariance matrix approximation found in the 'initial phase'.
In the next sections, we address the following aspects of the proposed methods: a) how to efficiently compute eigenvalues and eigenvectors of the Hessian estimator via the Lanczos iteration and exact Hessian vector products, b) how to efficiently compute eigenvalues and eigenvectors of the OPG estimator via incremental singular value decompositions, c) how to combine the former two to obtain an approximation of the Sandwich estimator, and d) how to apply these estimators to efficiently compute an approximation of (10).

Computing Eigenvalues and Eigenvectors of the Covariance Matrix
The full eigendecomposition of the covariance matrix in (10) is defined by where Q ∈ R P ×P is the matrix whose kth column is the eigenvector q k of Σ, and Λ ∈ R P ×P is the diagonal matrix whose elements are the corresponding eigenvalues, Λ kk = λ k . We assume that the eigenvalues are algebraically sorted so that λ 1 ≥ λ 2 ≥ . . . ≥ λ P . Note that in terms of the Hessian estimator, the eigenvalues are precisely the second derivatives of the cost function along the principal axes of the ellipsoids of equal cost, and that Q is a rotation matrix which defines the directions of these principal axes [16]. For the Hessian estimator (11), the Lanczos iteration [36] can be applied to find K < P eigenvalues (and corresponding eigenvectors) in O(SN P ) time and O(KP ) space when Pearlmutter's technique [28] is applied inside the iteration [25]. Pearlmutter's technique can simply be described as a procedure based on two-pass back-propagations of complexity O(N P ) time and O(P ) space to obtain exact Hessian vector products without requiring to keep the full Hessian matrix in memory. The number S denotes the number of Lanczos iterations to reach convergence. We observe that the convergence of the Lanczos algorithm is quite fast in our experiments, and we find that S is practically orders of magnitude less than P .
For the OPG estimator (12), a slightly different approach can be applied. Since the OPG estimator can be written as a Jacobian matrix product [25], we get by the singular value decomposition that its eigenvectors will be the right singular vectors of the Jacobian, and its eigenvalues the squared singular values. Mini-batches of the Jacobian matrix can easily be obtained by standard back-propagation, and so an incremental singular value decomposition [18,4] can be applied to each mini-batch. The computational cost is thus O(KN P ) time and O(KP ) space. The Sandwich estimator combines the Hessian and the OPG approximation via the product (13), and thus has a computational complexity of O(max{K, S}N P ) time and O(KP ) space. The computational complexity of the outlined methodology is summarized in Table 1 3 .
Our TensorFlow module pydeepdelta [33] utilizes the Lanczos implementation available in the SciPy distribution [35], as well as the incremental singular value decomposition available in the scikit-learn distribution [34].

The Eigenvalue Spectra of H and G
To better understand the proposed covariance approximations, we first need to explore the prototypical deep learning eigenvalue spectrum of the empirical Hessian matrix H (11) and the empirical covariance of the gradients G (12). To this end, we introduce two LeNet-based convolutional neural network classifiers using the MNIST and CIFAR-10 datasets, and draw parallels to the findings in the literature.

Classifier Architectures, Parameters and Training
The MNIST classifier has L = 6 layers, layer l = 1 is the input layer represented by the input vector. Layer l = 2 is a 3 × 3 × 1 × 32 convolutional layer followed by max pooling with stride equal to 2 and with a ReLU activation function. Layer l = 3 is a 3 × 3 × 32 × 64 convolutional layer followed by max pooling with a stride equal to 2, and with ReLU activation function. Layer l = 4 is a 3 × 3 × 64 × 64 convolutional layer with ReLU activation function. Layer l = 5 is a 576 × 64 dense layer with ReLU activation function, and the output layer l = 6 is a 64 × T L dense layer with softmax activation function, where the number of classes (outputs) is T L = 10. The total number of parameters is P = 93322.
The CIFAR-10 classifier has L = 6 layers, layer l = 1 is the input layer represented by the input vector. Layer l = 2 is a 3 × 3 × 3 × 32 convolutional layer followed by max pooling with stride equal to 2 and with a ReLU activation function. Layer l = 3 is a 3 × 3 × 32 × 64 convolutional layer followed by max pooling with a stride equal to 2, and with ReLU activation function. Layer l = 4 is a 3 × 3 × 64 × 64 convolutional layer with ReLU activation function. Layer l = 5 is a 1024 × 64 dense layer with ReLU activation function, and the output layer l = 6 is a 64 × 10 dense layer with softmax activation function, where the number of classes (outputs) is T L = 10. The total number of parameters is P = 122570.
We apply random normal weight initialization and zero bias initialization. We use (5) as the cost function with a L 2 -regularization rate λ = 0.01. We utilize the Adam optimizer [15,3] with a batch size of 100, and apply no form of randomized data shuffling. To ensure convergence (e.g. ||∇C(ω)|| 2 ≈ 0) we apply the following learning rate schedules given by the following

The Eigenvalue Spectrum Approximation
The general assumption in deep learning is that H after training is not positive definite and mostly contain eigenvalues close to zero [29,30,2,7,9,37]. The same holds true for G although it by definition must at least be positive semidefinite [21]. However, given the discussion in Section 3, we know that L 2regularization with rate λ/2 has the effect of shifting the eigenvalues of H and G upwards by λ.
To test this hypothesis, we study the K = 1500 algebraically largest and the K = 1500 algebraically smallest eigenvalues of H and G for 16 trained instances of the MNIST network defined in Section 4.2.1. These sixteen networks are thus only distinguished from each other by a different random weight initialization prior to training. The two corresponding log-scale eigenvalue magnitude spectra are shown in Figure 3. Firstly, we note that in the midpoint gaps of the spectra, there are P −2K = 90195 'missing' central eigenvalues which we have not computed. Since the eigenvalues are sorted in decreasing order, all the central eigenvalues must be close to the L 2 -regularization rate λ. We refer to this part of the eigenvalue spectrum as the gap. Secondly, we note that the confidence intervals in the plots are taken across instance space, thus telling how the eigenvalue spectrum change based on the 16 random weight initializations. In both plots, the blue confidence interval tells that the largest eigenvalues of H and G (called left tail) are stable across the 16 trained networks, but the smallest eigenvalues of H are changing dramatically (called right tail, left plot). On the contrary, all the eigenvalues of G are stable. Thirdly, as shown by the green vertical dotted line in the upper plot representing the mean zero-crossing, H is clearly not positive definite -even with L 2 -regularization. The green confidence interval around the zero-crossing shows that the number of negative eigenvalues also change across the networks.
In [9] it was hypothesized that negative Hessian eigenvalues are caused by a discrepancy between the empirical Hessian (e.g. H) and its theoretical counterpart (expected Hessian) in which the summation of (11) is replaced with an expectation so that effectively N → ∞. They showed that as N grows (holdingω fixed), the empirical right tail grows toward λ whereas the rest of the spectrum is stable. Supported by the fact that H and G will be equal in expectation (Appendix 8), the expected Hessian eigenvalue spectrum might be more similar to that of G where all the eigenvalues are greater than equal to λ. In line with these ideas and the empirical evidence presented in Figure 3, we assume that all the smallest eigenvalues of H in the right tail are inherently noisy, and should not be used by the Hessian estimator. Therefore, with reference to Figure 4, for the Hessian estimator, we a) calculate all the eigenpairs in the left tail, b) approximate all the eigenvalues in the gap and c) extrapolate the eigenvalues from the gap into the right tail. The eigenvectors corresponding to the gap and right tail can implicitly be accounted for by orthonormality as discussed in the next section. For the OPG estimator, the same principle applies apart from that the extrapolation inherently becomes a part of the gap subspace approximation because we know that G always is positive definite when λ > 0. Finally, for the Sandwich estimator, we simply apply the aforementioned procedures and estimate the product (13).

Closing the Gap
Based on the observations in the previous section, we now propose a partitioning of the eigendecomposition which reveals that full-rank, positive definite approximations of the Hessian and OPG estimators can be obtained by computing only the eigenpairs corresponding to the K algebraically largest eigenvalues of H and G respectively. Finally, we show how to use these approximations to construct an approximation of the Sandwich estimator.

The Hessian and OPG Estimators
In terms of the Hessian and OPG estimators, the full eigendecomposition of the covariance matrix can be partitioned into three subspaces as shown in Figure  4 Σ This decomposition applies to both Σ H (11) and Σ G (12), and thus we have omitted the superscripts in our notation. In practice, the two merely differs by which of the two matrices H and G the calculated eigenpairs come from. The subscript 'G' denotes the gap subspace which is based on eigenvectors with eigenvalues λ K+1 to λ P −K−1 . Subscript 'L' denotes the left tail subspace and is based on eigenvectors with eigenvalues λ 1 to λ K . Finally, the subscript 'R' denotes the right tail subspace which is based on eigenvectors with eigenvalues λ P −K to λ P . Accordingly, we have that Q L ∈ R P ×K , Λ L ∈ R K×K , Q G ∈ R P ×(P −2K) , Λ G ∈ R (P −2K)×(P −2K) , Q R ∈ R P ×K and Λ R ∈ R P ×K . If λ K ≈ λ we can safely assume that all the eigenvalues in the gap subspace must be close to λ. In line with [9] and the empirical evidence presented in Figure 3, we assume that all the eigenvalues in the right subspace are inherently noisy, and should not be used by the Hessian estimator. Consequently, we assume that also the eigenvalues in the right subspace are approximately equal Figure 4: In terms of its eigenvalue spectrum, the covariance matrix can be partitioned as given by Equation (15): the left tail subspace (eigenpairs computed), the gap subspace (eigenvalues approximated, eigenvectors implicitly found by orthonormality) and the right tail subspace (eigenvalues extrapolated, eigenvectors implictly found by orthonormality).
to λ. Since the OPG estimator is always positive definite when λ > 0, the same assumption also holds true.
With reference to Figure 4, there are now two possible extreme conditions: a) when all the eigenvalues in the gap and right subspaces are set to λ K (blue), or b) when all the eigenvalues in the gap and right subspaces are set to λ (green). By defining λ (purple) as the harmonic mean of λ and λ K , and λ as the midpoint of their reciprocals, it follows that λ −1 ± λ will enclose the interval [λ −1 K , λ −1 ]. The covariance matrix can now be approximated by with a worst-case approximation error ∆ given by such that Σ is bounded by Σ ± ∆. Since Q is an orthonormal basis, we see that it is possible to express (17) and (18) without an explicit need to compute any of the eigenvectors relative to the gap nor right tail subspaces because Inserting (17) into (10) with use of (19), yields the final form of the approximation to the uncertainty associated with prediction of x 0 with a worst-case approximation error δ given by such that σ 2 (x 0 ) is bounded by σ 2 (x 0 ) ± δ.
In terms of standard deviations, the worst-case approximation error of σ(x 0 ) is given by such that σ(x 0 ) is bounded by σ(x 0 ) ± . Lastly, we define an 'uncertainty score' (which we will use later to rank images) by summing the variances per class output (class variance), and then take the square root to get the total uncertainty in standard deviations with the corresponding worst-case approximation error score given by, such that the true quantity is bounded by σ score (x 0 ) ± score . We note that the worst-case approximation errors (21), (22) and (24) are functions of x 0 but we have notationally dropped this from the equations to avoid cluttering. The approximation errors should be thought of as an uncertainty of the predictive uncertainty which accounts for the worst-case loss of not computing the gap subspace explicitly. Since the right tail subspace can be extrapolated when H is not positive definite, the concept of an approximation error for the Hessian estimator must be used carefully. At this point we make a few comments regarding Equation (20). The first term on the right hand side, Q L Λ −1 L Q T L , corresponds to a low-rank approximation of the covariance matrix based on K explicitly computed principal eigenpairs. However, when the second term, λ −1 (I − Q L Q T L ), is added -the approximation becomes full-rank. When accounting for the left and right multiplication of the sensitivity matrix F , the per-class predictive uncertainties of x 0 can be interpreted as weighted sums of the squared sensitivities in the directions expressed by the eigenbasis Q using the inverse eigenvalues as weights. Hence, for the low-rank approximation -regardless of the sensitivity -the contribution to the predictive uncertainty will be zero in directions k > K, whereas for the full-rank approximation -the contribution can still be high. We will come back to this when we discuss out-of-distribution examples in Section 5.

The Sandwich Estimator
The approximation of the Sandwich estimator is defined by We introduce two separate linearization constants for the approximation of the gap (and right tail) subspace of G and H −1 using the harmonic means The approximation of H −1 is thus given by and the approximation of G given by The superscripts 'H' and 'G' are used to distinguish the eigenvectors and eigenvalues of H and G respectively. By inserting (28) and (29) into (25) and working out the product, we define the following eight matrices The uncertainty associated with prediction of x 0 can now be written with the worst-case approximation error given by such that σ 2 (x 0 ) is bounded by σ 2 (x 0 ) ± δ. In terms of standard deviations, the approximation error is readily found by inserting (38) and (39) into (22).

On the Relation Between the Effective Number of Parameters and K
In [20], the so-called effective number of parameters is defined in terms of the eigenvalues of the Hessian matrix. It is noted that directions in parameter space for which the eigenvalues are close to λ do not contribute to the number of good parameter measurements. Therefore, the effective number of parameters is a measure of the number of parameters which are well determined by the training data. In other words, when we select K so that λ K ≈ λ, we loosely cover the data dependent part of the Hessian matrix (first term of right hand side of (5)) and can therefore expect that K will be a crude estimate of the number of effective parameters. As seen by Equations (21) and (39), the approximation error will be zero when the smallest eigenvalue λ K in the left tail subspace (of H and G) is exactly equal to the L 2 -regularization rate λ.

Demonstration and Proof of Concept
In the following Section we explore and demonstrate the approximate predictive epistemic uncertainty estimate governed by (10) for the two LeNet-based neural network classifiers that were introduced in Section 4.2.1. We establish by the use of regressions that the three estimators (11)-(13) yield close to perfectly correlated predictive epistemic uncertainty estimates for both of the classifiers. Figure 5 shows nonparametrically smoothed versions of the predictive epistemic uncertainty for the three proposed estimators against class probability for all the images in the MNIST and CIFAR-10 test sets. Clearly, the three estimators yield close to identical results. Further, we observe that the average predictive epistemic uncertainty associated with false positives (yellow line) is higher than for true positives (blue line). The banana-shaped appearance of these plots suggests that there is a negative quadratic relationship between probability and uncertainty. The explanation for this is attributed to the softmax activation function whose gradient (e.g. sensitivity F ) will always be weighted by a quantity which is negative quadratic in probability (e.g.ŷ (1 −ŷ)). The evolution of the nonparametrically smoothed uncertainty levels and approximation errors for the OPG estimator as functions of the number of computed eigenpairs K and class probability is displayed in Figure 6. As expected, for a growing K, the approximation errors diminish and the uncertainty stabilizes. Although we do not display similar plots for the other two estimators, we note that for MNIST, the approximation errors are smallest for the OPG estimator, followed by the Hessian estimator and the Sandwich estimator. The larger the difference between λ and the smallest eigenvalue λ K , the higher the average approximation error. As seen by the eigenvalue spectra in Figure 3, the drop-off rate towards λ is faster for G, thus explaining why the OPG estimator leads to the lowest approximation errors on MNIST. We note that since the Sandwich estimator is dependent on both the approximation of H and G, its approximation errors are not unexpectedly the highest. Further-more, the fall-off rate towards λ in the eigenvalue spectrum for CIFAR-10 is slightly lower than for MNIST. This means that the CIFAR-10 classifier has a greater number of effective parameters -and thus requires a higher K to achieve acceptable approximation error levels. This fact is evident by Figure 6, where we see that the OPG approximation errors for CIFAR-10 are dropping off to zero slower than for MNIST.

The Distribution of Approximate Predictive Epistemic Uncertainty
For all three estimators, it is evident by Figure 6 that most of the contribution to the predictive epistemic uncertainty comes from the left subspace corresponding to the largest eigenvalues of H and G. This observation can be counter-intuitive since it is the directions with the smallest eigenvalues that will be the largest contributors to the variance when accounting for the inversions in (11), (12) or (13). The explanation for this phenomenon is attributed to the sensitivity F (9). We observe that the training and test set sensitivity drops to zero in directions k for which λ k ≈ λ and is thus canceling with the reciprocals of the smallest eigenvalues in the linear combinations formed by (20) or (38). Nevertheless, as the sensitivity for data not belonging to the same distribution as the training can still be high in these directions, the corresponding predictive epistemic uncertainty can still receive significant contributions from directions k > K. This emphasizes the importance of making the estimators full-rank using the orthonormal basis technique presented in Section 4.3. We add that due to the full-rank property, the number K should be thought of as the number of explicitly computed eigenpairs rather than the number of utilized eigenpairsas the latter will effectively be equal to P .
To illustrate the concept of a low vs. full-rank approximation, Figure 7a displays the uncertainty scores as functions of K for the low and full-rank version of the OPG estimator applied to the out-of-distribution (OoD) example shown in Figure 7b. For reference, we also plot the uncertainty scores for the ten images in the training set with the highest uncertainty scores sorted  in descending order. Comparing the green curve with the blue curve shows that the OoD example has a sensitivity spectrum stretching out far beyond K = 1500 because the low-rank version (blue) has not yet reached the stable level achieved by the full-rank approximation (green) at this K. That the full-rank approximation quickly stabilizes already at around K = 600, can be explained by that it receives contribution from the full spectrum even though only K principal eigenpairs are computed explicitly at each stage. The reference images (black curves) are computing using the full-rank approximation, and are all lower ranked than the OoD example. A detailed comparison of the three estimators is shown in Table 2. By regressing their outcomes against each other, we clearly see that the relative estimated uncertainty levels are near to perfectly correlated since the squared correlations coefficients are close to 1. As seen by the slopes β, only the absolute levels of the estimated uncertainty differ, and since the intercepts α are zero, there are no offsets.

Ranking Images Based on the 'Uncertainty Score'
We propose to validate our results by studying the MNIST and CIFAR-10 images associated with the maximum and minimum amount of total predictive epistemic uncertainty as defined in (23)   The uncertainty score (a) as a function of K for the MNIST OoD example in (b) using the full-rank OPG approximation (green curve) vs. its low-rank counterpart (blue curve) from Equations (20) and (23). The green interval corresponds to the approximation error. The reference images (black curves) are computing using the full-rank approximation, and corresponds to the ten images in the training set with the highest uncertainty scores sorted in descending order.
prisingly, since the squared correlation coefficients in Table 2 are close to 1, the OPG and Sandwich estimators yield almost identical results and are not shown.
The idea is based on the following reasoning: if a neural network classifies an image with low predictive epistemic 'uncertainty score', the image should be easy to classify also for a human. Conversely, if the neural network classifies an image with a high predictive epistemic 'uncertainty score', the image should be hard to classify for a human. Effectively, the predictive epistemic 'uncertainty score' ranks images according to the degree of 'doubt' expressed by the neural network -and by the figures we find striking evidence that this corresponds well with human judgment.

Summary, Concluding Remarks and Further Work
We have presented a computationally tractable framework for traditional Fisher information based uncertainty quantification in deep learning classification. To this end, we have introduced full-rank, positive definite covariance estimators using approximate eigendecompositions in terms of either the Hessian, the OPG approximation or the so-called Sandwich estimator. Further, we have proposed to utilize the Lanczos algorithm in combination with Pearlmutter's technique to compute the needed eigenpairs of the Hessian, and to compute mini-batches of the Jacobian matrix using efficient per-example gradients in combination with incremental singular value decompositions for the OPG approximation. As the  Table 2: Regression comparison of σ H , σ G and σ S across all the images in the MNIST and CIFAR-10 training and test sets. The respective superscripts H, G and S denote Hessian, OPG and Sandwich. The regression intercept, slope and squared correlation coefficient is denoted by α, β and R 2 , respectively.
computational complexity of these methods scale linearly with the number of model parameters, they are therefore suited for deep learning.
We have shown that the three estimators yield close to identical prediction uncertainty estimates when applied on two different LeNet-based neural network classifiers. We have seen that only the top K << P Fisher information matrix eigenpairs contribute significantly to the predictive uncertainty for data in the same distribution as the training set. As this does not necessarily hold true for OoD examples, we have shown that thanks to the full-rank property of the proposed estimators, also these will converge quickly under the same framework.
We have also seen that when images are ranked according to their relative level of predictive epistemic uncertainty, the ordering corresponds well with human judgment: corner cases tend to be highly ranked, and we clearly see why data augmentation is beneficial -since the top ranked images often are prone to unusual perspectives and/or rare colors. Generally, we conjecture that classifiers can benefit from operating in the joint probability-uncertainty domain. As a corroborative example we have empirically shown that false positives appears to have an average higher prediction uncertainty than true positives.
Looking forward, we point at several specific areas of research which could be investigated. The first candidate is to establish how the Fisher information eigenspectrum of very large networks and datasets behave. If the contraction of the spectrum towards λ continues to be fast with growing network and dataset sizes, the methodology presented can be tractable even for the most complex models. However, if the largest affordable K yields a λ K far from λ, it can render the methodology intractable in terms of that the approximation errors will be too large. This points to understanding what causes the contraction phase in the first place, and hence uncovering the factors that drive it. Secondly, we leave the discussion regarding which of the three estimators (or other combinations) one should use -and when -as an opportunity for future research. Thirdly, as this work has been focused on the classification task, a natural extension is to see how the framework behaves under deep learning regression [14]. Fourthly, we point at a fundamental issue with the Delta method itself. The Delta method is inevitably based on the local curvature around the parameter estimateω, hence incorporating no means about the uncertainty of the parameter estimate outside this local region. What is lost, and how much, by disregarding the broader perspective of the solution space -a space potentially within reach for sampling methods. Finally, we hope that this contribution and the released software can pave the way for not only uncertainty focused research, but for a broader range of Hessian based research topics in the deep learning domain.

Acknowledgements
Parts of this work have been done in the context of CEDAS (Center for Data Science, University of Bergen, Norway). The lead author would also like to thank Dr. BerentÅ. S. Lunde and Øyvind Lunde Rørtveit for their helpful advice on various technical issues examined in this paper.

Appendix
The cost function C(ω) can be interpreted as the negative log posterior,