Article

An Overview of Neural Network Methods for Predicting Uncertainty in Atmospheric Remote Sensing

1 Remote Sensing Technology Institute, German Aerospace Center (DLR), 82234 Oberpfaffenhofen, Germany
2 Institute of Mathematics, University of Augsburg, 86159 Augsburg, Germany
* Author to whom correspondence should be addressed.
Remote Sens. 2021, 13(24), 5061; https://doi.org/10.3390/rs13245061
Submission received: 4 November 2021 / Revised: 8 December 2021 / Accepted: 10 December 2021 / Published: 13 December 2021
(This article belongs to the Special Issue Recent Advances in Neural Network for Remote Sensing)

Abstract: In this paper, we present neural network methods for predicting uncertainty in atmospheric remote sensing. These include methods for solving the direct and the inverse problem in a Bayesian framework. In the first case, a method based on a neural network for simulating the radiative transfer model and a Bayesian approach for solving the inverse problem is proposed. In the second case, we design (i) a neural network in which the output is the convolution of the output for a noise-free input with the input noise distribution, and (ii) a Bayesian deep learning framework that predicts aleatoric input and model uncertainties. In addition, a neural network that uses assumed density filtering and interval arithmetic to compute uncertainty is employed for testing purposes. The accuracy and the precision of the methods are analyzed by considering the retrieval of cloud parameters from radiances measured by the Earth Polychromatic Imaging Camera (EPIC) onboard the Deep Space Climate Observatory (DSCOVR).


1. Introduction

In atmospheric remote sensing, the retrieval of atmospheric parameters is an inverse problem that is usually ill-posed. Due to this ill-posedness, measurement errors can lead to large errors in the retrieved quantities. It is therefore desirable to characterize the retrieved value by an estimate of uncertainty describing the range of values that could plausibly have produced the measurement [1].
The retrieval algorithms are mostly based on deterministic or stochastic optimization methods. From the first category, the method of Tikhonov regularization and iterative regularization methods deserve to be mentioned, while from the second category, the Bayesian methods are the most representative. The Bayesian framework [2,3] provides an efficient way to deal with the ill-posedness of the inverse problem and its uncertainties. In this case, the solution of the inverse problem is given by the a posteriori distribution (the conditional distribution of the retrieved quantity given the measurement), which accounts for all assumed retrieval uncertainties. Under the assumptions that (i) the a priori knowledge and the measurement uncertainty are both Gaussian distributions, and (ii) the forward model is moderately nonlinear, the a posteriori distribution is approximately Gaussian with a covariance matrix that can be analytically calculated. However, even leaving aside the validity of these assumptions, the method is not efficient for the operational processing of large data volumes, because the computations of the forward model and its Jacobian are computationally expensive.
Compared to Bayesian methods, neural networks are powerful tools for the design of efficient retrieval algorithms. Their capability to approximate any continuous function on a compact set to an arbitrary accuracy makes them well suited to approximate the input–output function represented by a radiative transfer model. An additional feature is that the derivatives of a neural network model with respect to its inputs can be analytically computed. Thus, a neural network algorithm can produce, in addition to the approximation of the radiative quantities of interest, also an estimation of their derivatives with respect to the model inputs. While the training of a neural network may require a significant amount of time, a trained neural network may deliver accurate predictions of the forward model and its Jacobian, typically in a fraction of a millisecond.
Neural networks were initially developed for the emulation of radiative transfer models [4,5,6,7,8] and subsequently for atmospheric remote sensing. The latter include:
  • Neural networks that are used to emulate the forward model and are applied in conjunction with a Bayesian approach to solve the inverse problem [9,10,11,12];
  • Neural networks that are developed to directly learn the retrieval mappings from data [13,14,15,16,17,18,19,20]; and
  • Neural networks that are designed to directly retrieve the atmospheric parameters of interest, which are then used as initial values in an optimization algorithm, e.g., the method of Tikhonov regularization [21,22].
In the first case, the total uncertainty sums up the contributions of the uncertainties in the data and in the neural network model, whereby the model uncertainty is computed from the statistics of the errors over the data set. However, the retrieval algorithm is still based on the assumption that the forward model is nearly linear, which is generally not true. In the second case, the probabilistic character of the inverse problem is neglected and uncertainty estimates are provided as mean errors computed over the data set; this is a disadvantage compared to Bayesian methods. Remarkable exceptions to the approaches listed above are the works of Aires [23,24], in which a Bayesian framework in conjunction with the Laplace approximation was used to model the retrieval errors (estimated from the error covariance matrix observed on the data set), and of Pfreundschuh et al. [25], in which a quantile regression neural network was designed. Unfortunately, the Laplace approximation is only suitable for small training sets and simple networks, while quantile regression is suitable for the retrieval of scalar quantities (it is unclear whether a reasonable approximation of the quantile contours of the joint a posteriori distribution can be obtained).
This paper is devoted to an overview of neural network methods for predicting uncertainty in atmospheric remote sensing. In addition to the method based on a neural network for simulating the radiative transfer model and a Bayesian approach for solving the inverse problem (Case 1 above), methods relying on Bayesian networks are described. These methods, in which the network activations and weights can be modeled by parametric probability distributions, are standard tools for uncertainty prediction in deep neural networks and can be applied to nonlinear retrieval problems; to the best of the authors' knowledge, they have not yet been used in atmospheric remote sensing. The paper, which is mainly pedagogical in nature, is organized as follows. In Section 2, we present the theoretical background of this study. This is then used in Section 3 to develop several neural network methods and apply them to a specific cloud parameter retrieval problem. Section 4 contains a few concluding remarks.

2. Theoretical Background

We consider a generic model $\mathbf{y} = \mathbf{F}(\mathbf{x})$, where $\mathbf{x} \in \mathbb{R}^{N_x}$ is the input vector, $\mathbf{F}$ is some deterministic function, and $\mathbf{y} \in \mathbb{R}^{N_y}$ is the output vector. For an atmosphere characterized by a set of state parameters, the signal measured by an instrument at different wavelengths can be computed by a radiative transfer model $\mathbf{R}$. Specifically, we will refer to the measurement signals as data, and split the vector of state parameters into (i) the vector of atmospheric parameters that are intended to be retrieved, and (ii) the vector of atmospheric parameters that are known with some accuracy but are not included in the retrieval (forward model parameters). In this study, we will use neural networks to model the radiative transfer function $\mathbf{R}$, as well as its inverse $\mathbf{R}^{-1}$. In this regard, and in order to simplify the notation, we will consider two cases. In the first case, referred to as the direct problem, the input $\mathbf{x}$ is the set of atmospheric parameters, the output $\mathbf{y}$ is the set of data, and the forward model $\mathbf{F}$ coincides with the radiative transfer model $\mathbf{R}$, while in the second case, referred to as the inverse problem, the situation is reversed: the input $\mathbf{x}$ includes the sets of data and forward model parameters, while the output $\mathbf{y}$ includes the set of atmospheric parameters to be retrieved (in the absence of forward model parameters, the forward model $\mathbf{F}$ reproduces the inverse of the radiative transfer model, $\mathbf{R}^{-1}$).
In machine learning, the task is to approximate $\mathbf{F}(\mathbf{x})$ by a neural network model $\mathbf{f}(\mathbf{x}, \boldsymbol{\omega})$ characterized by a set of parameters $\boldsymbol{\omega}$ [26,27]. For doing this, we consider a set of inputs $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N}$ and a corresponding set of outputs $Y = \{\mathbf{y}^{(n)}\}_{n=1}^{N}$, given by $\mathbf{y}^{(n)} = \mathbf{F}(\mathbf{x}^{(n)})$, where $N$ is the number of samples. In a regression problem, $D = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$ forms a data set—or more precisely, a training set—from which the neural network model $\mathbf{f}(\mathbf{x}, \boldsymbol{\omega})$ can be inferred. Traditional neural networks are comprised of units or nodes arranged in an input layer, an output layer, and a number of hidden layers situated between the input and output layers. Let $L+1$ be the number of layers and $N_l$ be the number of units in layer $l$, where $l = 0, \ldots, L$. The input layer corresponds to $l = 0$ and the output layer to $l = L$, so that $N_x = N_0$ and $N_y = N_L$. In feed-forward networks, the signals $y_{i,l-1}$ from units $i = 1, \ldots, N_{l-1}$ in layer $l-1$ are multiplied by a set of weights $w_{ji,l}$, $j = 1, \ldots, N_l$, $i = 1, \ldots, N_{l-1}$, and then summed and combined with a bias $b_{j,l}$, $j = 1, \ldots, N_l$. This calculation forms the pre-activation signal $u_{j,l} = \sum_{i=1}^{N_{l-1}} w_{ji,l}\, y_{i,l-1} + b_{j,l}$, which is transformed by the layer activation function $g_l$ to form the activation signal $y_{j,l}$ of unit $j = 1, \ldots, N_l$ in layer $l$. Defining the matrix of weights $\mathbf{W}_l \in \mathbb{R}^{N_l \times N_{l-1}}$ and the vector of biases $\mathbf{b}_l \in \mathbb{R}^{N_l}$ by $[\mathbf{W}_l]_{ji} = w_{ji,l}$ and $[\mathbf{b}_l]_j = b_{j,l}$, respectively, and letting $\boldsymbol{\omega} = \{\mathbf{W}_l, \mathbf{b}_l\}_{l=1}^{L}$ be the set of network parameters, the feed-forward operations can be written in matrix form as
$$\mathbf{y}_0 = \mathbf{x},$$
$$\mathbf{u}_l = \mathbf{W}_l \mathbf{y}_{l-1} + \mathbf{b}_l,$$
$$\mathbf{y}_l = g_l(\mathbf{u}_l), \quad l = 1, \ldots, L,$$
$$\mathbf{f}(\mathbf{x}, \boldsymbol{\omega}) = \mathbf{y}_L,$$
where $[\mathbf{y}_l]_i = y_{i,l}$ and $[\mathbf{u}_l]_j = u_{j,l}$. Deep learning is the process of regressing the network parameters $\boldsymbol{\omega}$ on the data $D$. The standard procedure is to compute a point estimate $\hat{\boldsymbol{\omega}}$ as the minimizer of some loss function by using the back-propagation algorithm [28]. In a stochastic framework, the loss function is usually defined as the negative log likelihood of the data set, possibly with a regularization term to penalize the network parameters. From a statistical point of view, this procedure is equivalent to maximum a posteriori (MAP) estimation when regularization is used, and to maximum likelihood estimation (MLE) when it is not.
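As an illustration of the feed-forward operations (1)–(4), the following minimal NumPy sketch evaluates $\mathbf{f}(\mathbf{x}, \boldsymbol{\omega})$ for a fully connected network. It is only a didactic example under the notation above (all names are hypothetical), not the FORTRAN tool used later in this paper.

```python
import numpy as np

def feed_forward(x, weights, biases, activations):
    """Evaluate f(x, omega) for a fully connected network.

    weights[l]  : array of shape (N_l, N_{l-1})   -- the matrices W_l
    biases[l]   : array of shape (N_l,)           -- the vectors b_l
    activations : list of element-wise functions  -- the g_l
    """
    y = x                                  # y_0 = x
    for W, b, g in zip(weights, biases, activations):
        u = W @ y + b                      # u_l = W_l y_{l-1} + b_l
        y = g(u)                           # y_l = g_l(u_l)
    return y                               # f(x, omega) = y_L

# Example: a 3-50-2 network with a ReLU hidden layer and a linear output layer
rng = np.random.default_rng(0)
relu = lambda u: np.maximum(0.0, u)
identity = lambda u: u
Ws = [rng.normal(size=(50, 3)), rng.normal(size=(2, 50))]
bs = [np.zeros(50), np.zeros(2)]
print(feed_forward(np.array([1.0, 0.5, -0.2]), Ws, bs, [relu, identity]))
```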
In this section, we review the basic theory which serves as a basis for the development of different neural network architectures. In particular, we describe (i) the methodology for computing point estimates; (ii) the different types of uncertainty; and (iii) Bayesian networks.

2.1. Point Estimates

A data model with output noise is given by
$$\mathbf{y} = \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}) + \boldsymbol{\delta}_{\mathbf{y}},$$
$$\boldsymbol{\delta}_{\mathbf{y}} \sim N(\mathbf{0}, \mathbf{C}_{\mathbf{y}\delta}),$$
where here and in the following, the notation $N(\mathbf{x}; \bar{\mathbf{x}}, \mathbf{C}_{\mathbf{x}})$, or more simply $N(\bar{\mathbf{x}}, \mathbf{C}_{\mathbf{x}})$ when no confusion arises, stands for a Gaussian distribution with mean $\bar{\mathbf{x}}$ and covariance matrix $\mathbf{C}_{\mathbf{x}}$. When the true input $\mathbf{x}$ is hidden (so that it cannot be observed) but samples from a random vector $\mathbf{z} = \mathbf{x} + \boldsymbol{\delta}_{\mathbf{x}}$ with $\boldsymbol{\delta}_{\mathbf{x}} \sim N(\mathbf{0}, \mathbf{C}_{\mathbf{x}\delta})$, i.e., $p(\mathbf{z}|\mathbf{x}) = N(\mathbf{z}; \mathbf{x}, \mathbf{C}_{\mathbf{x}\delta})$, are available, the pertinent model is the data model with input and output noise, that is:
$$\mathbf{y} = \mathbf{f}(\mathbf{z}, \boldsymbol{\omega}) + \boldsymbol{\Delta}_{\mathbf{y}},$$
$$\boldsymbol{\Delta}_{\mathbf{y}} \sim N(\mathbf{0}, \overline{\mathbf{C}}_{\mathbf{y}\delta}).$$
The error $\boldsymbol{\Delta}_{\mathbf{y}}$ sums up the contributions of the output error and of the input error propagated through the network into the output space. Specifically, when the noise process in the input space is small and the linearization:
$$\mathbf{f}(\mathbf{x}, \boldsymbol{\omega}) = \mathbf{f}(\mathbf{z}, \boldsymbol{\omega}) + \mathbf{K}_{\mathbf{x}}(\mathbf{z}, \boldsymbol{\omega})(\mathbf{x} - \mathbf{z}),$$
$$\mathbf{K}_{\mathbf{x}}(\mathbf{z}, \boldsymbol{\omega}) = \frac{\partial \mathbf{f}}{\partial \mathbf{x}}(\mathbf{z}, \boldsymbol{\omega}),$$
is assumed, we find (cf. Equations (5), (7), and (9)) $\boldsymbol{\Delta}_{\mathbf{y}} = \boldsymbol{\delta}_{\mathbf{y}} + \mathbf{K}_{\mathbf{x}}(\mathbf{z}, \boldsymbol{\omega})(\mathbf{x} - \mathbf{z})$, and further:
$$\overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \boldsymbol{\omega}) = \mathbf{C}_{\mathbf{y}\delta} + \mathbf{K}_{\mathbf{x}}(\mathbf{z}, \boldsymbol{\omega})\, \mathbf{C}_{\mathbf{x}\delta}\, \mathbf{K}_{\mathbf{x}}^T(\mathbf{z}, \boldsymbol{\omega}).$$
To design a neural network, we consider a data set $D$ associated with each data model, namely:
  • An exact data set $D = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$, where $\mathbf{y}^{(n)} = \mathbf{F}(\mathbf{x}^{(n)})$;
  • A data set with output noise $D = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$, where $\mathbf{y}^{(n)} = \mathbf{F}(\mathbf{x}^{(n)}) + \boldsymbol{\delta}_{\mathbf{y}}$; and
  • A data set with input and output noise $D = \{(\mathbf{z}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$, where $\mathbf{z}^{(n)} = \mathbf{x}^{(n)} + \boldsymbol{\delta}_{\mathbf{x}}$ and $\mathbf{y}^{(n)} = \mathbf{F}(\mathbf{x}^{(n)}) + \boldsymbol{\delta}_{\mathbf{y}}$.
In a stochastic framework, a neural network can be regarded as a probabilistic model $p(\mathbf{y}|\mathbf{z}, \boldsymbol{\omega})$; given an observable input $\mathbf{z}$ and a set of parameters $\boldsymbol{\omega}$, a neural network assigns a probability to each possible output $\mathbf{y}$. In view of Equations (7) and (8), the a priori confidence in the predictive power of the model is given by
$$p(\mathbf{y}|\mathbf{z}, \boldsymbol{\omega}) = N(\mathbf{y}; \mathbf{f}(\mathbf{z}, \boldsymbol{\omega}), \overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \boldsymbol{\omega})).$$
The process of learning from the data $D$ can be described by the posterior $p(\boldsymbol{\omega}|D) = p(\boldsymbol{\omega}|Z, Y)$, which represents the Bayes plausibility for the parameters $\boldsymbol{\omega}$ given the data $D$. This can be estimated by using Bayes' rule:
$$p(\boldsymbol{\omega}|D) = \frac{p(D|\boldsymbol{\omega})\, p(\boldsymbol{\omega})}{p(D)} \propto p(D|\boldsymbol{\omega})\, p(\boldsymbol{\omega}) \propto \exp[-E(\boldsymbol{\omega})],$$
where $p(D|\boldsymbol{\omega})$ is the likelihood or the probability of the data, $p(\boldsymbol{\omega})$ is the prior over the network parameters, $p(D) = \int p(D|\boldsymbol{\omega})\, p(\boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega}$ is the evidence, and:
$$E(\boldsymbol{\omega}) = E_D(\boldsymbol{\omega}) + E_R(\boldsymbol{\omega})$$
is the loss function. The first term $E_D(\boldsymbol{\omega})$ in the expression of the loss function $E(\boldsymbol{\omega})$ is the contribution from the likelihood $p(D|\boldsymbol{\omega})$, written as the product (cf. Equation (12)):
$$p(D|\boldsymbol{\omega}) = p(Y|Z, \boldsymbol{\omega}) = \prod_{n=1}^{N} p(\mathbf{y}^{(n)}|\mathbf{z}^{(n)}, \boldsymbol{\omega}) \propto \exp[-E_D(\boldsymbol{\omega})],$$
$$E_D(\boldsymbol{\omega}) = \frac{1}{2} \sum_{n=1}^{N} [\mathbf{y}^{(n)} - \mathbf{f}(\mathbf{z}^{(n)}, \boldsymbol{\omega})]^T [\overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}^{(n)}, \boldsymbol{\omega})]^{-1} [\mathbf{y}^{(n)} - \mathbf{f}(\mathbf{z}^{(n)}, \boldsymbol{\omega})],$$
while the second term $E_R(\boldsymbol{\omega})$ is the contribution from the prior $p(\boldsymbol{\omega})$, chosen, for example, as the Gaussian distribution:
$$p(\boldsymbol{\omega}) = N(\boldsymbol{\omega}; \mathbf{0}, \mathbf{C}_{\boldsymbol{\omega}}) \propto \exp[-E_R(\boldsymbol{\omega})],$$
$$E_R(\boldsymbol{\omega}) = \frac{1}{2}\, \boldsymbol{\omega}^T \mathbf{C}_{\boldsymbol{\omega}}^{-1} \boldsymbol{\omega}.$$
In this regard, point estimates with regularization are computed by maximizing the posterior $p(\boldsymbol{\omega}|D)$:
$$\hat{\boldsymbol{\omega}} = \boldsymbol{\omega}_{\mathrm{MAP}} = \arg\max_{\boldsymbol{\omega}} \log p(\boldsymbol{\omega}|D) = \arg\min_{\boldsymbol{\omega}} E(\boldsymbol{\omega}),$$
while point estimates without regularization are computed by maximizing the likelihood $p(D|\boldsymbol{\omega})$:
$$\hat{\boldsymbol{\omega}} = \boldsymbol{\omega}_{\mathrm{MLE}} = \arg\max_{\boldsymbol{\omega}} \log p(D|\boldsymbol{\omega}) = \arg\min_{\boldsymbol{\omega}} E_D(\boldsymbol{\omega}).$$
Some comments are in order.
  • For a data model with output noise, we have $\mathbf{z} = \mathbf{x}$, $\overline{\mathbf{C}}_{\mathbf{y}\delta} = \mathbf{C}_{\mathbf{y}\delta}$, and $D = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$ with $\mathbf{y}^{(n)} = \mathbf{F}(\mathbf{x}^{(n)}) + \boldsymbol{\delta}_{\mathbf{y}}$. Moreover, for the covariance matrix $\mathbf{C}_{\mathbf{y}\delta} = \sigma_{\mathbf{y}}^2 \mathbf{I}$, where $\mathbf{I}$ is the identity matrix, we find:
    $$E_D(\boldsymbol{\omega}) = \sum_{n=1}^{N} \frac{1}{2\sigma_{\mathbf{y}}^2} \|\mathbf{y}^{(n)} - \mathbf{f}(\mathbf{z}^{(n)}, \boldsymbol{\omega})\|^2,$$
    or more precisely:
    $$E_D(\boldsymbol{\omega}) = \sum_{n=1}^{N} \left[ \frac{1}{2\sigma_{\mathbf{y}}^2} \|\mathbf{y}^{(n)} - \mathbf{f}(\mathbf{z}^{(n)}, \boldsymbol{\omega})\|^2 + \frac{N_y}{2} \log \sigma_{\mathbf{y}}^2 \right].$$
    Assuming $\mathbf{C}_{\boldsymbol{\omega}} = \sigma_{\boldsymbol{\omega}}^2 \mathbf{I}$ and using Equation (21), we infer that the point estimate $\hat{\boldsymbol{\omega}} = \boldsymbol{\omega}_{\mathrm{MAP}}$ is the minimizer of the Tikhonov function:
    $$E(\boldsymbol{\omega}) = \frac{1}{2} \sum_{n=1}^{N} \|\mathbf{y}^{(n)} - \mathbf{f}(\mathbf{x}^{(n)}, \boldsymbol{\omega})\|^2 + \alpha \|\boldsymbol{\omega}\|^2,$$
    where $\alpha = \sigma_{\mathbf{y}}^2 / (2\sigma_{\boldsymbol{\omega}}^2)$ is the regularization parameter.
  • A model with exact data can be handled by considering the data model with output noise and letting $\sigma_{\mathbf{y}}^2 \to 0$ in the representation of the data error covariance matrix $\mathbf{C}_{\mathbf{y}\delta} = \sigma_{\mathbf{y}}^2 \mathbf{I}$. For $\sigma_{\mathbf{y}}^2 \to 0$, the relation $\mathbf{y}^{(n)} = \mathbf{F}(\mathbf{x}^{(n)}) + \boldsymbol{\delta}_{\mathbf{y}}$ yields $\mathbf{y}^{(n)} \to \mathbf{F}(\mathbf{x}^{(n)})$, while Equation (23) and the relation $\alpha = \sigma_{\mathbf{y}}^2 / (2\sigma_{\boldsymbol{\omega}}^2) \to 0$ give:
    $$\hat{\boldsymbol{\omega}} \to \boldsymbol{\omega}_{\mathrm{MLE}} = \arg\min_{\boldsymbol{\omega}} E(\boldsymbol{\omega}), \quad E(\boldsymbol{\omega}) = \frac{1}{2} \sum_{n=1}^{N} \|\mathbf{y}^{(n)} - \mathbf{f}(\mathbf{x}^{(n)}, \boldsymbol{\omega})\|^2.$$
    Thus, when learning a neural network with exact data, the maximum likelihood estimate minimizes the sum of squared errors. Note that in this case, regularization is not strictly required, because the output data are exact.
  • For a data model with input and output noise, the computation of the estimate $\hat{\boldsymbol{\omega}}$ is not a trivial task, because the covariance matrix $\overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \boldsymbol{\omega})$, which enters the expression of $E_D(\boldsymbol{\omega})$, depends on $\boldsymbol{\omega}$. Moreover, in Equation (11), $\overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \boldsymbol{\omega})$ corresponds to a linearization of the neural network function under the assumption that the noise process in the input space is small. This problem can be solved by implicitly learning the covariance matrix $\overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \boldsymbol{\omega})$ from the loss function [29]. Specifically, we assume that $\overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \boldsymbol{\omega})$ is a diagonal matrix with entries $\sigma_j^2(\mathbf{z}, \boldsymbol{\omega})$, that is:
    $$\overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \boldsymbol{\omega}) = \mathrm{diag}[\sigma_j^2(\mathbf{z}, \boldsymbol{\omega})]_{j=1}^{N_y},$$
    implying:
    $$-\log p(\mathbf{y}|\mathbf{z}, \boldsymbol{\omega}) \propto \sum_{j=1}^{N_y} \left\{ \frac{1}{2\sigma_j^2(\mathbf{z}, \boldsymbol{\omega})} [y_j - \mu_j(\mathbf{z}, \boldsymbol{\omega})]^2 + \frac{1}{2} \log \sigma_j^2(\mathbf{z}, \boldsymbol{\omega}) \right\}.$$
    Here, we identified $\boldsymbol{\mu} \equiv \mathbf{f}$, and set $\mu_j = [\boldsymbol{\mu}]_j$ and $y_j = [\mathbf{y}]_j$. To learn the variances $\sigma_j^2(\mathbf{z}, \boldsymbol{\omega})$ from the loss function, we use a single network with input $\mathbf{z}$ and output $[\mu_j(\mathbf{z}, \boldsymbol{\omega}), \sigma_j^2(\mathbf{z}, \boldsymbol{\omega})] \in \mathbb{R}^{2N_y}$; thus, in the output layer, $N_y$ units are used to predict $\mu_j$ and $N_y$ units to predict $\sigma_j^2$. In practice, to increase numerical stability, we train the network to predict the log variance $\rho_j = \log \sigma_j^2$, in which case the likelihood loss function is:
    $$E_D(\boldsymbol{\omega}) = \sum_{n=1}^{N} E_D^{(n)}(\boldsymbol{\omega}),$$
    with:
    $$E_D^{(n)}(\boldsymbol{\omega}) = \frac{1}{2} \sum_{j=1}^{N_y} \left\{ \exp[-\rho_j(\mathbf{z}^{(n)}, \boldsymbol{\omega})]\, [y_j^{(n)} - \mu_j(\mathbf{z}^{(n)}, \boldsymbol{\omega})]^2 + \rho_j(\mathbf{z}^{(n)}, \boldsymbol{\omega}) \right\}.$$
    From Equation (25), it is apparent that the model is discouraged from predicting high uncertainty for all points through the $\log \sigma_j^2$ term, but also from predicting low uncertainty for points with high residual error, since a small $\sigma_j^2$ increases the contribution of the residual. On the other hand, it should also be noted that a basic assumption of the model is that the covariance matrix $\overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \boldsymbol{\omega})$ is diagonal, which unfortunately is contradicted by Equation (11). A minimal sketch of this loss function is given after this list.
  • In order to generate a data set with input and output noise, that is, $D = \{(\mathbf{z}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$, where $\mathbf{z}^{(n)} = \mathbf{x}^{(n)} + \boldsymbol{\delta}_{\mathbf{x}}$, $\boldsymbol{\delta}_{\mathbf{x}} \sim N(\mathbf{0}, \mathbf{C}_{\mathbf{x}\delta})$, and $\mathbf{y}^{(n)} = \mathbf{F}(\mathbf{x}^{(n)}) + \boldsymbol{\delta}_{\mathbf{y}}$, we used the jitter approach. According to this approach, at each forward pass through the network, a new random noise realization $\boldsymbol{\delta}_{\mathbf{x}}$ is added to the original input vector $\mathbf{x}^{(n)}$; in other words, each time a training sample is passed through the network, its inputs are perturbed differently. By this approach, which is a simple form of data augmentation, the size of the stored data set is kept small (in fact, the stored data set is $D = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$ with $\mathbf{y}^{(n)} = \mathbf{F}(\mathbf{x}^{(n)}) + \boldsymbol{\delta}_{\mathbf{y}}$).
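As mentioned in the third item of the list above, the heteroscedastic likelihood loss of Equation (27) can be implemented directly once the network outputs both the means $\mu_j$ and the log variances $\rho_j$. The following NumPy sketch is a minimal illustration under these assumptions (hypothetical names, not the authors' FORTRAN implementation):

```python
import numpy as np

def heteroscedastic_loss(y_true, mu, rho):
    """E_D^(n) of Equation (27) summed over a batch.

    y_true : (N, N_y) targets y^(n)
    mu     : (N, N_y) predicted means mu_j(z^(n), omega)
    rho    : (N, N_y) predicted log variances rho_j = log sigma_j^2
    """
    residual = (y_true - mu) ** 2
    # exp(-rho) * residual weights the errors by the predicted precision,
    # while the +rho term prevents the network from predicting arbitrarily
    # large variances everywhere.
    per_sample = 0.5 * np.sum(np.exp(-rho) * residual + rho, axis=1)
    return np.sum(per_sample)
```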

2.2. Uncertainties

In our analysis, we are interested in estimating the uncertainty associated with the underlying processes. The quantity which exactly quantifies the model's uncertainty is the predictive distribution of an unknown output $\mathbf{y}$ given an observable input $\mathbf{z}$, defined by
$$p(\mathbf{y}|\mathbf{z}, D) = \int p(\mathbf{y}|\mathbf{z}, \boldsymbol{\omega})\, p(\boldsymbol{\omega}|D)\, \mathrm{d}\boldsymbol{\omega}.$$
If $p(\mathbf{y}|\mathbf{z}, D)$ is known, the first two moments of the output $\mathbf{y}$ can be computed as
$$E(\mathbf{y}) = \int \mathbf{y}\, p(\mathbf{y}|\mathbf{z}, D)\, \mathrm{d}\mathbf{y},$$
$$E(\mathbf{y}\mathbf{y}^T) = \int \mathbf{y}\mathbf{y}^T p(\mathbf{y}|\mathbf{z}, D)\, \mathrm{d}\mathbf{y},$$
and the covariance matrix as
$$\mathrm{Cov}(\mathbf{y}) = E(\mathbf{y}\mathbf{y}^T) - E(\mathbf{y})\, E(\mathbf{y})^T.$$
From Equation (28), we see that the predictive distribution $p(\mathbf{y}|\mathbf{z}, D)$ can be computed if the Bayesian posterior $p(\boldsymbol{\omega}|D)$ is known. Unfortunately, computing the distribution $p(\boldsymbol{\omega}|D)$ by means of Equation (13) is usually an intractable problem, because computing the evidence $p(D) = \int p(D|\boldsymbol{\omega})\, p(\boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega}$ is not a trivial task. To address this problem, either:
  • The Laplace approximation, which yields an approximate representation for the posterior $p(\boldsymbol{\omega}|D)$; or
  • A variational inference approach, which learns a variational distribution $q_{\theta}(\boldsymbol{\omega})$ to approximate the posterior $p(\boldsymbol{\omega}|D)$,
can be used. In the first case, we are dealing with deterministic (point estimate) neural networks, in which a single realization of the network parameters $\boldsymbol{\omega}$ is learned, while in the second case, we are dealing with stochastic neural networks, in which a distribution over the network parameters $\boldsymbol{\omega}$ is learned.
The Laplace approximation is of theoretical interest because it provides explicit representations for the different types of uncertainty. In Appendix A it is shown that under the Laplace approximation, the predictive distribution is given by [23,24,30]
$$p(\mathbf{y}|\mathbf{z}, D) \propto \exp\left\{ -\frac{1}{2} [\mathbf{y} - \mathbf{f}(\mathbf{z}, \hat{\boldsymbol{\omega}})]^T \mathbf{C}_{\mathbf{y}}^{-1}(\mathbf{z}, \hat{\boldsymbol{\omega}})\, [\mathbf{y} - \mathbf{f}(\mathbf{z}, \hat{\boldsymbol{\omega}})] \right\},$$
$$\mathbf{C}_{\mathbf{y}}(\mathbf{z}, \hat{\boldsymbol{\omega}}) = \overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \hat{\boldsymbol{\omega}}) + \mathbf{C}_{\mathbf{y}e}(\mathbf{z}, \hat{\boldsymbol{\omega}}),$$
$$\mathbf{C}_{\mathbf{y}e}(\mathbf{z}, \hat{\boldsymbol{\omega}}) = \mathbf{K}_{\boldsymbol{\omega}}(\mathbf{z}, \hat{\boldsymbol{\omega}})\, \mathbf{H}^{-1}(\hat{\boldsymbol{\omega}})\, \mathbf{K}_{\boldsymbol{\omega}}^T(\mathbf{z}, \hat{\boldsymbol{\omega}}),$$
where:
$$\mathbf{K}_{\boldsymbol{\omega}}(\mathbf{z}, \boldsymbol{\omega}) = \frac{\partial \mathbf{f}}{\partial \boldsymbol{\omega}}(\mathbf{z}, \boldsymbol{\omega})$$
is the Jacobian of $\mathbf{f}$ with respect to $\boldsymbol{\omega}$ and:
$$[\mathbf{H}(\hat{\boldsymbol{\omega}})]_{ij} = \frac{\partial^2 E}{\partial \omega_i\, \partial \omega_j}(\hat{\boldsymbol{\omega}})$$
is the Hessian matrix of the loss function. Equation (32) provides a Gaussian approximation to the predictive distribution $p(\mathbf{y}|\mathbf{z}, D)$ with mean $\mathbf{f}(\mathbf{z}, \hat{\boldsymbol{\omega}})$ and covariance matrix $\mathbf{C}_{\mathbf{y}}(\mathbf{z}, \hat{\boldsymbol{\omega}})$. From Equation (33), we see that the covariance matrix $\mathbf{C}_{\mathbf{y}}(\mathbf{z}, \hat{\boldsymbol{\omega}})$ has two components.
  • The first component $\overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \hat{\boldsymbol{\omega}})$ is the covariance matrix of the distribution over the error in the output $\mathbf{y}$. This term, which is input dependent, describes the so-called aleatoric heteroscedastic uncertainty measured by $p(\mathbf{y}|\mathbf{z}, \boldsymbol{\omega})$. For a data model with output noise, we have (cf. Equation (11)) $\overline{\mathbf{C}}_{\mathbf{y}\delta} = \mathbf{C}_{\mathbf{y}\delta}$, and this term, which is input independent, describes the so-called aleatoric homoscedastic uncertainty measured by $p(\mathbf{y}|\mathbf{x}, \boldsymbol{\omega})$.
  • The second component $\mathbf{C}_{\mathbf{y}e}(\mathbf{z}, \hat{\boldsymbol{\omega}})$ reflects the uncertainty induced by the weights $\boldsymbol{\omega}$, also called epistemic or model uncertainty; it refers to the fact that we do not know the model that best explains the given data, which for neural networks means not knowing the best values of the weights in all trainable layers. This uncertainty is often referred to as reducible, because it can in principle be reduced by acquiring more data. The sources of this uncertainty, which is measured by $p(\boldsymbol{\omega}|D)$, are, for example: (i) non-optimal hyperparameters (the number of hidden layers, the number of units per layer, the type of activation function); (ii) non-optimal training parameters (the minimum learning rate at which the training is stopped, the learning rate decay factor, the batch size used for mini-batch gradient descent); and (iii) a non-optimal optimization algorithm (ADAGRAD, ADADELTA, ADAM, ADAMAX, NADAM).
Some comments can be made here.
  • The heteroscedastic covariance matrix $\overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \boldsymbol{\omega}) = \mathrm{diag}[\sigma_j^2(\mathbf{z}, \boldsymbol{\omega})]_{j=1}^{N_y}$ can be learned from the data by minimizing the loss functions (26) and (27).
  • The computation of the epistemic covariance matrix $\mathbf{C}_{\mathbf{y}e}(\mathbf{z}, \hat{\boldsymbol{\omega}})$ from Equation (34) requires the knowledge of the Hessian matrix $\mathbf{H}(\hat{\boldsymbol{\omega}})$. In general, this computation is practically infeasible, because the matrix $\mathbf{H}$ is very large and therefore very difficult to compute. However, the problem can be evaded by using the diagonal Hessian approximation
    $$[\mathbf{H}(\hat{\boldsymbol{\omega}})]_{ij} = \delta_{ij}\, \frac{\partial^2 E}{\partial \omega_i^2}(\hat{\boldsymbol{\omega}}),$$
    where $\delta_{ij}$ is the Kronecker delta. The diagonal matrix elements can be computed very efficiently by using a procedure similar to the back-propagation algorithm used for computing the first derivatives [31].
  • The covariance matrix $\mathbf{C}_{\mathbf{y}}(\mathbf{z}, \hat{\boldsymbol{\omega}})$ can be approximated by the conditional average covariance matrix $E(\mathbf{C}_{\mathbf{y}}|D)$ of all network errors $\boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{f}(\mathbf{z}, \hat{\boldsymbol{\omega}})$ over the data set $D$, meaning that:
    $$E(\mathbf{C}_{\mathbf{y}}|D) = \frac{1}{N} \sum_{n=1}^{N} [\boldsymbol{\varepsilon}^{(n)} - E(\boldsymbol{\varepsilon})][\boldsymbol{\varepsilon}^{(n)} - E(\boldsymbol{\varepsilon})]^T,$$
    where $\boldsymbol{\varepsilon}^{(n)} = \mathbf{y}^{(n)} - \mathbf{f}(\mathbf{z}^{(n)}, \hat{\boldsymbol{\omega}})$ and $E(\boldsymbol{\varepsilon}) = (1/N) \sum_{n=1}^{N} \boldsymbol{\varepsilon}^{(n)}$. Note that this is a very rough approximation, because in contrast to $\mathbf{C}_{\mathbf{y}}(\mathbf{z}, \hat{\boldsymbol{\omega}})$, $E(\mathbf{C}_{\mathbf{y}}|D)$ does not depend on $\mathbf{z}$ (a short sketch of this computation follows the list).
  • For exact data, we have $\mathbf{C}_{\mathbf{y}}(\mathbf{x}, \hat{\boldsymbol{\omega}}) = \mathbf{C}_{\mathbf{y}e}(\mathbf{x}, \hat{\boldsymbol{\omega}})$. Thus, when learning a neural network with exact data, the predictive distribution $p(\mathbf{y}|\mathbf{x}, D)$ is Gaussian with mean $\mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})$ and (epistemic) covariance matrix $\mathbf{C}_{\mathbf{y}e}(\mathbf{x}, \hat{\boldsymbol{\omega}})$.
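As a complement to the third comment above, the conditional average covariance matrix $E(\mathbf{C}_{\mathbf{y}}|D)$ of the network errors reduces to a few lines of NumPy (a minimal sketch under the stated assumptions; all names are hypothetical):

```python
import numpy as np

def average_error_covariance(y_true, y_pred):
    """Sample covariance E(C_y | D) of the network errors over a data set.

    y_true, y_pred : arrays of shape (N, N_y)
    """
    eps = y_true - y_pred              # errors eps^(n) = y^(n) - f(z^(n), omega_hat)
    eps_mean = eps.mean(axis=0)        # E(eps)
    centered = eps - eps_mean
    return centered.T @ centered / eps.shape[0]
```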

2.3. Bayesian Networks

Stochastic neural networks are a type of neural network built by introducing stochastic components (activations or weights) into the network. A Bayesian neural network can be regarded as a stochastic neural network trained by using Bayesian inference [32]. This type of neural network provides a natural approach to quantifying uncertainty in deep learning and allows one to distinguish between epistemic and aleatoric uncertainties.
Bayesian neural networks equipped with variational inference bypass the computation of the evidence $p(D) = \int p(D|\boldsymbol{\omega})\, p(\boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega}$, which determines the Bayesian posterior $p(\boldsymbol{\omega}|D)$ via Equation (13), by learning a variational distribution $q_{\theta}(\boldsymbol{\omega})$ that approximates $p(\boldsymbol{\omega}|D)$, that is:
$$q_{\theta}(\boldsymbol{\omega}) \approx p(\boldsymbol{\omega}|D),$$
where $\theta$ are some variational parameters. These parameters are computed by minimizing the Kullback–Leibler (KL) divergence:
$$\mathrm{KL}(q_{\theta}(\boldsymbol{\omega})\,\|\, p(\boldsymbol{\omega}|D)) = \int q_{\theta}(\boldsymbol{\omega}) \log \frac{q_{\theta}(\boldsymbol{\omega})}{p(\boldsymbol{\omega}|D)}\, \mathrm{d}\boldsymbol{\omega}$$
with respect to $\theta$. In fact, the KL divergence is a measure of the similarity between the approximate distribution $q_{\theta}(\boldsymbol{\omega})$ and the posterior distribution $p(\boldsymbol{\omega}|D)$, and minimizing the KL divergence is equivalent to minimizing the variational free energy defined by
$$F(\theta, D) = -\int q_{\theta}(\boldsymbol{\omega}) \log p(D|\boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega} + \mathrm{KL}(q_{\theta}(\boldsymbol{\omega})\,\|\, p(\boldsymbol{\omega})) = \int q_{\theta}(\boldsymbol{\omega}) \left[ \log q_{\theta}(\boldsymbol{\omega}) - \log p(\boldsymbol{\omega}) - \log p(D|\boldsymbol{\omega}) \right] \mathrm{d}\boldsymbol{\omega}.$$
Considering the data model with output noise, assuming $\mathbf{C}_{\mathbf{y}\delta} = \sigma_{\mathbf{y}}^2 \mathbf{I}$, and replacing $p(\boldsymbol{\omega}|D)$ by $q_{\theta}(\boldsymbol{\omega})$ in Equation (28), we obtain the approximate predictive distribution:
$$p(\mathbf{y}|\mathbf{x}, D) \approx q_{\theta}(\mathbf{y}|\mathbf{x}) = \int p(\mathbf{y}|\mathbf{x}, \boldsymbol{\omega})\, q_{\theta}(\boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega},$$
which can be approximated at test time by
$$q_{\theta}(\mathbf{y}|\mathbf{x}) \approx \frac{1}{T} \sum_{t=1}^{T} p(\mathbf{y}|\mathbf{x}, \boldsymbol{\omega}_t),$$
where $\boldsymbol{\omega}_t$ is sampled from the distribution $q_{\theta}(\boldsymbol{\omega})$. As a result, for $p(\mathbf{y}|\mathbf{x}, \boldsymbol{\omega}) = N(\mathbf{y}; \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}), \mathbf{C}_{\mathbf{y}\delta} = \sigma_{\mathbf{y}}^2 \mathbf{I})$, the predictive mean and the covariance matrix of the output $\mathbf{y}$ given the input $\mathbf{x}$ can be approximated, respectively, by (Appendix B):
$$E(\mathbf{y}) \approx \frac{1}{T} \sum_{t=1}^{T} \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t),$$
$$\mathrm{Cov}(\mathbf{y}) \approx \sigma_{\mathbf{y}}^2 \mathbf{I} + \frac{1}{T} \sum_{t=1}^{T} \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t)\, \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t)^T - E(\mathbf{y})\, E(\mathbf{y})^T.$$
The first term $\sigma_{\mathbf{y}}^2 \mathbf{I}$ in Equation (43) corresponds to the homoscedastic uncertainty (the amount of noise in the data), while the remaining terms correspond to the epistemic uncertainty (how uncertain the model is in its prediction). The predictive mean (42) is known as model averaging and is equivalent to performing $T$ stochastic forward passes through the network and averaging the results. Note that for exact data, i.e., $\sigma_{\mathbf{y}}^2 \to 0$, the covariance matrix simplifies to:
$$\mathrm{Cov}(\mathbf{y}) \approx \frac{1}{T} \sum_{t=1}^{T} \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t)\, \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t)^T - E(\mathbf{y})\, E(\mathbf{y})^T$$
and describes the epistemic uncertainty.
In this section, we present the most relevant algorithm used for Bayesian inference (Bayes-by-backprop), as well as two approximate Bayesian methods. For this purpose, we consider a data model with output noise and assume $\mathbf{C}_{\mathbf{y}\delta} = \sigma_{\mathbf{y}}^2 \mathbf{I}$; in this case, $p(D|\boldsymbol{\omega})$ is given by Equation (15) in conjunction with Equation (21) or (22).

2.3.1. Bayes by Backpropagation

Bayes-by-backprop is a practical implementation of variational inference combined with a reparameterization trick [33]. The idea of the reparameterization trick is to introduce a random variable $\boldsymbol{\epsilon}$ and to determine $\boldsymbol{\omega}$ by a deterministic transformation $t(\boldsymbol{\epsilon}, \theta)$, such that $\boldsymbol{\omega} = t(\boldsymbol{\epsilon}, \theta)$ follows the distribution $q_{\theta}(\boldsymbol{\omega})$. If the variational posterior $q_{\theta}(\boldsymbol{\omega})$ is a Gaussian distribution with a diagonal covariance matrix, i.e., $q_{\theta}(\boldsymbol{\omega}) = N(\boldsymbol{\omega}; \boldsymbol{\mu}_{\boldsymbol{\omega}}, \mathrm{diag}[\sigma_{\omega j}^2]_{j=1}^{W})$, where $\sigma_{\omega j} = [\boldsymbol{\sigma}_{\boldsymbol{\omega}}]_j$ and $W = \dim(\boldsymbol{\omega}) = \dim(\boldsymbol{\mu}_{\boldsymbol{\omega}}) = \dim(\boldsymbol{\sigma}_{\boldsymbol{\omega}})$, a sample of the weights $\boldsymbol{\omega}$ can be obtained by sampling a unit Gaussian $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \mathbf{I})$, scaling it by the standard deviation $\boldsymbol{\sigma}_{\boldsymbol{\omega}}$, and shifting it by the mean $\boldsymbol{\mu}_{\boldsymbol{\omega}}$. To guarantee that $\boldsymbol{\sigma}_{\boldsymbol{\omega}}$ is always non-negative, the standard deviation can be parametrized pointwise as $\boldsymbol{\sigma}_{\boldsymbol{\omega}} = \boldsymbol{\sigma}_{\boldsymbol{\omega}}(\boldsymbol{\rho}_{\boldsymbol{\omega}}) = \log(1 + \exp(\boldsymbol{\rho}_{\boldsymbol{\omega}}))$ or as $\boldsymbol{\sigma}_{\boldsymbol{\omega}} = \boldsymbol{\sigma}_{\boldsymbol{\omega}}(\boldsymbol{\rho}_{\boldsymbol{\omega}}) = \exp(\boldsymbol{\rho}_{\boldsymbol{\omega}}/2)$ (i.e., $\rho_{\omega j} = \log \sigma_{\omega j}^2$). Thus, the variational posterior parameters are $\theta = (\boldsymbol{\mu}_{\boldsymbol{\omega}}, \boldsymbol{\rho}_{\boldsymbol{\omega}})$. Using Monte Carlo sampling with one sample, we compute the variational free energy (39) by using the relations:
$$\boldsymbol{\epsilon} \sim N(\mathbf{0}, \mathbf{I}), \quad \boldsymbol{\omega} = \boldsymbol{\mu}_{\boldsymbol{\omega}} + \boldsymbol{\sigma}_{\boldsymbol{\omega}}(\boldsymbol{\rho}_{\boldsymbol{\omega}}) \circ \boldsymbol{\epsilon}, \quad F(\theta, D) \approx \log q_{\theta}(\boldsymbol{\omega}) - \log p(\boldsymbol{\omega}) - \log p(D|\boldsymbol{\omega}),$$
where $\circ$ denotes point-wise multiplication. In Ref. [33], the prior over the weights $p(\boldsymbol{\omega})$ is chosen as a mixture of two Gaussians with zero means but differing variances, meaning that:
$$p(\boldsymbol{\omega}) = \beta N(\boldsymbol{\omega}; \mathbf{0}, \sigma_1^2 \mathbf{I}) + (1 - \beta) N(\boldsymbol{\omega}; \mathbf{0}, \sigma_2^2 \mathbf{I}),$$
while in [34], it is shown that, for $q_{\theta}(\boldsymbol{\omega}) = N(\boldsymbol{\omega}; \boldsymbol{\mu}_{\boldsymbol{\omega}}, \mathrm{diag}[\sigma_{\omega j}^2])$ and $p(\boldsymbol{\omega}) = N(\boldsymbol{\omega}; \mathbf{0}, \mathbf{I})$, the KL divergence $\mathrm{KL}(q_{\theta}(\boldsymbol{\omega})\,\|\, p(\boldsymbol{\omega}))$ can be analytically computed; the result is:
$$\mathrm{KL}(q_{\theta}(\boldsymbol{\omega})\,\|\, p(\boldsymbol{\omega})) = \int q_{\theta}(\boldsymbol{\omega}) \left[ \log q_{\theta}(\boldsymbol{\omega}) - \log p(\boldsymbol{\omega}) \right] \mathrm{d}\boldsymbol{\omega} = \frac{1}{2} \sum_{j=1}^{W} \left[ \mu_{\omega j}^2 + \sigma_{\omega j}^2 - \log(\sigma_{\omega j}^2) - 1 \right],$$
where $\mu_{\omega j} = [\boldsymbol{\mu}_{\boldsymbol{\omega}}]_j$ and $\sigma_{\omega j} = [\boldsymbol{\sigma}_{\boldsymbol{\omega}}]_j$. Thus:
$$F(\theta, D) = -\log p(D|\boldsymbol{\omega}) + \frac{1}{2} \sum_{j=1}^{W} \left[ \mu_{\omega j}^2 + \sigma_{\omega j}^2 - \log(\sigma_{\omega j}^2) - 1 \right].$$
After the training stage, i.e., after the variational posterior parameters $\theta = (\boldsymbol{\mu}_{\boldsymbol{\omega}}, \boldsymbol{\rho}_{\boldsymbol{\omega}})$ have been learned, we consider the set of samples $\{\boldsymbol{\omega}_t\}_{t=1}^{T}$, where $\boldsymbol{\omega}_t = \boldsymbol{\mu}_{\boldsymbol{\omega}} + \boldsymbol{\sigma}_{\boldsymbol{\omega}}(\boldsymbol{\rho}_{\boldsymbol{\omega}}) \circ \boldsymbol{\epsilon}_t$ and $\boldsymbol{\epsilon}_t$ is sampled from the Gaussian distribution $N(\mathbf{0}, \mathbf{I})$, and compute the predictive mean and covariance matrix according to Equations (42) and (43).
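The reparameterization step and the closed-form KL term entering Equation (47) can be illustrated as follows. The sketch below is a minimal NumPy version of one Monte Carlo evaluation of the free energy for a data model with output noise; the negative log likelihood, the gradient-based update, and all names are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weights(mu_w, rho_w):
    """Reparameterization trick: omega = mu_w + sigma_w(rho_w) * eps."""
    sigma_w = np.exp(0.5 * rho_w)          # rho_w = log sigma_w^2
    eps = rng.standard_normal(mu_w.shape)
    return mu_w + sigma_w * eps

def kl_to_standard_normal(mu_w, rho_w):
    """Closed-form KL(q_theta || N(0, I)) of Equation (46)."""
    sigma2 = np.exp(rho_w)
    return 0.5 * np.sum(mu_w**2 + sigma2 - np.log(sigma2) - 1.0)

def free_energy(mu_w, rho_w, neg_log_likelihood):
    """One-sample Monte Carlo estimate of F(theta, D), Equation (47)."""
    omega = sample_weights(mu_w, rho_w)
    return neg_log_likelihood(omega) + kl_to_standard_normal(mu_w, rho_w)
```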

2.3.2. Dropout

Dropout, which was initially proposed as a regularization technique during training [35,36], is equivalent to a Bayesian approximation. This result was proved in [37] by showing that the variational free energy $F(\theta, D)$ has the standard-form representation of the dropout loss function (the sum of a squared loss function and an $L_2$ regularization term). A simplified proof of this assertion is given in Appendix C. The term "dropout" refers to removing a unit along with all its connections; the choice of the dropped unit is random. In the case of dropout, the feed-forward operations of a standard neural network, Equations (2) and (3), become:
$$\mathbf{u}_l = \mathbf{W}_l \mathbf{Z}_{l-1} \mathbf{y}_{l-1} + \mathbf{b}_l,$$
$$\mathbf{y}_l = g_l(\mathbf{u}_l),$$
where:
$$\mathbf{Z}_{l-1} = \mathrm{diag}[z_{k,l-1}]_{k=1}^{N_{l-1}},$$
$$z_{k,l-1} \sim \mathrm{Bernoulli}(p).$$
Essentially, the output $y_{k,l-1}$ of unit $k$ in layer $l-1$ is multiplied by the binary variable $z_{k,l-1}$ to create the new output $z_{k,l-1}\, y_{k,l-1}$. The binary variable $z_{k,l-1}$ takes the value 1 with probability $p$ and the value 0 with probability $1-p$; thus, if $z_{k,l-1} = 0$, the new output is zero and unit $k$ is dropped as an input to the next layer $l$. The same values of the binary variables are used in the backward pass when propagating the derivatives of the loss function. At test time, the weights $\mathbf{W}_l$ are scaled as $p\mathbf{W}_l$. Thus, we retain units with probability $p$ at training time and multiply the weights by $p$ at test time. Alternatively, in order to maintain constant expected outputs at training time, we can scale the weights by $1/p$ during training and leave them unmodified at test time. The model parameters $\boldsymbol{\omega} = \{\mathbf{W}_l, \mathbf{b}_l\}_{l=1}^{L}$ are obtained by minimizing the loss function (21), possibly with an $L_2$ regularization term.
Uncertainty estimates can be obtained from a dropout neural network [37]. For the set of samples $\{\boldsymbol{\omega}_t\}_{t=1}^{T}$, where $\boldsymbol{\omega}_t$ corresponds to a realization of the Bernoulli variables $\mathbf{Z}_{l-1}^{(t)}$ and $\mathbf{W}_l^{(t)} = \mathbf{W}_l \mathbf{Z}_{l-1}^{(t)}$ for all $l = 1, \ldots, L$, the predictive mean and covariance matrix are computed by means of Equations (42) and (43).
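A Monte Carlo dropout prediction according to Equations (42) and (43) can be sketched as follows (NumPy; the dropout network is represented by a hypothetical forward function that accepts a realization of the Bernoulli masks):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_moments(forward_with_masks, x, layer_sizes, p, T, sigma_y2=0.0):
    """Predictive mean and covariance, Equations (42) and (43).

    forward_with_masks(x, masks) : network output for one dropout realization
    layer_sizes                  : unit counts N_{l-1} feeding each layer l
    p                            : retention probability of Bernoulli(p)
    """
    outputs = []
    for _ in range(T):
        masks = [rng.binomial(1, p, size=n) for n in layer_sizes]
        outputs.append(forward_with_masks(x, masks))
    F = np.stack(outputs)                       # shape (T, N_y)
    mean = F.mean(axis=0)
    second_moment = F.T @ F / T
    cov = sigma_y2 * np.eye(F.shape[1]) + second_moment - np.outer(mean, mean)
    return mean, cov
```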

2.3.3. Batch Normalization

Ioffe and Szegedy [38] introduced batch normalization as a technique for training deep neural networks that normalizes (standardizes) the inputs to a layer for each mini-batch. This allows for higher learning rates, regularizes the model, and reduces the number of training epochs. Moreover, Teye et al. [39] showed that a batch normalized network can be regarded as an approximate Bayesian model, and it can thus be used for modeling uncertainty.
Let us split the data set $D$ into $N_b$ mini-batches, and let the $n$th mini-batch $B^{(n)}$ contain $M$ sample pairs $(\mathbf{x}^{(n,m)}, \mathbf{y}^{(n,m)})$, meaning that:
$$D = \bigcup_{n=1}^{N_b} B^{(n)}, \quad B^{(n)} = \{(\mathbf{x}^{(n,m)}, \mathbf{y}^{(n,m)})\}_{m=1}^{M}.$$
In batch normalization, the input $u_{j,l}^{(n,m)} = \sum_{i=1}^{N_{l-1}} w_{ji,l}\, y_{i,l-1}^{(n,m)}$ of the activation function $g_l$ corresponding to unit $j$, layer $l$, sample $m$, and mini-batch $n$ is first normalized:
$$\tilde{u}_{j,l}^{(n,m)} = \frac{u_{j,l}^{(n,m)} - \mu_{j,l}^{(n)}}{\sqrt{v_{j,l}^{(n)} + \varepsilon}},$$
and then scaled and shifted:
$$\hat{u}_{j,l}^{(n,m)} = \alpha_{j,l}\, \tilde{u}_{j,l}^{(n,m)} + \beta_{j,l},$$
where:
$$\mu_{j,l}^{(n)} = \frac{1}{M} \sum_{m=1}^{M} u_{j,l}^{(n,m)} \quad \text{and} \quad v_{j,l}^{(n)} = \frac{1}{M} \sum_{m=1}^{M} \left( u_{j,l}^{(n,m)} - \mu_{j,l}^{(n)} \right)^2$$
are the mean and variance of the activation inputs over the $M$ samples (in unit $j$, layer $l$, and mini-batch $n$), respectively, $\alpha_{j,l}$ and $\beta_{j,l}$ are learnable model parameters, and $\varepsilon$ is a small number added to the mini-batch variance to prevent division by zero. By normalization (or whitening), $\tilde{u}_{j,l}^{(n,m)}$ becomes a random variable with zero mean and unit variance, while by scaling and shifting we guarantee that the transformation (53) can represent the identity transform. In a stochastic framework, we interpret the mean $\mu_{j,l}^{(n)}$ and variance $v_{j,l}^{(n)}$, corresponding to the $n$th mini-batch $B^{(n)}$, as realizations of the random variables $\mu_{j,l}$ and $v_{j,l}$ corresponding to the data set $D$. The model parameters include the learnable parameters:
$$\theta = \{ w_{ji,l}, \alpha_{j,l}, \beta_{j,l}\ |\ j = 1, \ldots, N_l,\ i = 1, \ldots, N_{l-1},\ l = 1, \ldots, L \},$$
and the stochastic parameters:
$$\boldsymbol{\omega} = (\boldsymbol{\mu}, \mathbf{v}) = \{ (\mu_{j,l}, v_{j,l})\ |\ j = 1, \ldots, N_l,\ l = 1, \ldots, L \},$$
where, for the mini-batch $B^{(n)}$:
$$\boldsymbol{\omega}^{(n)} = (\boldsymbol{\mu}^{(n)}, \mathbf{v}^{(n)}) = \{ (\mu_{j,l}^{(n)}, v_{j,l}^{(n)})\ |\ j = 1, \ldots, N_l,\ l = 1, \ldots, L \}$$
is a realization of $\boldsymbol{\omega} = (\boldsymbol{\mu}, \mathbf{v})$. Optimizing over mini-batches of size $M$ instead of the full training set, the objective function for the mini-batch $B^{(n)}$ becomes [39]:
$$F^{(n)}(\theta) = \frac{1}{2M} \sum_{m=1}^{M} \| \mathbf{y}^{(n,m)} - \mathbf{f}_{\boldsymbol{\omega}^{(n)}}(\mathbf{x}^{(n,m)}, \theta) \|^2 + \Omega(\theta),$$
where $\Omega(\theta) = \alpha \sum_{l=1}^{L} \| \mathbf{W}_l \|_2^2$ with $[\mathbf{W}_l]_{ji} = w_{ji,l}$ is the regularization term, and the notation $\mathbf{f}_{\boldsymbol{\omega}^{(n)}}(\mathbf{x}, \theta)$ indicates that the mean and variance $\boldsymbol{\omega}^{(n)} = (\boldsymbol{\mu}^{(n)}, \mathbf{v}^{(n)})$ are used in the normalization step (52). At the end of the training stage, we obtain:
  • The maximum a posteriori parameters $\hat{\theta} = \theta_{\mathrm{MAP}}$;
  • The mean and variance realizations $\{ \boldsymbol{\omega}^{(n)} = (\boldsymbol{\mu}^{(n)}, \mathbf{v}^{(n)}) \}_{n=1}^{N_b}$ of the stochastic parameters $\boldsymbol{\omega} = (\boldsymbol{\mu}, \mathbf{v})$; and
  • The moving averages of the mean and variance over the training set, $\bar{\boldsymbol{\omega}} = (\bar{\boldsymbol{\mu}} = E(\boldsymbol{\mu}), \bar{\mathbf{v}} = E(\mathbf{v}))$.
Some comments can be made here.
  • A batch-normalized network samples the stochastic parameters $\boldsymbol{\omega}$ once per training step (mini-batch). For a large number of epochs, the $\boldsymbol{\omega}^{(n)}$ become independent and identically distributed random variables for each training example;
  • The variational distribution $q_{\theta}(\boldsymbol{\omega})$ corresponds to the joint distribution of the weights induced by the stochastic parameters $\boldsymbol{\omega}$;
  • The equivalence between a batch-normalized network and a Bayesian approximation was proven in [39] by showing that (cf. Equations (39) and (55)) $\partial\, \mathrm{KL}(q_{\theta}(\boldsymbol{\omega})\,\|\, p(\boldsymbol{\omega})) / \partial \theta = (N/\sigma_{\mathbf{y}}^2)\, \partial \Omega(\theta) / \partial \theta$. The proof relies on the following simplifying assumptions: (i) no scale and shift transformations; (ii) batch normalization applied to each layer; (iii) independent input features in each layer; and (iv) large $N$ and $M$.
At inference time, the output for a given input $\mathbf{x}$ is $\mathbf{f}_{\bar{\boldsymbol{\omega}}}(\mathbf{x}, \hat{\theta})$. To estimate the predictive mean and covariance matrix, we proceed as follows. For each $t = 1, \ldots, T$, we sample a mini-batch $B^{(t)}$ from the data set $D = \bigcup_{n=1}^{N_b} B^{(n)}$, and for the corresponding mean and variance realization $\boldsymbol{\omega}^{(t)} \in \{ \boldsymbol{\omega}^{(n)} \}_{n=1}^{N_b}$, compute $\mathbf{f}_{\boldsymbol{\omega}^{(t)}}(\mathbf{x}, \hat{\theta})$ and then the predictive mean and covariance matrix from Equations (42) and (43), with $\mathbf{f}_{\boldsymbol{\omega}^{(t)}}(\mathbf{x}, \hat{\theta})$ in place of $\mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t)$.
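The Monte Carlo inference procedure described above can be sketched as follows (NumPy; the stored per-mini-batch statistics and the forward function are hypothetical placeholders for the quantities defined in this subsection):

```python
import numpy as np

rng = np.random.default_rng(0)

def batchnorm_mc_moments(forward_with_stats, x, stored_stats, T, sigma_y2=0.0):
    """Predictive moments, Equations (42) and (43), for a batch-normalized net.

    forward_with_stats(x, stats) : network output using the mini-batch
                                   means/variances 'stats' in the normalization step
    stored_stats                 : list of per-mini-batch (mu^(n), v^(n)) realizations
    """
    outputs = []
    for _ in range(T):
        stats = stored_stats[rng.integers(len(stored_stats))]   # sample omega^(t)
        outputs.append(forward_with_stats(x, stats))
    F = np.stack(outputs)
    mean = F.mean(axis=0)
    cov = sigma_y2 * np.eye(F.shape[1]) + F.T @ F / len(F) - np.outer(mean, mean)
    return mean, cov
```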

3. Neural Networks for Atmospheric Remote Sensing

In this section, we design several neural networks for atmospheric remote sensing. To test the neural networks, we considered a specific problem, namely the retrieval of cloud optical thickness and cloud top height from radiances measured by the Earth Polychromatic Imaging Camera (EPIC) onboard the Deep Space Climate Observatory (DSCOVR). DSCOVR is placed in a Lissajous orbit about the Lagrange-1 point and provides a unique angular perspective in an almost backward direction, with scattering angles between 168° and 176°. The EPIC instrument has 10 spectral channels ranging from the UV to the near-IR, which include four channels around the oxygen A- and B-bands; two absorption channels are centered at 688 nm and 764 nm with bandwidths of 0.8 nm and 1.0 nm, respectively, while two continuum channels are centered at 680 nm and 780 nm with bandwidths of 2.0 nm. These four channels are used for inferring the cloud parameters. To generate the database, we use a radiative transfer model based on the discrete ordinate method with matrix exponential [40,41] that uses several acceleration techniques, as for example, the telescoping technique, the method of false discrete ordinates [42], the correlated k-distribution method [43], and the principal component analysis [44,45,46]. The atmospheric parameters to be retrieved are the cloud optical thickness $\tau$ and the cloud top height $H$, while the forward model parameters are the solar and viewing zenith angles and the surface albedo. Specifically, we consider cloud optical thicknesses in the range $4.0, \ldots, 16.0$, cloud top heights in the range $2.0, \ldots, 10.0$ km, solar and viewing zenith angles in the range $0^{\circ}, \ldots, 60^{\circ}$, and surface albedos in the range $0.02, \ldots, 0.2$ (only snow/ice free scenes are considered). The simulations are performed for a water-cloud model with a Gamma size distribution $p(a) \propto a^{\alpha} \exp(-\alpha a / a_{\mathrm{mod}})$ with parameters $a_{\mathrm{mod}} = 8\,\mu\mathrm{m}$ and $\alpha = 6$. The droplet size ranges between 0.02 and 50.0 $\mu$m, the cloud geometrical thickness is 1 km, and the relative azimuth angle between the solar and viewing directions is 176°. The O$_2$ absorption cross-sections are computed using LBL calculations [47] with optimized rational approximations for the Voigt line profile [48]. The wavenumber grid point spacing is chosen as a fraction (e.g., 1/4) of the minimum half-width of the Voigt lines taken from the HITRAN database [49]. The Rayleigh cross-section and depolarization ratios are computed as in [50], while the pressure and temperature profiles correspond to the US standard model atmosphere [51]. The radiances are solar-flux normalized and are computed by means of the delta-M approximation in conjunction with the TMS correction. We generate $N$ = 20,000 samples by employing the smart sampling technique [52]. Of this data set, 18,000 samples were used for training and the remaining 2000 for prediction. The noisy spectra are generated by using the measurement noise $\boldsymbol{\delta}_{\mathrm{mes}} \sim N(\mathbf{0}, \mathrm{diag}[\sigma_{\mathrm{mes}\,j}^2]_{j=1}^{4})$, where for the $j$th channel we use $\sigma_{\mathrm{mes}\,j} = 0.1 \bar{I}_j$, with $\bar{I}_j$ being the average of the simulated radiance over the $N$ samples.
The neural network algorithms are implemented in FORTRAN by using a feed-forward multilayer perceptron architecture. The tool contains a variety of optimization algorithms, activation functions, regularization terms, dynamic learning rates, and stopping rules. For the present application, the number of hidden layers and the number of units in each layer are optimized by using 2000 samples from the training set for validation. To estimate the performance of different hyperparameter configurations, we used holdout cross-validation together with a grid search over a set of three values for the number of hidden layers, i.e., {1, 2, 3}, and a set of eight values for the number of units, i.e., {25, 50, 75, 100, 125, 150, 175, 200}. Mini-batch gradient descent in conjunction with Adaptive Moment Estimation (ADAM) [53] is used as the optimization tool, a mini-batch size of 100 samples is chosen, and a ReLU activation function is considered.

3.1. Neural Networks for Solving the Direct Problem

For a direct problem, the input $\mathbf{x}$ is the set of atmospheric parameters, the output $\mathbf{y}$ is the set of data, and the forward model $\mathbf{F}$ reproduces the radiative transfer model $\mathbf{R}$.
We consider a neural network trained with exact data. For the predictive distribution $p(\mathbf{y}|\mathbf{x}, D)$ given by Equation (32), we assume that the epistemic covariance matrix $\mathbf{C}_{\mathbf{y}e}(\mathbf{x}, \hat{\boldsymbol{\omega}})\ (= \mathbf{C}_{\mathbf{y}}(\mathbf{x}, \hat{\boldsymbol{\omega}}))$ is computed from the statistics of $\boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})$, that is, $\mathbf{C}_{\mathbf{y}e} \approx E(\mathbf{C}_{\mathbf{y}}|D)$, where $\mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})$ is the network output. Furthermore, let $\mathbf{y}^{\delta} = \mathbf{y} + \boldsymbol{\delta}_{\mathbf{y}}$ with $\boldsymbol{\delta}_{\mathbf{y}} \sim N(\mathbf{0}, \mathbf{C}_{\mathbf{y}\delta})$ be the noisy data vector. Using the result:
$$p(\mathbf{y}^{\delta}|\mathbf{y}) \propto \exp\left[ -\frac{1}{2} (\mathbf{y}^{\delta} - \mathbf{y})^T (\mathbf{C}_{\mathbf{y}\delta})^{-1} (\mathbf{y}^{\delta} - \mathbf{y}) \right],$$
we compute the predictive distribution for the noisy data $p(\mathbf{y}^{\delta}|\mathbf{x}, D)$ by marginalization, that is:
$$p(\mathbf{y}^{\delta}|\mathbf{x}, D) = \int p(\mathbf{y}^{\delta}|\mathbf{y})\, p(\mathbf{y}|\mathbf{x}, D)\, \mathrm{d}\mathbf{y} = \int \exp\left\{ -\frac{1}{2} [\mathbf{y}^{\delta} - \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}}) + \boldsymbol{\Delta}_{\mathbf{y}}]^T [\mathbf{C}_{\mathbf{y}e}(\hat{\boldsymbol{\omega}})]^{-1} [\mathbf{y}^{\delta} - \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}}) + \boldsymbol{\Delta}_{\mathbf{y}}] \right\} \exp\left[ -\frac{1}{2} \boldsymbol{\Delta}_{\mathbf{y}}^T (\mathbf{C}_{\mathbf{y}\delta})^{-1} \boldsymbol{\Delta}_{\mathbf{y}} \right] \mathrm{d}\boldsymbol{\Delta}_{\mathbf{y}} \propto \exp\left\{ -\frac{1}{2} [\mathbf{y}^{\delta} - \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})]^T \mathbf{C}_{\mathbf{y}}^{-1}(\hat{\boldsymbol{\omega}})\, [\mathbf{y}^{\delta} - \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})] \right\},$$
where (cf. Equation (33)) $\mathbf{C}_{\mathbf{y}}(\hat{\boldsymbol{\omega}}) = \mathbf{C}_{\mathbf{y}e}(\hat{\boldsymbol{\omega}}) + \mathbf{C}_{\mathbf{y}\delta}$ and $\boldsymbol{\Delta}_{\mathbf{y}} = \mathbf{y} - \mathbf{y}^{\delta}$.
In the next step, we solve the inverse problem $\mathbf{y}^{\delta} = \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})$ by means of a Bayesian approach [54,55]. In this case, the a posteriori density $p(\mathbf{x}|\mathbf{y}^{\delta}, D)$ is given by
$$p(\mathbf{x}|\mathbf{y}^{\delta}, D) = \frac{p(\mathbf{y}^{\delta}|\mathbf{x}, D)\, p(\mathbf{x})}{p(\mathbf{y}^{\delta})};$$
whence, by assuming that the state vector $\mathbf{x}$ is a Gaussian random vector with mean $\mathbf{x}_a$ and covariance matrix $\mathbf{C}_{\mathbf{x}}$, meaning that:
$$p(\mathbf{x}) = N(\mathbf{x}; \mathbf{x}_a, \mathbf{C}_{\mathbf{x}}) \propto \exp\left[ -\frac{1}{2} (\mathbf{x} - \mathbf{x}_a)^T \mathbf{C}_{\mathbf{x}}^{-1} (\mathbf{x} - \mathbf{x}_a) \right],$$
we obtain:
$$-\log p(\mathbf{x}|\mathbf{y}^{\delta}, D) = \frac{1}{2} V(\mathbf{x}|\mathbf{y}^{\delta}) + C,$$
where:
$$V(\mathbf{x}|\mathbf{y}^{\delta}) = [\mathbf{y}^{\delta} - \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})]^T \mathbf{C}_{\mathbf{y}}^{-1}(\hat{\boldsymbol{\omega}})\, [\mathbf{y}^{\delta} - \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})] + (\mathbf{x} - \mathbf{x}_a)^T \mathbf{C}_{\mathbf{x}}^{-1} (\mathbf{x} - \mathbf{x}_a)$$
is the a posteriori potential and $C$ is a constant. The maximum a posteriori estimator, defined by
$$\hat{\mathbf{x}} = \mathbf{x}_{\mathrm{MAP}} = \arg\max_{\mathbf{x}} \log p(\mathbf{x}|\mathbf{y}^{\delta}, D) = \arg\min_{\mathbf{x}} V(\mathbf{x}|\mathbf{y}^{\delta}),$$
can be computed by any optimization method, as for example, the Gauss–Newton method. If the problem is almost linear, the a posteriori density $p(\mathbf{x}|\mathbf{y}^{\delta}, D)$ is Gaussian with mean $\hat{\mathbf{x}}$ and covariance matrix [54]:
$$\mathbf{C}_{\mathbf{x}}(\hat{\mathbf{x}}, \hat{\boldsymbol{\omega}}) = \left[ \mathbf{K}_{\mathbf{x}}^T(\hat{\mathbf{x}})\, \mathbf{C}_{\mathbf{y}}^{-1}(\hat{\boldsymbol{\omega}})\, \mathbf{K}_{\mathbf{x}}(\hat{\mathbf{x}}) + \mathbf{C}_{\mathbf{x}}^{-1} \right]^{-1}.$$
It should be pointed out that we can train the neural network on a data set with output noise, $D = \{(\mathbf{x}^{(n)}, \mathbf{y}^{\delta(n)})\}_{n=1}^{N}$, in which case the covariance matrix $\mathbf{C}_{\mathbf{y}}(\hat{\boldsymbol{\omega}})$ can be directly computed from the statistics of $\boldsymbol{\varepsilon} = \mathbf{y}^{\delta} - \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})$.
Essentially, the method involves the following steps:
  • Train a neural network with exact data for simulating the radiative transfer model;
  • Compute the epistemic covariance matrix from the statistics of all network errors over the data set;
  • Solve the inverse problem by a Bayesian approach under the assumption that the a priori knowledge and the measurement uncertainty are both Gaussian;
  • Determine the uncertainty in the retrieval by assuming that the forward model is nearly linear.
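The last two steps can be illustrated by a minimal NumPy Gauss–Newton iteration that minimizes the a posteriori potential $V(\mathbf{x}|\mathbf{y}^{\delta})$ and returns the linearized a posteriori covariance matrix. The emulator f, its Jacobian jac, and all other names are hypothetical placeholders, not the operational implementation:

```python
import numpy as np

def map_retrieval(f, jac, y_delta, x_a, C_x_inv, C_y_inv, n_iter=20):
    """Minimize V(x | y_delta) by Gauss-Newton steps and return the MAP
    estimate together with the linearized a posteriori covariance matrix."""
    x = x_a.copy()
    for _ in range(n_iter):
        K = jac(x)                                     # Jacobian of the emulator
        r = y_delta - f(x)                             # data residual
        A = K.T @ C_y_inv @ K + C_x_inv                # Gauss-Newton normal matrix
        g = K.T @ C_y_inv @ r - C_x_inv @ (x - x_a)    # descent direction term
        x = x + np.linalg.solve(A, g)
    C_x_hat = np.linalg.inv(jac(x).T @ C_y_inv @ jac(x) + C_x_inv)
    return x, C_x_hat
```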
In Figure 1, we plot the histograms of the relative error over the prediction set:
$$\varepsilon_x = \frac{x_{\mathrm{pred}} - x}{x},$$
where $x$ stands for $\tau$ and $H$, and $x_{\mathrm{pred}}$ and $x$ are the predicted and true values, respectively. Also shown here are the plots of $x_{\mathrm{pred}}$ and $x_{\mathrm{pred}} \pm 3\sigma_x$ versus $x$. The results demonstrate that the cloud optical thickness is retrieved with better accuracy than the cloud top height, and that for the cloud top height, the predicted uncertainties are especially unrealistic, because the condition:
$$x_{\mathrm{pred}} - 3\sigma_x \le x \le x_{\mathrm{pred}} + 3\sigma_x$$
is not satisfied. The reason for this failure might be that the forward model is not nearly linear.

3.2. Neural Networks for Solving the Inverse Problem

For an inverse problem, the input x includes the sets of data and forward model parameters (solar and viewing zenith angles, and surface albedo), while the output y includes the set of atmospheric parameters to be retrieved (cloud optical thickness and cloud top height).

3.2.1. Method 1

Let $\mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})$ be the output of a neural network trained with exact data, and assume that the predictive distribution $p(\mathbf{y}|\mathbf{x}, D)$ given by Equation (32), and in particular the epistemic covariance matrix $\mathbf{C}_{\mathbf{y}e}(\mathbf{x}, \hat{\boldsymbol{\omega}})\ (= \mathbf{C}_{\mathbf{y}}(\mathbf{x}, \hat{\boldsymbol{\omega}}))$, are available. For the noisy input $\mathbf{z} = \mathbf{x} + \boldsymbol{\delta}_{\mathbf{x}}$ with $p(\mathbf{z}|\mathbf{x}) = N(\mathbf{z}; \mathbf{x}, \mathbf{C}_{\mathbf{x}\delta} = \sigma_x^2 \mathbf{I})$, and under the assumption that the prior $p(\mathbf{x})$ is a slowly varying function compared to $p(\mathbf{z}|\mathbf{x})$, the predictive distribution of the network output can be approximated by [56]
$$p(\mathbf{y}|\mathbf{z}, D) = \int p(\mathbf{y}|\mathbf{x}, D)\, p(\mathbf{x}|\mathbf{z})\, \mathrm{d}\mathbf{x} = \frac{1}{p(\mathbf{z})} \int p(\mathbf{y}|\mathbf{x}, D)\, p(\mathbf{z}|\mathbf{x})\, p(\mathbf{x})\, \mathrm{d}\mathbf{x} \propto \int p(\mathbf{y}|\mathbf{x}, D)\, p(\mathbf{z}|\mathbf{x})\, \mathrm{d}\mathbf{x}.$$
Thus, if $p(\mathbf{x})$ varies much more slowly than $p(\mathbf{z}|\mathbf{x}) = p(\mathbf{z} - \mathbf{x})$, we assume that $p(\mathbf{y}|\mathbf{z}, D)$ is the convolution of the predictive distribution for an uncorrupted input $p(\mathbf{y}|\mathbf{x}, D)$ with the input noise distribution $p(\mathbf{z} - \mathbf{x})$. In the noise-free case, that is, if $\sigma_x \to 0$, we have $\lim_{\sigma_x \to 0} p(\mathbf{z}|\mathbf{x}) = \delta(\mathbf{z} - \mathbf{x})$, yielding $p(\mathbf{y}|\mathbf{z}, D) = p(\mathbf{y}|\mathbf{x}, D)$. Using Equation (64), we obtain:
$$E(\mathbf{y}|\mathbf{z}) = \int \mathbf{y}\, p(\mathbf{y}|\mathbf{z}, D)\, \mathrm{d}\mathbf{y} = \int \left[ \int \mathbf{y}\, p(\mathbf{y}|\mathbf{x}, D)\, \mathrm{d}\mathbf{y} \right] p(\mathbf{z}|\mathbf{x})\, \mathrm{d}\mathbf{x} = \int \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})\, p(\mathbf{z}|\mathbf{x})\, \mathrm{d}\mathbf{x}$$
and:
$$E(\mathbf{y}\mathbf{y}^T|\mathbf{z}) = \int \mathbf{y}\mathbf{y}^T p(\mathbf{y}|\mathbf{z}, D)\, \mathrm{d}\mathbf{y} = \int \left[ \int \mathbf{y}\mathbf{y}^T p(\mathbf{y}|\mathbf{x}, D)\, \mathrm{d}\mathbf{y} \right] p(\mathbf{z}|\mathbf{x})\, \mathrm{d}\mathbf{x} = \int \left[ \mathbf{C}_{\mathbf{y}e}(\mathbf{x}, \hat{\boldsymbol{\omega}}) + \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})\, \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})^T \right] p(\mathbf{z}|\mathbf{x})\, \mathrm{d}\mathbf{x}.$$
Equations (65) and (66) show that, in general, $E(\mathbf{y}|\mathbf{z}) \ne \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\omega}})$, and that a noise process blurs or smooths the original mappings.
To compute the predictive mean $E(\mathbf{y}|\mathbf{z})$ and the covariance matrix $\mathrm{Cov}(\mathbf{y}|\mathbf{z})$, the integrals in Equations (65) and (66) must be calculated. This can be done by Monte Carlo integration (by sampling the Gaussian distribution $p(\mathbf{z}|\mathbf{x})$) or by a quadrature method. In the latter case, we consider a uniform grid $\{\mathbf{x}_i\}_{i=1}^{N_{\mathrm{int}}}$ in the interval, say, $[\mathbf{z} - 2\sigma_x, \mathbf{z} + 2\sigma_x]$, define the weights:
$$v_i = \frac{\exp\left( -\frac{1}{2\sigma_x^2} \| \mathbf{z} - \mathbf{x}_i \|^2 \right)}{\sum_{i=1}^{N_{\mathrm{int}}} \exp\left( -\frac{1}{2\sigma_x^2} \| \mathbf{z} - \mathbf{x}_i \|^2 \right)},$$
and use the computational formulas:
$$E(\mathbf{y}|\mathbf{z}) = \sum_{i=1}^{N_{\mathrm{int}}} v_i\, \mathbf{f}(\mathbf{x}_i, \hat{\boldsymbol{\omega}}),$$
and:
$$\mathrm{Cov}(\mathbf{y}|\mathbf{z}) = \sum_{i=1}^{N_{\mathrm{int}}} v_i\, \mathbf{C}_{\mathbf{y}e}(\mathbf{x}_i, \hat{\boldsymbol{\omega}}) + \sum_{i=1}^{N_{\mathrm{int}}} v_i\, \mathbf{f}(\mathbf{x}_i, \hat{\boldsymbol{\omega}})\, \mathbf{f}(\mathbf{x}_i, \hat{\boldsymbol{\omega}})^T - E(\mathbf{y}|\mathbf{z})\, E(\mathbf{y}|\mathbf{z})^T.$$
The neural network trained with exact data can be deterministic or stochastic (for example, Bayes-by-backprop, dropout, or batch normalization). In this regard, for each noisy input $\mathbf{z}$, we consider a uniform grid $\{\mathbf{x}_i\}_{i=1}^{N_{\mathrm{int}}}$ around $\mathbf{z}$, calculate the weights $v_i$ by means of Equation (67), and for each $\mathbf{x}_i$, compute $\mathbf{C}_{\mathbf{y}e}(\mathbf{x}_i, \hat{\boldsymbol{\omega}})$ as follows:
  • For a deterministic network, we approximate $\mathbf{C}_{\mathbf{y}e}(\mathbf{x}_i, \hat{\boldsymbol{\omega}})$ by the conditional average covariance matrix $E(\mathbf{C}_{\mathbf{y}e}|D)$ of all network errors over the data set, that is, $\mathbf{C}_{\mathbf{y}e} \approx E(\mathbf{C}_{\mathbf{y}e}|D)$, yielding $\sum_{i=1}^{N_{\mathrm{int}}} v_i\, \mathbf{C}_{\mathbf{y}e}(\mathbf{x}_i, \hat{\boldsymbol{\omega}}) = E(\mathbf{C}_{\mathbf{y}e}|D)$;
  • For Bayes-by-backprop and dropout networks, we compute $\mathbf{C}_{\mathbf{y}e}(\mathbf{x}_i, \hat{\boldsymbol{\omega}})$ by means of Equation (44), with $E(\mathbf{y})$ as in Equation (42);
  • For a batch-normalized network, we compute $\mathbf{C}_{\mathbf{y}e}(\mathbf{x}_i, \hat{\boldsymbol{\omega}})$ as
    $$\mathrm{Cov}(\mathbf{y}) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{f}_{\boldsymbol{\omega}^{(t)}}(\mathbf{x}_i, \hat{\theta})\, \mathbf{f}_{\boldsymbol{\omega}^{(t)}}(\mathbf{x}_i, \hat{\theta})^T - E(\mathbf{y})\, E(\mathbf{y})^T$$
    with $E(\mathbf{y}) = (1/T) \sum_{t=1}^{T} \mathbf{f}_{\boldsymbol{\omega}^{(t)}}(\mathbf{x}_i, \hat{\theta})$.
Note that for a Bayes-by-backprop network, the output is $\mathbf{f}(\mathbf{x}_i, \hat{\boldsymbol{\omega}} = \hat{\boldsymbol{\mu}}_{\boldsymbol{\omega}})$; for a dropout network, the output $\mathbf{f}(\mathbf{x}_i, \hat{\boldsymbol{\omega}})$ is computed without dropout; and for a batch-normalized network, the output is $\mathbf{f}_{\bar{\boldsymbol{\omega}}}(\mathbf{x}_i, \hat{\theta})$.
In summary, the method uses:
  • Deterministic and stochastic networks trained with exact data to compute the epistemic covariance matrix; and
  • The assumption that the predictive distribution of the network output is the convolution of the predictive distribution for an uncorrupted input with the input noise distribution to estimate the covariance matrix.
Under the assumption that the noise process is Gaussian, the convolution integrals are computed by a quadrature method with a uniform grid around the noisy input.
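A minimal NumPy sketch of this quadrature, i.e., the weights of Equation (67) and the ensuing mean and covariance formulas, is given below (the emulator f and the epistemic covariance cov_e are hypothetical callables):

```python
import numpy as np

def convolved_moments(f, cov_e, grid_points, z, sigma_x):
    """Predictive mean and covariance for a noisy input z, obtained by
    quadrature of the convolution integrals over a grid around z.

    f(x)        : network output for a noise-free input x
    cov_e(x)    : epistemic covariance C_ye(x, omega_hat) at x
    grid_points : list of grid nodes x_i around z
    """
    d2 = np.array([np.sum((z - xi) ** 2) for xi in grid_points])
    w = np.exp(-0.5 * d2 / sigma_x**2)
    w /= w.sum()                                           # weights v_i, Eq. (67)
    F = np.stack([f(xi) for xi in grid_points])            # f(x_i, omega_hat)
    mean = w @ F                                           # weighted predictive mean
    cov = sum(wi * cov_e(xi) for wi, xi in zip(w, grid_points))
    cov += (F * w[:, None]).T @ F - np.outer(mean, mean)   # weighted second moment
    return mean, cov
```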
The results for Method 1 with deterministic and stochastic networks are illustrated in Figure 2, Figure 3, Figure 4 and Figure 5.

3.2.2. Method 2

In order to model both heteroscedastic and epistemic uncertainties, we used the approach described in Ref. [57]. More precisely, we considered the data model with input and output noise, and used dropout to learn the heteroscedastic covariance matrix $\overline{\mathbf{C}}_{\mathbf{y}\delta}(\mathbf{z}, \boldsymbol{\omega}) = \mathrm{diag}[\sigma_j^2(\mathbf{z}, \boldsymbol{\omega})]_{j=1}^{N_y}$ from the data (see Section 2.1). The network has the output $[\mu_j(\mathbf{z}, \boldsymbol{\omega}), \sigma_j^2(\mathbf{z}, \boldsymbol{\omega})] \in \mathbb{R}^{2N_y}$ and is trained to predict the log variance $\rho_j = \log \sigma_j^2$, in which case the likelihood loss function is given by Equations (26) and (27). Considering the set of samples $\{\boldsymbol{\omega}_t\}_{t=1}^{T}$, where $\boldsymbol{\omega}_t$ corresponds to a realization of the Bernoulli variables $\mathbf{Z}_{l-1}^{(t)}$ and $\mathbf{W}_l^{(t)} = \mathbf{W}_l \mathbf{Z}_{l-1}^{(t)}$ for all $l = 1, \ldots, L$, we compute the predictive mean and covariance matrix as [57]
$$E(\mathbf{y}) \approx \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{\mu}(\mathbf{z}, \boldsymbol{\omega}_t),$$
$$\mathrm{Cov}(\mathbf{y}) \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{diag}[\sigma_j^2(\mathbf{z}, \boldsymbol{\omega}_t)]_{j=1}^{N_y} + \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{\mu}(\mathbf{z}, \boldsymbol{\omega}_t)\, \boldsymbol{\mu}(\mathbf{z}, \boldsymbol{\omega}_t)^T - E(\mathbf{y})\, E(\mathbf{y})^T,$$
for each noisy input $\mathbf{z}$. The first term in Equation (71) reproduces the heteroscedastic uncertainty, while the second and third terms reproduce the epistemic uncertainty.
Instead of a dropout network, a Bayes-by-backprop network can also be used to learn the heteroscedastic covariance matrix from the data. In this case, $F(\theta, D)$ is given by Equation (47) with $-\log p(D|\boldsymbol{\omega}) = E_D(\boldsymbol{\omega}) = \sum_{n=1}^{N} E_D^{(n)}(\boldsymbol{\omega})$ and $E_D^{(n)}(\boldsymbol{\omega})$ as in Equation (27).
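The predictive moments of Equations (70) and (71) can be assembled from $T$ stochastic forward passes of the mean/log-variance network; a minimal NumPy sketch (hypothetical names) reads:

```python
import numpy as np

def method2_moments(stochastic_forward, z, T):
    """Mean and covariance from a heteroscedastic dropout (or
    Bayes-by-backprop) network, Equations (70) and (71).

    stochastic_forward(z) : one stochastic pass returning (mu, rho) with
                            rho = log sigma^2, each of shape (N_y,)
    """
    mus, vars_ = [], []
    for _ in range(T):
        mu, rho = stochastic_forward(z)
        mus.append(mu)
        vars_.append(np.exp(rho))                    # sigma_j^2(z, omega_t)
    M = np.stack(mus)
    mean = M.mean(axis=0)
    cov = np.diag(np.stack(vars_).mean(axis=0))      # heteroscedastic part
    cov += M.T @ M / T - np.outer(mean, mean)        # epistemic part
    return mean, cov
```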
The results for Method 2 with a dropout and a Bayes-by-backprop network are illustrated in Figure 6 and Figure 7.

3.2.3. Method 3

Let $\hat{\boldsymbol{\omega}} = \{\mathbf{W}_l, \mathbf{b}_l\}_{l=1}^{L}$ be the parameters of a dropout network trained with exact data. Further, assume that the input data are noisy, i.e., $\mathbf{z} = \mathbf{x} + \boldsymbol{\delta}_{\mathbf{x}}$ with $p(\mathbf{z}|\mathbf{x}) = N(\mathbf{z}; \mathbf{x}, \mathbf{C}_{\mathbf{x}\delta} = \mathrm{diag}[\sigma_{xk}^2]_{k=1}^{N_x})$. In order to compute the heteroscedastic uncertainty, we forward propagate the input noise through the network. This is done by using assumed density filtering and interval arithmetic.
  • Assumed density filtering (ADF). This approach was applied to neural networks by Gast and Roth [58] to replace each network activation by a probability distribution. In the following, we provide a simplified justification of this approach, while for a more detailed analysis, we refer to Appendix D. For a linear layer ($g_l(x) = x$), the feed-forward operation (without dropout) is:
    $$y_{k,l} = \sum_{j=1}^{N_{l-1}} w_{kj,l}\, y_{j,l-1} + b_{k,l}, \quad k = 1, \ldots, N_l.$$
    By straightforward calculation, we find:
    $$\mu_{k,l} = E(y_{k,l}) = \sum_{j=1}^{N_{l-1}} w_{kj,l}\, E(y_{j,l-1}) + b_{k,l} = \sum_{j=1}^{N_{l-1}} w_{kj,l}\, \mu_{j,l-1} + b_{k,l},$$
    and:
    $$E(y_{k,l})\, E(y_{k_1,l}) = \sum_{j=1}^{N_{l-1}} \sum_{j_1=1}^{N_{l-1}} w_{kj,l}\, w_{k_1 j_1,l}\, \mu_{j,l-1}\, \mu_{j_1,l-1} + b_{k,l}\, \mu_{k_1,l} + b_{k_1,l}\, \mu_{k,l} - b_{k,l}\, b_{k_1,l},$$
    yielding:
    $$E(y_{k,l}\, y_{k_1,l}) - E(y_{k,l})\, E(y_{k_1,l}) = \sum_{j=1}^{N_{l-1}} \sum_{j_1=1}^{N_{l-1}} w_{kj,l}\, w_{k_1 j_1,l} \left[ E(y_{j,l-1}\, y_{j_1,l-1}) - E(y_{j,l-1})\, E(y_{j_1,l-1}) \right].$$
    Assuming that the $y_{k,l}$ are independent random variables, in which case the covariance matrix corresponding to the column vector $[y_{1,l}, \ldots, y_{N_l,l}]^T$ is diagonal, meaning that:
    $$E(y_{k,l}\, y_{k_1,l}) - E(y_{k,l})\, E(y_{k_1,l}) = \delta_{k k_1} \left[ E(y_{k,l}^2) - E^2(y_{k,l}) \right] = \delta_{k k_1}\, v_{k,l},$$
    we obtain $v_{k,l} = \sum_{j=1}^{N_{l-1}} w_{kj,l}^2\, v_{j,l-1}$. In summary, the iterative process for a linear layer is:
    $$\mu_{k,0} = x_k, \quad v_{k,0} = \sigma_{xk}^2,$$
    $$\mu_{k,l} = \sum_{j=1}^{N_{l-1}} w_{kj,l}\, \mu_{j,l-1} + b_{k,l},$$
    $$v_{k,l} = \sum_{j=1}^{N_{l-1}} w_{kj,l}^2\, v_{j,l-1}, \quad l = 1, \ldots, L.$$
    For a ReLU activation function $\mathrm{ReLU}(x) = \max(0, x)$, it was shown that:
    $$\mu_{\mathrm{ReLU}\,k,l} = \sqrt{v_{k,l}}\, \phi(\alpha) + \mu_{k,l}\, \Phi(\alpha),$$
    $$v_{\mathrm{ReLU}\,k,l} = \mu_{k,l} \sqrt{v_{k,l}}\, \phi(\alpha) + (\mu_{k,l}^2 + v_{k,l})\, \Phi(\alpha) - \mu_{\mathrm{ReLU}\,k,l}^2,$$
    where $\mu_{k,l}$ and $v_{k,l}$ are given by Equations (78) and (79), respectively, and:
    $$\alpha = \frac{\mu_{k,l}}{\sqrt{v_{k,l}}}, \quad \phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} x^2 \right), \quad \Phi(x) = \int_{-\infty}^{x} \phi(y)\, \mathrm{d}y.$$
  • Interval arithmetic (IA). Interval arithmetic is based on an extension of the real number system to a system of closed intervals on the real axis [59]. For the intervals $X$ and $Y$, the elementary arithmetic operations are defined by the rule $X \oplus Y = \{ x \oplus y\ |\ x \in X,\ y \in Y \}$, where the binary operation $\oplus$ can stand for addition, subtraction, multiplication, or division. This definition guarantees that $x \oplus y \in X \oplus Y$. Functions of interval arguments are defined in terms of the standard set mapping, that is, the image of an interval $X$ under a function $f$ is the set $f(X) = \{ f(x)\ |\ x \in X \}$. This is not the same as an interval function obtained from a real function $f$ by replacing the real argument by an interval argument and the real arithmetic operations by the corresponding interval operations. The latter is called an interval extension of the real function $f$ and is denoted by $F(X)$. As a corollary of the fundamental theorem of interval analysis, it can be shown that $f(X) \subseteq F(X)$. Interval analysis provides a simple and accessible way to assess error propagation. The iterative process for error propagation is (compare with Equations (77)–(79)):
    $$Y_{k,0} = [x_k - \sigma_{xk},\ x_k + \sigma_{xk}],$$
    $$U_{k,l} = \sum_{j=1}^{N_{l-1}} w_{kj,l}\, Y_{j,l-1} + b_{k,l},$$
    $$Y_{k,l} = G_l(U_{k,l}), \quad l = 1, \ldots, L,$$
    while the output predictions $\mu_{k,L}$ and their standard deviations $\sigma_{k,L} = \sqrt{v_{k,L}}$ are computed as
    $$\mu_{k,L} = \frac{1}{2} \left[ \underline{Y}_{k,L} + \overline{Y}_{k,L} \right],$$
    $$\sigma_{k,L} = \frac{1}{2} \left[ \overline{Y}_{k,L} - \underline{Y}_{k,L} \right], \quad k = 1, \ldots, N_L,$$
    where $G_l(U)$ is the interval extension of the activation function $g_l$, and $\underline{X}$ and $\overline{X}$ are the left and right endpoints of the interval $X$, respectively, that is, $X = [\underline{X}, \overline{X}]$.
By assumed density filtering and interval arithmetic, the forward pass of a neural network generates not only the output predictions $\boldsymbol{\mu}_L = [\mu_{1,L}, \ldots, \mu_{N_L,L}]^T$ but also their variances $\mathbf{v}_L = [v_{1,L}, \ldots, v_{N_L,L}]^T$. Following Ref. [2], we now consider a network with dropout, that is, in Equations (78), (79), and (83), we replace $w_{kj,l}$ by $w_{kj,l}\, z_{j,l-1}$, where $z_{j,l-1} \sim \mathrm{Bernoulli}(p)$ and the dropout probability $p$ is the same as that used for training. For the set of samples $\{\boldsymbol{\omega}_t\}_{t=1}^{T}$, where $\boldsymbol{\omega}_t$ corresponds to a realization of the Bernoulli variables $\mathbf{Z}_{l-1}^{(t)}$ and $\mathbf{W}_l^{(t)} = \mathbf{W}_l \mathbf{Z}_{l-1}^{(t)}$ for all $l = 1, \ldots, L$, we denote by $\boldsymbol{\mu}_L(\mathbf{x}, \boldsymbol{\omega}_t)$ and $\mathbf{v}_L(\mathbf{x}, \boldsymbol{\omega}_t)$ the outputs of the network for an input $\mathbf{x}$ corrupted by the noise $\boldsymbol{\delta}_{\mathbf{x}} \sim N(\mathbf{0}, \mathbf{C}_{\mathbf{x}\delta} = \mathrm{diag}[\sigma_{xk}^2]_{k=1}^{N_x})$, and compute the predictive mean and covariance matrix as (Appendix B)
$$E(\mathbf{y}|\mathbf{x}) \approx \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{\mu}_L(\mathbf{x}, \boldsymbol{\omega}_t),$$
$$\mathrm{Cov}(\mathbf{y}|\mathbf{x}) \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{diag}[v_{k,L}(\mathbf{x}, \boldsymbol{\omega}_t)]_{k=1}^{N_y} + \frac{1}{T} \sum_{t=1}^{T} \boldsymbol{\mu}_L(\mathbf{x}, \boldsymbol{\omega}_t)\, \boldsymbol{\mu}_L(\mathbf{x}, \boldsymbol{\omega}_t)^T - E(\mathbf{y}|\mathbf{x})\, E(\mathbf{y}|\mathbf{x})^T.$$
From Equations (87) and (88), it is obvious that the prediction ensemble is not generated by the output of the network $\mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t)$, but by the prediction $\boldsymbol{\mu}_L(\mathbf{x}, \boldsymbol{\omega}_t)$. In summary, the algorithm involves the following steps:
  • Transform a dropout network into its assumed density filtering or interval arithmetic versions (which does not require retraining);
  • Propagate $(\mathbf{x}, \mathbf{v}_0)$ through the dropout network and collect $T$ output predictions and variances;
  • Compute the predictive mean and covariance matrix by means of Equations (87) and (88).
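The ADF recursions and ReLU moment formulas given above translate into a short moment-propagation pass. The following NumPy sketch (hypothetical names; the dropout masks and the interval arithmetic variant are omitted for brevity) illustrates one such pass for a fully connected ReLU network:

```python
import numpy as np
from scipy.stats import norm

def adf_forward(x, var_x, weights, biases):
    """Propagate the input mean x and variance var_x through the network;
    returns the output means mu_L and variances v_L."""
    mu, v = x.copy(), var_x.copy()                     # mu_{k,0}, v_{k,0}
    for i, (W, b) in enumerate(zip(weights, biases)):
        mu = W @ mu + b                                # linear-layer mean, Eq. (78)
        v = (W**2) @ v                                 # linear-layer variance, Eq. (79)
        if i < len(weights) - 1:                       # ReLU moments for hidden layers
            s = np.sqrt(np.maximum(v, 1e-12))
            alpha = mu / s
            mu_relu = s * norm.pdf(alpha) + mu * norm.cdf(alpha)
            v_relu = mu * s * norm.pdf(alpha) + (mu**2 + v) * norm.cdf(alpha) - mu_relu**2
            mu, v = mu_relu, np.maximum(v_relu, 1e-12)
    return mu, v
```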
The results for Method 3 using assumed density filtering and interval arithmetic are illustrated in Figure 8 and Figure 9, respectively. The main drawback of this method is that it requires the knowledge of the exact input data x . Because in atmospheric remote sensing that is not the case, Method 3 will be used for a comparison with the other two methods.

3.3. Summary of Numerical Analysis

In Table 1, we summarize the results of our numerical simulations by illustrating the average relative error and the standard deviation over the prediction set, $E(\varepsilon_x) \pm \sqrt{E([\varepsilon_x - E(\varepsilon_x)]^2)}$ and $E(\sigma_x)$, respectively, where $x$ stands for the cloud optical thickness $\tau$ and the cloud top height $H$. The accuracy of a method is reflected by the bias of the error $E(\varepsilon_x)$ and the interval about the mean with length $\sqrt{E([\varepsilon_x - E(\varepsilon_x)]^2)}$, while the precision is reflected by the standard deviation $E(\sigma_x)$ (which determines the length of the uncertainty interval). Note that (i) $\sqrt{E([\varepsilon_x - E(\varepsilon_x)]^2)}$ reproduces the square root of the diagonal elements of the conditional average covariance matrix $E(\mathbf{C}_{\mathbf{y}}|D_{\mathrm{test}})$ of all network errors over the test set $D_{\mathrm{test}}$; and (ii) roughly speaking, the epistemic (model) uncertainties are large if there are large variations around the mean.
The results in Figures 1–9 and Table 1 can be summarized as follows:
  • For the direct problem, the neural network method used in conjunction with a Bayesian inversion method provides satisfactory accuracy, but does not correctly predict the uncertainty. The reason is that the forward model is not nearly linear, whereas near-linearity is the main assumption under which the uncertainty in the retrieval is computed.
  • For the inverse problem, the following features are apparent.
    (a) Method 1 using deterministic and Bayes-by-backprop networks yields low accuracy, while Method 1 using dropout and batch-normalized networks provides high accuracy;
    (b) Method 2 using a dropout network has an acceptable accuracy. For cloud top height retrieval, Method 2 using a Bayes-by-backprop network has a similar accuracy, but for cloud optical thickness retrieval, its accuracy is low. Possible reasons for this loss of accuracy are non-optimal training and/or the use of the prior $p(\boldsymbol{\omega}) = N(\boldsymbol{\omega}; \mathbf{0}, \mathbf{I})$ instead of that given by Equation (45) (recall that for $p(\boldsymbol{\omega}) = N(\boldsymbol{\omega}; \mathbf{0}, \mathbf{I})$, the KL divergence $\mathrm{KL}(q_\theta(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega}))$ can be computed analytically);
    (c) Method 3 with assumed density filtering and interval arithmetic is comparable with Method 2 using a dropout network, that is, it also has a high accuracy;
    (d) Of the methods with reasonable accuracy, Method 2 using a dropout network predicts uncertainties similar to those of Method 3 with assumed density filtering and interval arithmetic. In contrast, Method 1 using dropout and batch-normalized networks provides lower uncertainties, dominated by epistemic uncertainties. In general, Method 1 seems to predict too small a heteroscedastic uncertainty, possibly because the network was trained with exact data, so that, for this retrieval example, the predictive distribution, represented as a convolution integral over the input noise distribution, does not correctly reproduce the aleatoric uncertainty;
    (e) Because $\sqrt{\mathrm{E}([\varepsilon_x - \mathrm{E}(\varepsilon_x)]^2)} < \mathrm{E}(\sigma_x)$, we deduce that the conditional average covariance matrix $\mathrm{E}(\mathbf{C}_y \mid \mathcal{D}_{\mathrm{test}})$ does not coincide with the predictive covariance matrix $\mathrm{Cov}(\mathbf{y})$, which reflects the uncertainty.

4. Conclusions

We presented several neural networks for predicting uncertainty in atmospheric remote sensing. The neural networks are designed in a Bayesian framework and are devoted to the solution of the direct and inverse problems.
  • For solving the direct problem, we considered a neural network that simulates the radiative transfer model, computed the epistemic covariance matrix from the statistics of all network errors over the data set, solved the inverse problem by a Bayesian approach, and determined the uncertainty in the retrieval under the assumption that the forward model is nearly linear.
  • For solving the inverse problem, two neural network methods, relying on different assumptions, were implemented:
    (a) The first method uses deterministic and stochastic (Bayes-by-backprop, dropout, and batch normalization) networks to compute the epistemic covariance matrix and, under the assumption that the predictive distribution of the network output is the convolution of the predictive distribution for a noise-free input with the input noise distribution, estimates the covariance matrix;
    (b) The second method uses dropout and Bayes-by-backprop networks to learn the heteroscedastic covariance matrix from the data.
In addition, for solving the inverse problem, a third method was designed that uses a dropout network and forward propagates the input noise through the network by means of assumed density filtering and interval arithmetic. Because this method requires knowledge of the exact input data, it was used only for testing purposes.
Our numerical analysis has shown that a dropout network that is used to learn the heteroscedastic covariance matrix from the data is appropriate for predicting the uncertainty associated with the retrieval of cloud parameters from EPIC measurements. In fact, the strengths of a dropout network are (i) its capability to avoid overfitting and (ii) its stochastic character (the method is equivalent to a Bayesian approximation).
All neural network algorithms are implemented in FORTRAN and incorporated in a common tool. In the future, we intend to implement the algorithms in the high-level programming language Python and use the deep learning library PyTorch. This library offers a variety of network architectures, provides automatic differentiation, and supports GPUs for fast and efficient computation. The Python tool will be released through a public repository to make the methods available to the scientific community.

Author Contributions

Conceptualization, A.D. (Adrian Doicu) and A.D. (Alexandru Doicu); software, A.D. (Adrian Doicu), A.D. (Alexandru Doicu) and D.S.E.; formal analysis, D.L. and T.T.; writing—original draft preparation, A.D. (Adrian Doicu) and A.D. (Alexandru Doicu); writing—review and editing, D.S.E., D.L. and T.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

According to the Laplace approximation, we expand the loss function:
$$E(\boldsymbol{\omega}) = \frac{1}{2} \sum_{n=1}^{N} [\mathbf{y}^{(n)} - \mathbf{f}(\mathbf{z}^{(n)}, \boldsymbol{\omega})]^T [\mathbf{C}_y^\delta(\mathbf{z}^{(n)}, \boldsymbol{\omega})]^{-1} [\mathbf{y}^{(n)} - \mathbf{f}(\mathbf{z}^{(n)}, \boldsymbol{\omega})] + \frac{1}{2} \boldsymbol{\omega}^T \mathbf{C}_\omega^{-1} \boldsymbol{\omega},$$
around the point estimate $\widehat{\boldsymbol{\omega}} = \boldsymbol{\omega}_{\mathrm{MAP}}$ and use the optimality condition $\nabla E(\widehat{\boldsymbol{\omega}}) = \mathbf{0}$, to obtain:
$$E(\boldsymbol{\omega}) = E(\widehat{\boldsymbol{\omega}}) + \frac{1}{2} \Delta\boldsymbol{\omega}^T \mathbf{H}(\widehat{\boldsymbol{\omega}}) \Delta\boldsymbol{\omega},$$
$$\Delta\boldsymbol{\omega} = \boldsymbol{\omega} - \widehat{\boldsymbol{\omega}},$$
$$[\mathbf{H}(\widehat{\boldsymbol{\omega}})]_{ij} = \frac{\partial^2 E}{\partial \omega_i\, \partial \omega_j}(\widehat{\boldsymbol{\omega}}),$$
and further (cf. Equation (13)):
$$p(\boldsymbol{\omega} \mid \mathcal{D}) \propto \exp[-E(\boldsymbol{\omega})] \propto \exp\Big[-\frac{1}{2} (\boldsymbol{\omega} - \widehat{\boldsymbol{\omega}})^T \mathbf{H}(\widehat{\boldsymbol{\omega}}) (\boldsymbol{\omega} - \widehat{\boldsymbol{\omega}})\Big].$$
Inserting Equations (12) and (A5) in the expression of the predictive distribution as given by Equation (28) we find:
$$p(\mathbf{y} \mid \mathbf{z}, \mathcal{D}) = \int p(\mathbf{y} \mid \mathbf{z}, \boldsymbol{\omega})\, p(\boldsymbol{\omega} \mid \mathcal{D})\, d\boldsymbol{\omega} \propto \int \exp\Big[-\frac{1}{2} [\mathbf{y} - \mathbf{f}(\mathbf{z}, \boldsymbol{\omega})]^T [\mathbf{C}_y^\delta(\mathbf{z}, \boldsymbol{\omega})]^{-1} [\mathbf{y} - \mathbf{f}(\mathbf{z}, \boldsymbol{\omega})]\Big] \times \exp\Big[-\frac{1}{2} (\boldsymbol{\omega} - \widehat{\boldsymbol{\omega}})^T \mathbf{H}(\widehat{\boldsymbol{\omega}}) (\boldsymbol{\omega} - \widehat{\boldsymbol{\omega}})\Big]\, d\boldsymbol{\omega}.$$
In Equation (A6), the model function f ( x , ω ) can be approximated by a linear Taylor expansion around ω ^ , that is:
$$\mathbf{f}(\mathbf{x}, \boldsymbol{\omega}) \approx \mathbf{f}(\mathbf{x}, \widehat{\boldsymbol{\omega}}) + \mathbf{K}_\omega(\mathbf{x}, \widehat{\boldsymbol{\omega}})\, (\boldsymbol{\omega} - \widehat{\boldsymbol{\omega}}),$$
where:
$$\mathbf{K}_\omega(\mathbf{x}, \boldsymbol{\omega}) = \frac{\partial \mathbf{f}}{\partial \boldsymbol{\omega}}(\mathbf{x}, \boldsymbol{\omega})$$
is the Jacobian of $\mathbf{f}$ with respect to $\boldsymbol{\omega}$. Substituting Equation (A7) into Equation (A6), approximating $\mathbf{C}_y^\delta(\mathbf{z}, \boldsymbol{\omega}) \approx \mathbf{C}_y^\delta(\mathbf{z}, \widehat{\boldsymbol{\omega}})$, and computing the integral over $\boldsymbol{\omega}$ gives the representation (32) for the predictive distribution $p(\mathbf{y} \mid \mathbf{z}, \mathcal{D})$.
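Carrying out this Gaussian integral in the usual Laplace-approximation manner adds the weight-uncertainty term $\mathbf{K}_\omega \mathbf{H}^{-1} \mathbf{K}_\omega^T$ to the noise covariance. A minimal numerical sketch of that expression, assuming the Jacobian, the Hessian, and the noise covariance at $\widehat{\boldsymbol{\omega}}$ are available as NumPy arrays (illustrative code, not the article's FORTRAN implementation), is:

```python
import numpy as np

def laplace_predictive_covariance(K_omega, H, C_y_delta):
    """Noise covariance plus the weight-uncertainty term K_w H^{-1} K_w^T.
    K_omega   : (N_y, N_w) Jacobian of f with respect to the weights at omega_hat
    H         : (N_w, N_w) Hessian of the loss function at omega_hat
    C_y_delta : (N_y, N_y) measurement noise covariance at omega_hat"""
    # Solve H X = K_omega^T instead of forming H^{-1} explicitly.
    H_inv_Kt = np.linalg.solve(H, K_omega.T)
    return C_y_delta + K_omega @ H_inv_Kt
```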

Appendix B

For $p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\omega}) = N(\mathbf{y}; \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}), \mathbf{C}_y^\delta = \sigma_y^2 \mathbf{I})$, we compute the predictive mean of the output $\mathbf{y}$ given the input $\mathbf{x}$ as
$$\mathrm{E}(\mathbf{y}) = \int \mathbf{y}\, p(\mathbf{y} \mid \mathbf{x}, \mathcal{D})\, d\mathbf{y} \approx \int \mathbf{y}\, q_\theta(\mathbf{y} \mid \mathbf{x})\, d\mathbf{y} = \int \mathbf{y} \int p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\omega})\, q_\theta(\boldsymbol{\omega})\, d\boldsymbol{\omega}\, d\mathbf{y} = \int \Big[ \int \mathbf{y}\, N(\mathbf{y}; \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}), \sigma_y^2 \mathbf{I})\, d\mathbf{y} \Big] q_\theta(\boldsymbol{\omega})\, d\boldsymbol{\omega} = \int \mathbf{f}(\mathbf{x}, \boldsymbol{\omega})\, q_\theta(\boldsymbol{\omega})\, d\boldsymbol{\omega} \approx \frac{1}{T} \sum_{t=1}^{T} \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t),$$
and by using the result:
$$\mathrm{E}(\mathbf{y}\mathbf{y}^T) = \int \mathbf{y}\mathbf{y}^T p(\mathbf{y} \mid \mathbf{x}, \mathcal{D})\, d\mathbf{y} \approx \int \mathbf{y}\mathbf{y}^T q_\theta(\mathbf{y} \mid \mathbf{x})\, d\mathbf{y} = \int \mathbf{y}\mathbf{y}^T \int p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\omega})\, q_\theta(\boldsymbol{\omega})\, d\boldsymbol{\omega}\, d\mathbf{y} = \int \Big[ \int \mathbf{y}\mathbf{y}^T N(\mathbf{y}; \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}), \sigma_y^2 \mathbf{I})\, d\mathbf{y} \Big] q_\theta(\boldsymbol{\omega})\, d\boldsymbol{\omega} = \int [\sigma_y^2 \mathbf{I} + \mathbf{f}(\mathbf{x}, \boldsymbol{\omega})\, \mathbf{f}(\mathbf{x}, \boldsymbol{\omega})^T]\, q_\theta(\boldsymbol{\omega})\, d\boldsymbol{\omega} \approx \sigma_y^2 \mathbf{I} + \frac{1}{T} \sum_{t=1}^{T} \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t)\, \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t)^T,$$
the covariance matrix as
$$\mathrm{Cov}(\mathbf{y}) \approx \sigma_y^2 \mathbf{I} + \frac{1}{T} \sum_{t=1}^{T} \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t)\, \mathbf{f}(\mathbf{x}, \boldsymbol{\omega}_t)^T - \mathrm{E}(\mathbf{y})\, \mathrm{E}(\mathbf{y})^T.$$
The predictive mean and covariance matrix of a dropout network with assumed density filtering, given by Equations (87) and (88), respectively, can be computed in the same manner by taking into account that in this case, the predictive distribution of the network is given by
$$p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\omega}) = N(\mathbf{y}; \boldsymbol{\mu}_L(\mathbf{x}, \boldsymbol{\omega}), \mathrm{diag}[v_{j,L}(\mathbf{x}, \boldsymbol{\omega})]_{j=1}^{N_y}),$$
where $\boldsymbol{\mu}_L = [\mu_{1,L}, \ldots, \mu_{N_L,L}]^T$ and $\mathbf{v}_L = [v_{1,L}, \ldots, v_{N_L,L}]^T$ are the output predictions and their variances.

Appendix C

In this appendix, which is borrowed from [37], we show that the variational free energy has the standard form representation of the dropout loss function (as the sum of a square loss function and an L 2 regularization term).
For $\boldsymbol{\omega} = \{\mathbf{W}_l, \mathbf{b}_l\}_{l=1}^{L}$ and $\mathbf{W}_l = [\mathbf{w}_{k,l}]_{k=1}^{N_{l-1}} \in \mathbb{R}^{N_l \times N_{l-1}}$, we construct the variational distribution $q_\theta(\boldsymbol{\omega})$ as
$$q_\theta(\boldsymbol{\omega}) = \prod_{l=1}^{L} q_\theta(\mathbf{W}_l, \mathbf{b}_l) = \prod_{l=1}^{L} q(\mathbf{W}_l) \prod_{l=1}^{L} q(\mathbf{b}_l),$$
with:
$$q(\mathbf{W}_l) = \prod_{k=1}^{N_{l-1}} q(\mathbf{w}_{k,l}),$$
$$q(\mathbf{w}_{k,l}) = p\, N(\mathbf{m}_{k,l}, \sigma^2 \mathbf{I}_{N_l}) + (1 - p)\, N(\mathbf{0}, \sigma^2 \mathbf{I}_{N_l}),$$
and:
$$q(\mathbf{b}_l) = N(\mathbf{n}_l, \sigma^2 \mathbf{I}_{N_l}).$$
Here, $p \in [0, 1]$ is an activation probability, $\sigma > 0$ a scalar, and $\mathbf{M}_l = [\mathbf{m}_{k,l}]_{k=1}^{N_{l-1}}$ and $\mathbf{n}_l$ are variational parameters to be determined; thus, $\theta = \{\mathbf{M}_l, \mathbf{n}_l\}_{l=1}^{L}$. The key point of the derivation is the representation of $q(\mathbf{w}_{k,l})$ as a mixture of two Gaussians with the same variance (cf. Equation (A11)). When the standard deviation $\sigma$ tends towards 0, the Gaussians tend to Dirac delta distributions; hence, sampling from the mixture of the two Gaussians becomes equivalent to sampling from a Bernoulli-type distribution that returns either the value $\mathbf{0}$ with probability $1 - p$ or $\mathbf{m}_{k,l}$ with probability $p$, that is:
$$\mathbf{w}_{k,l} = \begin{cases} \mathbf{m}_{k,l} & \text{with probability } p, \\ \mathbf{0} & \text{with probability } 1 - p. \end{cases}$$
As a result, we obtain:
$$\mathbf{W}_l = [\mathbf{w}_{1,l}, \mathbf{w}_{2,l}, \ldots, \mathbf{w}_{N_{l-1},l}] = [\mathbf{m}_{1,l}, \mathbf{m}_{2,l}, \ldots, \mathbf{m}_{N_{l-1},l}] \begin{bmatrix} z_{1,l-1} & 0 & \cdots & 0 \\ 0 & z_{2,l-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & z_{N_{l-1},l-1} \end{bmatrix} = \mathbf{M}_l \mathbf{Z}_{l-1},$$
where $\mathbf{Z}_{l-1} = \mathrm{diag}[z_{k,l-1}]_{k=1}^{N_{l-1}}$ with $z_{k,l-1} \sim \mathrm{Bernoulli}(p)$. Note that the binary variable $z_{k,l-1}$ corresponds to the unit $k$ in layer $l-1$ being dropped out as an input to layer $l$. For $\mathbf{b}_l$, we take into account that $q(\mathbf{b}_l) = \lim_{\sigma \to 0} N(\mathbf{n}_l, \sigma^2 \mathbf{I}_{N_l}) = \delta(\mathbf{b}_l - \mathbf{n}_l)$; hence, in the limit $\sigma \to 0$, $\mathbf{b}_l$ is approximately deterministic, and we have $\mathbf{b}_l \approx \mathbf{n}_l$.
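In this limit, drawing a weight matrix from $q(\mathbf{W}_l)$ therefore amounts to Bernoulli dropout of the columns of $\mathbf{M}_l$; a minimal sketch (Python/NumPy, hypothetical names) is:

```python
import numpy as np

def sample_dropout_weights(M, p, rng):
    """Draw W_l = M_l Z_{l-1} with z_k ~ Bernoulli(p): column k of M_l
    (the weights attached to unit k of layer l-1) is kept with probability p
    and zeroed otherwise, i.e., unit k is dropped out."""
    z = rng.binomial(1, p, size=M.shape[1])   # one mask entry per input unit
    return M * z                              # broadcasting scales column k by z_k

# usage: W = sample_dropout_weights(M, p=0.9, rng=np.random.default_rng(0))
```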
The variational parameters M l and n l are computed by minimizing the variational free energy (39), that is:
$$F(\theta, \mathcal{D}) = -\int q_\theta(\boldsymbol{\omega}) \log p(\mathcal{D} \mid \boldsymbol{\omega})\, d\boldsymbol{\omega} + \mathrm{KL}(q_\theta(\boldsymbol{\omega}) \,\|\, p(\boldsymbol{\omega})).$$
The two terms in the above equation are computed as follows.
  • For the first term, written as
    $$-\int q_\theta(\boldsymbol{\omega}) \log p(\mathcal{D} \mid \boldsymbol{\omega})\, d\boldsymbol{\omega} = -\sum_{n=1}^{N} \int q_\theta(\boldsymbol{\omega}) \log p(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \boldsymbol{\omega})\, d\boldsymbol{\omega},$$
    we use the following reparameterization trick. Let $\boldsymbol{\epsilon} \sim q(\boldsymbol{\epsilon})$ be an auxiliary variable representing the stochasticity during the training, such that $\boldsymbol{\omega} = t(\boldsymbol{\epsilon}, \theta)$ for some function $t$. Assuming that $q_\theta(\boldsymbol{\omega} \mid \boldsymbol{\epsilon}) = \delta(\boldsymbol{\omega} - t(\boldsymbol{\epsilon}, \theta))$, we find:
    $$q_\theta(\boldsymbol{\omega}) = \int q_\theta(\boldsymbol{\omega} \mid \boldsymbol{\epsilon})\, q(\boldsymbol{\epsilon})\, d\boldsymbol{\epsilon} = \int \delta(\boldsymbol{\omega} - t(\boldsymbol{\epsilon}, \theta))\, q(\boldsymbol{\epsilon})\, d\boldsymbol{\epsilon},$$
    implying:
    $$\int q_\theta(\boldsymbol{\omega}) \log p(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \boldsymbol{\omega})\, d\boldsymbol{\omega} = \int\!\!\int \delta(\boldsymbol{\omega} - t(\boldsymbol{\epsilon}, \theta))\, q(\boldsymbol{\epsilon}) \log p(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \boldsymbol{\omega})\, d\boldsymbol{\omega}\, d\boldsymbol{\epsilon} = \int q(\boldsymbol{\epsilon}) \log p(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, t(\boldsymbol{\epsilon}, \theta))\, d\boldsymbol{\epsilon}.$$
    Computing the above integral by a Monte Carlo approach with a single sample $\widehat{\boldsymbol{\epsilon}} \sim q(\boldsymbol{\epsilon})$ yields:
    $$\int q_\theta(\boldsymbol{\omega}) \log p(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \boldsymbol{\omega})\, d\boldsymbol{\omega} \approx \log p(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \widehat{\boldsymbol{\omega}}),$$
    where $\widehat{\boldsymbol{\omega}} = t(\widehat{\boldsymbol{\epsilon}}, \theta)$. In our case and in view of Equations (A10)–(A12), we reparametrize the integrands by setting:
    $$\mathbf{W}_l = (\mathbf{M}_l + \sigma \mathbf{S}_l) \mathbf{Z}_{l-1} + \sigma \mathbf{S}_l (\mathbf{I}_{N_{l-1}} - \mathbf{Z}_{l-1}),$$
    $$\mathbf{b}_l = \mathbf{n}_l + \sigma \boldsymbol{\epsilon}_l,$$
    where:
    $$\mathbf{S}_l = [\mathbf{s}_{k,l}]_{k=1}^{N_{l-1}}, \quad \mathbf{s}_{k,l} \sim N(\mathbf{0}, \mathbf{I}_{N_l}),$$
    $$\mathbf{Z}_{l-1} = \mathrm{diag}[z_{k,l-1}]_{k=1}^{N_{l-1}}, \quad z_{k,l-1} \sim \mathrm{Bernoulli}(p),$$
    $$\boldsymbol{\epsilon}_l \sim N(\mathbf{0}, \mathbf{I}_{N_l}),$$
    to obtain:
    $$-\int q_\theta(\boldsymbol{\omega}) \log p(\mathcal{D} \mid \boldsymbol{\omega})\, d\boldsymbol{\omega} \approx -\sum_{n=1}^{N} \log p(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \widehat{\boldsymbol{\omega}}_n),$$
    where:
    $$\widehat{\boldsymbol{\omega}}_n = \{\widehat{\mathbf{W}}_l^{(n)}, \widehat{\mathbf{b}}_l^{(n)}\}_{l=1}^{L},$$
    $$\widehat{\mathbf{W}}_l^{(n)} = (\mathbf{M}_l + \sigma \widehat{\mathbf{S}}_l^{(n)}) \widehat{\mathbf{Z}}_{l-1}^{(n)} + \sigma \widehat{\mathbf{S}}_l^{(n)} (\mathbf{I}_{N_{l-1}} - \widehat{\mathbf{Z}}_{l-1}^{(n)}),$$
    $$\widehat{\mathbf{b}}_l^{(n)} = \mathbf{n}_l + \sigma \widehat{\boldsymbol{\epsilon}}_l^{(n)},$$
    for the realizations $\widehat{\mathbf{S}}_l^{(n)}$, $\widehat{\mathbf{Z}}_{l-1}^{(n)}$, and $\widehat{\boldsymbol{\epsilon}}_l^{(n)}$ given by Equations (A20)–(A22). Taking the limit $\sigma \to 0$, we find that the realizations $\widehat{\mathbf{W}}_l^{(n)}$ and $\widehat{\mathbf{b}}_l^{(n)}$ can be approximated as
    $$\widehat{\mathbf{W}}_l^{(n)} \approx \mathbf{M}_l \widehat{\mathbf{Z}}_{l-1}^{(n)}, \quad \widehat{\mathbf{b}}_l^{(n)} \approx \mathbf{n}_l.$$
  • In the case of $\mathbf{W}_l$, the second term in Equation (A13) is the KL divergence between a mixture of Gaussians and a single Gaussian, that is:
    $$\mathrm{KL}(q(\mathbf{W}_l) \,\|\, p(\mathbf{W}_l)) = \int q(\mathbf{W}_l) \log \frac{q(\mathbf{W}_l)}{p(\mathbf{W}_l)}\, d\mathbf{W}_l,$$
    where $q(\mathbf{W}_l)$ is as in Equations (A10) and (A11) and $p(\mathbf{W}_l) = \prod_{k=1}^{N_{l-1}} p(\mathbf{w}_{k,l})$ with $p(\mathbf{w}_{k,l}) = N(\mathbf{0}, \mathbf{I}_{N_l})$. This term can be evaluated by using the following result: for $K, N \in \mathbb{N}$, let $\mathbf{p} = (p_1, \ldots, p_K)$ be a probability vector, $q(\mathbf{x}) = \sum_{k=1}^{K} p_k N(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ with $\mathbf{x} \in \mathbb{R}^N$ a mixture of Gaussians with $K$ components, and $p(\mathbf{x}) = N(\mathbf{x}; \mathbf{0}, \mathbf{I}_N)$. Then, for sufficiently large $N$, we have the approximation:
    $$\mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x})) \approx \sum_{k=1}^{K} \big[ \boldsymbol{\mu}_k^T \boldsymbol{\mu}_k + \mathrm{tr}(\boldsymbol{\Sigma}_k) - N(1 + \log 2\pi) - \log |\boldsymbol{\Sigma}_k| \big].$$
    Consequently, for large numbers of hidden units $N_l$, $l = 1, \ldots, L$, we find:
    $$\mathrm{KL}(q(\mathbf{W}_l) \,\|\, p(\mathbf{W}_l)) \approx N_l N_{l-1} (\sigma^2 - \log(\sigma^2) - 1) + \frac{p}{2} \sum_{k=1}^{N_{l-1}} \mathbf{m}_{k,l}^T \mathbf{m}_{k,l} + C,$$
    where $C$ is a constant. In the case of $\mathbf{b}_l$, the KL divergence $\mathrm{KL}(q(\mathbf{b}_l) \,\|\, p(\mathbf{b}_l))$, where (cf. Equation (A12)) $q(\mathbf{b}_l) = N(\mathbf{n}_l, \sigma^2 \mathbf{I}_{N_l})$ and $p(\mathbf{b}_l) = N(\mathbf{0}, \mathbf{I}_{N_l})$, involves two single Gaussians and can be computed analytically as
    $$\mathrm{KL}(q(\mathbf{b}_l) \,\|\, p(\mathbf{b}_l)) = \frac{1}{2} \big[ \mathbf{n}_l^T \mathbf{n}_l + N_l (\sigma^2 - \log(\sigma^2) - 1) \big] + C.$$
Collecting all the results, we obtain:
$$F(\theta, \mathcal{D}) = -\sum_{n=1}^{N} \log p(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}, \widehat{\boldsymbol{\omega}}_n) + \frac{1}{2} \sum_{l=1}^{L} \big( p\, \|\mathbf{M}_l\|_2^2 + \|\mathbf{n}_l\|_2^2 \big).$$
Thus, the variational free energy $F(\theta, \mathcal{D})$ has the standard-form representation of a dropout loss function, namely the sum of a square loss term and an $L_2$ regularization term.
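As a simple illustration of this standard form, the following sketch evaluates such a loss for a Gaussian likelihood with variance $\sigma_y^2$ (Python/NumPy; the argument names and the specific likelihood are assumptions for the example, not the article's implementation):

```python
import numpy as np

def dropout_loss(y_true, y_pred, M_list, n_list, p, sigma_y=1.0):
    """Sum-of-squares data term (Gaussian likelihood with variance sigma_y**2,
    additive constants dropped) plus L2 penalties with weight p on the
    matrices M_l and weight 1 on the biases n_l."""
    data_term = 0.5 / sigma_y**2 * np.sum((np.asarray(y_true) - np.asarray(y_pred))**2)
    reg_term = 0.5 * (p * sum(np.sum(M * M) for M in M_list)
                      + sum(np.sum(n * n) for n in n_list))
    return data_term + reg_term
```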

Appendix D

In this appendix, we describe the uncertainty propagation based on assumed density filtering by following the analysis given in [58].
The feed-forward operation of a neural network can be written as (cf. Equations (2) and (3)):
$$\mathbf{y}_l = \mathbf{f}_l(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l) = \mathbf{g}_l(\mathbf{W}_l \mathbf{y}_{l-1} + \mathbf{b}_l),$$
where $\boldsymbol{\omega}_l = \{\mathbf{W}_l, \mathbf{b}_l\}$. Thus, each layer function $\mathbf{f}_l(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l)$ is a nonlinear transformation of the previous activation $\mathbf{y}_{l-1}$, parametrized by $\boldsymbol{\omega}_l$. The deep neural network can then be expressed as a succession of nonlinear layers:
$$\mathbf{f}(\mathbf{x}, \boldsymbol{\omega}) = \mathbf{f}_L(\mathbf{f}_{L-1}(\cdots \mathbf{f}_1(\mathbf{y}_0; \boldsymbol{\omega}_1) \cdots)).$$
To formalize the deep probabilistic model, we replace each activation, including input and output, by probability distributions. In particular, we assume that the joint density of all activations is given by
$$p(\mathbf{y}_{0:L}) = p(\mathbf{y}_0) \prod_{l=1}^{L} p(\mathbf{y}_l \mid \mathbf{y}_{l-1}), \quad p(\mathbf{y}_l \mid \mathbf{y}_{l-1}) = \delta(\mathbf{y}_l - \mathbf{f}_l(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l)), \quad p(\mathbf{y}_0) = \prod_{k=1}^{N_x} N(y_{k,0}; x_k, \sigma_{x_k}^2),$$
where $p(\mathbf{y}_{0:L}) = p(\mathbf{y}_0, \ldots, \mathbf{y}_L)$ and $\mathbf{C}_x^\delta = \mathrm{diag}[\sigma_{x_k}^2]_{k=1}^{N_x}$. Because this distribution is intractable, we apply the assumed density filtering approach to the network activations. The goal of this approach is to find a tractable approximation $q(\mathbf{y}_{0:L})$ of $p(\mathbf{y}_{0:L})$, that is:
$$p(\mathbf{y}_{0:L}) \approx q(\mathbf{y}_{0:L}),$$
where:
$$q(\mathbf{y}_{0:L}) = q(\mathbf{y}_0) \prod_{l=1}^{L} q(\mathbf{y}_l), \quad q(\mathbf{y}_l) = \prod_{k=1}^{N_l} N(y_{k,l}; \mu_{k,l}, v_{k,l}), \quad q(\mathbf{y}_0) = p(\mathbf{y}_0),$$
and $\mu_{k,l}$ and $v_{k,l}$ are the mean and variance of the activation of unit $k$ in layer $l$, respectively. Thus, starting from the input activation $q(\mathbf{y}_0) = p(\mathbf{y}_0)$, we approximate subsequent layer activations by independent Gaussian distributions $q(\mathbf{y}_l)$. To compute the approximant $q(\mathbf{y}_{0:L})$, we use an iterative process (layer by layer) initialized by $q(\mathbf{y}_0) = p(\mathbf{y}_0)$. In particular, for a layer $l \geq 1$, we assume that $q(\mathbf{y}_0), \ldots, q(\mathbf{y}_{l-1})$ are known, or equivalently, that $\{(\boldsymbol{\mu}_{l'}, \mathbf{v}_{l'})\}_{l'=0}^{l-1}$ are known, and aim to compute $(\boldsymbol{\mu}_l, \mathbf{v}_l)$, where $\mu_{k,l} = [\boldsymbol{\mu}_l]_k$ and $v_{k,l} = [\mathbf{v}_l]_k$. For this purpose, we take into account that the layer function $\mathbf{f}_l$ transforms the activation $\mathbf{y}_{l-1}$ into the distribution:
$$\widehat{p}(\mathbf{y}_{0:l}) = p(\mathbf{y}_l \mid \mathbf{y}_{l-1})\, q(\mathbf{y}_{0:l-1}) = p(\mathbf{y}_l \mid \mathbf{y}_{l-1}) \prod_{l'=0}^{l-1} q(\mathbf{y}_{l'}),$$
where $p(\mathbf{y}_l \mid \mathbf{y}_{l-1}) = \delta(\mathbf{y}_l - \mathbf{f}_l(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l))$ is the true posterior at layer $l$ and $q(\mathbf{y}_{0:l-1}) = \prod_{l'=0}^{l-1} q(\mathbf{y}_{l'})$ the previous approximating factor. Furthermore, we compute the first- and second-order moments of $\widehat{p}(\mathbf{y}_{0:l})$. This will be done in two steps. In the first step, we derive the moments of an activation variable $y_k$ that belongs to all layers excluding the last one, i.e., $y_k$ is an element of $\mathbf{y}_{0:l-1} = \{\mathbf{y}_0, \ldots, \mathbf{y}_{l-1}\}$, while in the second step, we assume that $y_k$ is an activation variable contained in the last layer $\mathbf{y}_l = \{y_{1,l}, \ldots, y_{N_l,l}\}$. Thus,
  • For $y_k \in \mathbf{y}_{0:l-1}$, we use the relations:
    $$\widehat{p}(\mathbf{y}_{0:l}) = \delta(\mathbf{y}_l - \mathbf{f}_l(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l))\, q(y_k)\, q(\mathbf{y}_{\bar{k}, 0:l-1}),$$
    and $\int \delta(x - x_0)\, dx = 1$ (yielding $\int \delta(\mathbf{y}_l - \mathbf{f}_l(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l))\, d\mathbf{y}_l = 1$) to obtain:
    $$\mathrm{E}_{\widehat{p}}(y_k) = \int \widehat{p}(\mathbf{y}_{0:l})\, y_k\, d\mathbf{y} = \int q(y_k)\, y_k\, dy_k = \mathrm{E}_{q(y_k)}(y_k),$$
    where $q(\mathbf{y}_{\bar{k}, 0:l-1})$ corresponds to the density of all variables in $\mathbf{y}_{0:l-1}$ excluding $y_k$, and $d\mathbf{y} = \prod_{l'=0}^{l} d\mathbf{y}_{l'}$, while:
  • For $y_k \in \mathbf{y}_l$, we use the relations:
    $$\widehat{p}(\mathbf{y}_{0:l}) = \delta(\mathbf{y}_{\bar{k},l} - \mathbf{f}_{\bar{k},l}(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l))\, \delta(y_k - f_{k,l}(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l))\, q(\mathbf{y}_{0:l-1}),$$
    and $\int x\, \delta(x - x_0)\, dx = x_0$ (yielding $\int y_k\, \delta(y_k - f_{k,l}(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l))\, dy_k = f_{k,l}(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l)$), to obtain:
    $$\mathrm{E}_{\widehat{p}}(y_k) = \int \widehat{p}(\mathbf{y}_{0:l})\, y_k\, d\mathbf{y} = \int q(\mathbf{y}_{l-1})\, f_{k,l}(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l)\, d\mathbf{y}_{l-1} = \mathrm{E}_{q(\mathbf{y}_{l-1})}(f_{k,l}(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l)).$$
Replacing y k with y k 2 and repeating the above arguments, we find
$$\mathrm{E}_{\widehat{p}}(y_k^2) = \mathrm{E}_{q(y_k)}(y_k^2), \quad y_k \in \mathbf{y}_{0:l-1},$$
$$\mathrm{E}_{\widehat{p}}(y_k^2) = \mathrm{E}_{q(\mathbf{y}_{l-1})}(f_{k,l}^2(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l)), \quad y_k \in \mathbf{y}_l.$$
From Equations (A31) and (A33), we see that for all layers except for the $l$th layer, the moments remain unchanged after the update. The moments for the $l$th layer are computed by means of Equations (A32) and (A34). For a linear activation function, i.e., $f_{k,l}(\mathbf{y}_{l-1}; \boldsymbol{\omega}_l) = \sum_{i=1}^{N_{l-1}} w_{ki,l}\, y_{i,l-1} + b_{k,l}$, the expressions of $\mu_{k,l}$ and $v_{k,l}$ are given by Equations (80) and (81), respectively, while for a ReLU activation function $\mathrm{ReLU}(x) = \max(0, x)$, these are given by Equations (80) and (81), respectively.
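For illustration, a minimal Python/NumPy sketch of this layer-by-layer moment matching is given below; it assumes fully connected layers with ReLU hidden activations and a linear output layer, and uses the standard closed-form Gaussian moments of the ReLU from [58] (illustrative code, not the article's FORTRAN implementation):

```python
import numpy as np
from scipy.stats import norm

def adf_affine(mu, v, W, b):
    """Moment matching for an affine layer: the mean transforms exactly and,
    with independent activations, the variance transforms with W squared."""
    return W @ mu + b, (W ** 2) @ v

def adf_relu(mu, v, eps=1e-12):
    """Gaussian moment matching for ReLU(x) = max(0, x) with x ~ N(mu, v)."""
    sigma = np.sqrt(np.maximum(v, eps))
    alpha = mu / sigma
    cdf, pdf = norm.cdf(alpha), norm.pdf(alpha)
    mu_out = mu * cdf + sigma * pdf                     # E[max(0, x)]
    second = (mu ** 2 + v) * cdf + mu * sigma * pdf     # E[max(0, x)^2]
    return mu_out, np.maximum(second - mu_out ** 2, 0.0)

def adf_forward(x, sigma_x, layers):
    """Propagate the input mean x and variance sigma_x**2 layer by layer."""
    mu, v = x, sigma_x ** 2
    for i, (W, b) in enumerate(layers):      # layers: list of (weights, biases)
        mu, v = adf_affine(mu, v, W, b)
        if i < len(layers) - 1:              # hidden layers: assumed ReLU,
            mu, v = adf_relu(mu, v)          # linear output layer assumed
    return mu, v
```

In this form, each layer maps a pair of mean and variance vectors to the next pair, which is exactly the iterative process described above.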

References

  1. Tibshirani, R. A Comparison of Some Error Estimates for Neural Network Models. Neural Comput. 1996, 8, 152–163.
  2. Loquercio, A.; Segu, M.; Scaramuzza, D. A General Framework for Uncertainty Estimation in Deep Learning. IEEE Robot. Autom. Lett. 2020, 5, 3153–3160.
  3. Oh, S.; Byun, J. Bayesian Uncertainty Estimation for Deep Learning Inversion of Electromagnetic Data. IEEE Geosci. Remote Sens. Lett. 2021, 1–5.
  4. Chevallier, F.; Chéruy, F.; Scott, N.A.; Chédin, A. A Neural Network Approach for a Fast and Accurate Computation of a Longwave Radiative Budget. J. Appl. Meteorol. 1998, 37, 1385–1397.
  5. Chevallier, F.; Morcrette, J.J.; Chéruy, F.; Scott, N.A. Use of a neural-network-based long-wave radiative-transfer scheme in the ECMWF atmospheric model. Q. J. R. Meteorol. Soc. 2000, 126, 761–776.
  6. Cornford, D.; Nabney, I.T.; Ramage, G. Improved neural network scatterometer forward models. J. Geophys. Res. Ocean. 2001, 106, 22331–22338.
  7. Krasnopolsky, V.M. The Application of Neural Networks in the Earth System Sciences; Springer: Dordrecht, The Netherlands, 2013.
  8. Efremenko, D.S. Discrete Ordinate Radiative Transfer Model With the Neural Network Based Eigenvalue Solver: Proof of Concept. Light Eng. 2021, 1, 56–62.
  9. Fan, Y.; Li, W.; Gatebe, C.K.; Jamet, C.; Zibordi, G.; Schroeder, T.; Stamnes, K. Atmospheric correction over coastal waters using multilayer neural networks. Remote Sens. Environ. 2017, 199, 218–240.
  10. Fan, C.; Fu, G.; Noia, A.D.; Smit, M.; Rietjens, J.H.; Ferrare, R.A.; Burton, S.; Li, Z.; Hasekamp, O.P. Use of A Neural Network-Based Ocean Body Radiative Transfer Model for Aerosol Retrievals from Multi-Angle Polarimetric Measurements. Remote Sens. 2019, 11, 2877.
  11. Gao, M.; Franz, B.A.; Knobelspiesse, K.; Zhai, P.W.; Martins, V.; Burton, S.; Cairns, B.; Ferrare, R.; Gales, J.; Hasekamp, O.; et al. Efficient multi-angle polarimetric inversion of aerosols and ocean color powered by a deep neural network forward model. Atmos. Meas. Tech. 2021, 14, 4083–4110.
  12. Shi, C.; Hashimoto, M.; Shiomi, K.; Nakajima, T. Development of an Algorithm to Retrieve Aerosol Optical Properties Over Water Using an Artificial Neural Network Radiative Transfer Scheme: First Result From GOSAT-2/CAI-2. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9861–9872.
  13. Jiménez, C.; Eriksson, P.; Murtagh, D. Inversion of Odin limb sounding submillimeter observations by a neural network technique. Radio Sci. 2003, 38, 27-1–27-8.
  14. Holl, G.; Eliasson, S.; Mendrok, J.; Buehler, S.A. SPARE-ICE: Synergistic ice water path from passive operational sensors. J. Geophys. Res. Atmos. 2014, 119, 1504–1523.
  15. Strandgren, J.; Bugliaro, L.; Sehnke, F.; Schröder, L. Cirrus cloud retrieval with MSG/SEVIRI using artificial neural networks. Atmos. Meas. Tech. 2017, 10, 3547–3573.
  16. Efremenko, D.S.; Loyola R, D.G.; Hedelt, P.; Spurr, R.J.D. Volcanic SO2 plume height retrieval from UV sensors using a full-physics inverse learning machine algorithm. Int. J. Remote Sens. 2017, 38, 1–27.
  17. Wang, D.; Prigent, C.; Aires, F.; Jimenez, C. A Statistical Retrieval of Cloud Parameters for the Millimeter Wave Ice Cloud Imager on Board MetOp-SG. IEEE Access 2017, 5, 4057–4076.
  18. Brath, M.; Fox, S.; Eriksson, P.; Harlow, R.C.; Burgdorf, M.; Buehler, S.A. Retrieval of an ice water path over the ocean from ISMAR and MARSS millimeter and submillimeter brightness temperatures. Atmos. Meas. Tech. 2018, 11, 611–632.
  19. Håkansson, N.; Adok, C.; Thoss, A.; Scheirer, R.; Hörnquist, S. Neural network cloud top pressure and height for MODIS. Atmos. Meas. Tech. 2018, 11, 3177–3196.
  20. Hedelt, P.; Efremenko, D.S.; Loyola, D.G.; Spurr, R.; Clarisse, L. Sulfur dioxide layer height retrieval from Sentinel-5 Precursor/TROPOMI using FP_ILM. Atmos. Meas. Tech. 2019, 12, 5503–5517.
  21. Noia, A.D.; Hasekamp, O.P.; van Harten, G.; Rietjens, J.H.H.; Smit, J.M.; Snik, F.; Henzing, J.S.; de Boer, J.; Keller, C.U.; Volten, H. Use of neural networks in ground-based aerosol retrievals from multi-angle spectropolarimetric observations. Atmos. Meas. Tech. 2015, 8, 281–299.
  22. Noia, A.D.; Hasekamp, O.P.; Wu, L.; van Diedenhoven, B.; Cairns, B.; Yorks, J.E. Combined neural network/Phillips–Tikhonov approach to aerosol retrievals over land from the NASA Research Scanning Polarimeter. Atmos. Meas. Tech. 2017, 10, 4235–4252.
  23. Aires, F. Neural network uncertainty assessment using Bayesian statistics with application to remote sensing: 1. Network weights. J. Geophys. Res. 2004, 109.
  24. Aires, F. Neural network uncertainty assessment using Bayesian statistics with application to remote sensing: 2. Output errors. J. Geophys. Res. 2004, 109.
  25. Pfreundschuh, S.; Eriksson, P.; Duncan, D.; Rydberg, B.; Håkansson, N.; Thoss, A. A neural network approach to estimating a posteriori distributions of Bayesian retrieval problems. Atmos. Meas. Tech. 2018, 11, 4627–4643.
  26. Arnold, V. On the representation of functions of several variables as a superposition of functions of a smaller number of variables. Math. Teach. Appl. Hist. Matem. Prosv. Ser. 2 1958, 3, 41–61.
  27. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control. Signals Syst. 1989, 2, 303–314.
  28. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
  29. Nix, D.; Weigend, A. Estimating the mean and variance of the target probability distribution. In Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN–94), Orlando, FL, USA, 28 June–2 July 1994.
  30. Wright, W.; Ramage, G.; Cornford, D.; Nabney, I. Neural Network Modelling with Input Uncertainty: Theory and Application. J. VLSI Signal Process. 2000, 26, 169–188.
  31. LeCun, Y.; Denker, J.; Solla, S. Optimal Brain Damage. In Advances in Neural Information Processing Systems; Touretzky, D., Ed.; Morgan-Kaufmann: Burlington, MA, USA, 1990; Volume 2, pp. 598–605.
  32. MacKay, D.J.C. A Practical Bayesian Framework for Backpropagation Networks. Neural Comput. 1992, 4, 448–472.
  33. Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight Uncertainty in Neural Networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, ICML'15, Lille, France, 6–11 July 2015; pp. 1613–1622.
  34. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2014, arXiv:1312.6114.
  35. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580.
  36. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
  37. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv 2016, arXiv:1506.02142.
  38. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167.
  39. Teye, M.; Azizpour, H.; Smith, K. Bayesian Uncertainty Estimation for Batch Normalized Deep Networks. arXiv 2018, arXiv:1802.06455.
  40. Efremenko, D.S.; Molina García, V.; Gimeno García, S.; Doicu, A. A review of the matrix-exponential formalism in radiative transfer. J. Quant. Spectrosc. Radiat. Transf. 2017, 196, 17–45.
  41. Efremenko, D.; Kokhanovsky, A. Foundations of Atmospheric Remote Sensing; Springer: Cham, Switzerland, 2021.
  42. Efremenko, D.; Doicu, A.; Loyola, D.; Trautmann, T. Acceleration techniques for the discrete ordinate method. J. Quant. Spectrosc. Radiat. Transf. 2013, 114, 73–81.
  43. Goody, R.; West, R.; Chen, L.; Crisp, D. The correlated-k method for radiation calculations in nonhomogeneous atmospheres. J. Quant. Spectrosc. Radiat. Transf. 1989, 42, 539–550.
  44. Efremenko, D.; Doicu, A.; Loyola, D.; Trautmann, T. Optical property dimensionality reduction techniques for accelerated radiative transfer performance: Application to remote sensing total ozone retrievals. J. Quant. Spectrosc. Radiat. Transf. 2014, 133, 128–135.
  45. Molina García, V.; Sasi, S.; Efremenko, D.S.; Doicu, A.; Loyola, D. Radiative transfer models for retrieval of cloud parameters from EPIC/DSCOVR measurements. J. Quant. Spectrosc. Radiat. Transf. 2018, 213, 228–240.
  46. del Águila, A.; Efremenko, D.S.; Trautmann, T. A Review of Dimensionality Reduction Techniques for Processing Hyper-Spectral Optical Signal. Light Eng. 2019, 27, 85–98.
  47. Schreier, F.; Gimeno García, S.; Hedelt, P.; Hess, M.; Mendrok, J.; Vasquez, M.; Xu, J. GARLIC—A general purpose atmospheric radiative transfer line-by-line infrared-microwave code: Implementation and evaluation. J. Quant. Spectrosc. Radiat. Transf. 2014, 137, 29–50.
  48. Schreier, F. Optimized implementations of rational approximations for the Voigt and complex error function. J. Quant. Spectrosc. Radiat. Transf. 2011, 112, 1010–1025.
  49. Gordon, I.; Rothman, L.; Hill, C.; Kochanov, R.; Tan, Y.; Bernath, P.; Birk, M.; Boudon, V.; Campargue, A.; Chance, K.; et al. The HITRAN2016 molecular spectroscopic database. J. Quant. Spectrosc. Radiat. Transf. 2017, 203, 3–69.
  50. Bodhaine, B.A.; Wood, N.B.; Dutton, E.G.; Slusser, J.R. On Rayleigh Optical Depth Calculations. J. Atmos. Ocean. Technol. 1999, 16, 1854–1861.
  51. Anderson, G.; Clough, S.; Kneizys, F.; Chetwynd, J.; Shettle, E. AFGL Atmospheric Constituent Profiles (0–120 km); AFGL-TR-86-0110; Air Force Geophysics Laboratory: Hanscom Air Force Base, MA, USA, 1986.
  52. Loyola R, D.G.; Pedergnana, M.; García, S.G. Smart sampling and incremental function learning for very large high dimensional data. Neural Netw. 2016, 78, 75–87.
  53. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
  54. Rodgers, C. Inverse Methods for Atmospheric Sounding: Theory and Practice; World Scientific Publishing: Singapore, 2000.
  55. Doicu, A.; Trautmann, T.; Schreier, F. Numerical Regularization for Atmospheric Inverse Problems; Springer: Berlin, Germany, 2010.
  56. Tresp, V.; Ahmad, S.; Neuneier, R. Training Neural Networks with Deficient Data. In Advances in Neural Information Processing Systems; Cowan, J., Tesauro, G., Alspector, J., Eds.; Morgan-Kaufmann: Burlington, MA, USA, 1994; Volume 6, pp. 128–135.
  57. Kendall, A.; Gal, Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? arXiv 2017, arXiv:1703.04977.
  58. Gast, J.; Roth, S. Lightweight Probabilistic Deep Networks. arXiv 2018, arXiv:1805.11327.
  59. Jaulin, L.; Kieffer, M.; Didrit, O.; Walter, É. Applied Interval Analysis; Springer: London, UK, 2001.
  60. Kearfott, R.B.; Dawande, M.; Du, K.; Hu, C. Algorithm 737: INTLIB—a portable Fortran 77 interval standard-function library. ACM Trans. Math. Softw. 1994, 20, 447–459.
Figure 1. Retrieval results for the direct problem. The plots in the upper panels show the histograms of the relative error over the prediction set, while the lower plots show the predicted values (red) and the uncertainty intervals (gray) versus the true values.
Figure 2. Retrieval results obtained with Method 1 using a deterministic network.
Figure 3. Retrieval results obtained with Method 1 using a Bayes-by-backprop network.
Figure 4. Retrieval results obtained with Method 1 using a dropout network.
Figure 5. Retrieval results obtained with Method 1 using a batch normalized network.
Figure 6. Retrieval results obtained with Method 2 using a dropout network.
Figure 7. Retrieval results obtained with Method 2 using a Bayes-by-backprop network.
Figure 8. Retrieval results obtained with Method 3 using assumed density filtering.
Figure 9. Retrieval results obtained with Method 3 using interval arithmetic. For a practical implementation of the algorithm, we use the interval arithmetic library INTLIB [60].
Table 1. Average relative error $\mathrm{E}(\varepsilon_x) \pm \sqrt{\mathrm{E}([\varepsilon_x - \mathrm{E}(\varepsilon_x)]^2)}$ and standard deviation $\mathrm{E}(\sigma_x)$ over the prediction set for the methods used to solve the direct and inverse problems. The parameter $x$ stands for the cloud optical thickness $\tau$ and cloud top height $H$. In the case of the inverse problem, the results correspond to Method 1 with a deterministic network (1a), Bayes-by-backprop (1b), dropout (1c), and batch normalization (1d); Method 2 with dropout (2a) and Bayes-by-backprop (2b); and Method 3 with assumed density filtering (3a) and interval arithmetic (3b).

Method         | x | Error                        | Std. Deviation
Direct problem | τ | 2.94 × 10⁻² ± 6.67 × 10⁻²    | 2.04 × 10⁻¹
Direct problem | H | 5.87 × 10⁻² ± 1.72 × 10⁻¹    | 3.66 × 10⁻¹
1a             | τ | 2.78 × 10⁻² ± 1.30 × 10⁻¹    | 8.92 × 10⁻¹
1a             | H | 1.96 × 10⁻² ± 1.29 × 10⁻¹    | 4.36 × 10⁻¹
1b             | τ | 3.39 × 10⁻² ± 1.66 × 10⁻¹    | 9.23 × 10⁻¹
1b             | H | 1.83 × 10⁻² ± 1.33 × 10⁻¹    | 8.11 × 10⁻¹
1c             | τ | 8.75 × 10⁻³ ± 2.25 × 10⁻²    | 3.23 × 10⁻¹
1c             | H | 3.45 × 10⁻³ ± 4.11 × 10⁻²    | 2.31 × 10⁻¹
1d             | τ | 1.01 × 10⁻² ± 2.41 × 10⁻²    | 3.54 × 10⁻¹
1d             | H | 4.37 × 10⁻³ ± 2.73 × 10⁻²    | 2.29 × 10⁻¹
2a             | τ | 1.16 × 10⁻² ± 4.05 × 10⁻²    | 8.24 × 10⁻¹
2a             | H | 1.63 × 10⁻² ± 4.21 × 10⁻²    | 6.72 × 10⁻¹
2b             | τ | 3.10 × 10⁻² ± 1.38 × 10⁻¹    | 9.63 × 10⁻¹
2b             | H | 3.31 × 10⁻² ± 4.57 × 10⁻²    | 4.88 × 10⁻¹
3a             | τ | 6.67 × 10⁻³ ± 2.48 × 10⁻²    | 6.55 × 10⁻¹
3a             | H | 8.18 × 10⁻⁴ ± 3.18 × 10⁻²    | 4.76 × 10⁻¹
3b             | τ | 7.08 × 10⁻³ ± 2.53 × 10⁻²    | 7.82 × 10⁻¹
3b             | H | 1.56 × 10⁻³ ± 3.33 × 10⁻²    | 6.53 × 10⁻¹
