Deep Gaussian Processes for Biogeophysical Parameter Retrieval and Model Inversion

Parameter retrieval and model inversion are key problems in remote sensing and Earth observation. Currently, different approaches exist: a direct, yet costly, inversion of radiative transfer models (RTMs); statistical inversion with in situ data, which often suffers from extrapolation problems outside the study area; and the most widely adopted hybrid modeling, by which statistical models, mostly nonlinear and non-parametric machine learning algorithms, are applied to invert RTM simulations. We focus on the latter. Among the available algorithms, in the last decade kernel-based methods, and Gaussian Processes (GPs) in particular, have provided useful and informative solutions to such RTM inversion problems, in large part due to their predictive accuracy and the confidence intervals they provide. However, RTMs are very complex, highly nonlinear, and typically hierarchical models, so a shallow GP model often cannot capture the complex feature relations needed for inversion. This motivates the use of deeper hierarchical architectures, while still preserving the desirable properties of GPs. This paper introduces the use of deep Gaussian Processes (DGPs) for biogeophysical model inversion. Unlike shallow GP models, DGPs account for complicated (modular, hierarchical) processes, provide an efficient solution that scales well to big datasets, and improve prediction accuracy over their single-layer counterpart. In the experimental section, we provide empirical evidence of performance for the estimation of surface temperature and dew point temperature from infrared sounding data, as well as for the prediction of chlorophyll content, inorganic suspended matter, and coloured dissolved organic matter from multispectral data acquired by the Sentinel-3 OLCI sensor. The presented methodology allows for more expressive forms of GPs in remote sensing model inversion problems.


Introduction
Estimating variables and bio-geophysical parameters of interest from remote sensing images is a central problem in Earth observation [1,2,3]. This is usually addressed through a very challenging model inversion problem, which involves dealing with complex nonlinear input-output relations. In addition, very often, the goal is to invert metamodels, that is, combinations of submodels that are coupled together. In remote sensing, radiative transfer models (RTMs) describe the processes which occur at different scales (e.g. at leaf, canopy and atmospheric levels) with different complexities. The overall process is thus complicated, nonlinear and hierarchical, with different sources of uncertainty propagating through the system.
The inversion of such highly complex models has been attempted through several strategies. One standard approach consists of running a reasonable number of RTM simulations, which generates the so-called look-up tables (LUTs). Then, for a new input observation, one assigns the parameter of the most similar LUT entry. A second approach involves direct physics-based inversion, which results in complex optimization problems. An alternative hybrid approach uses statistical methods to perform the inversion on the LUT simulations. A review of approaches can be found in [4,3]. In recent years, the remote sensing community has turned to this type of statistical hybrid approach for model inversion [3], mainly because of its efficiency, versatility, and the interesting balance between its data-driven and physics-aware nature [5].
Approximating arbitrary nonlinear functions from data is a solid field of machine learning where many successful methods are available. Data-driven statistical learning algorithms have attained outstanding results in the estimation of climate variables and related geophysical parameters at local and global scales [6,3]. These algorithms avoid complicated assumptions and provide flexible non-parametric models that fit the observations using large heterogeneous data. A plethora of regression algorithms has been used, including traditional models such as random forests [7,8], standard feed-forward neural networks [9,10,11], and convolutional neural networks [12,13].
In the last decade, more emphasis has been put on kernel methods in general [6,14], and Gaussian Processes (GPs) in particular. There are several reasons for this. Firstly, GPs constitute a probabilistic treatment of regression problems leading to an analytical expression for the predictive uncertainty, which is an attractive feature [15,16]. This also allows for effective error propagation from the inputs to the outputs, as has recently been shown in [17]. Furthermore, GPs are not pure black-box models because, through the use and design of appropriate covariance functions, one can include prior knowledge about the signal characteristics (e.g. nonstationarity, heteroscedasticity, etc.). The covariance hyperparameters are learned (inferred) from data, so the model is interpretable. For instance, by using the automatic relevance determination (ARD) covariance function [18], an automatic feature ranking can be derived from the trained model, thus leading to an explanatory model. These theoretical and practical advantages have recently translated into a wider adoption by the geoscience and remote sensing community in many applications and products, such as the spatialization of in-situ measurements and the upscaling of carbon, energy and water fluxes [19].
Gaussian Processes have provided very good results for retrieval in all Earth science domains, be it land and vegetation parameter retrieval [20,21,22,23], ocean and water bodies modeling [24,25], cryosphere ice sheet modeling and process emulation [26], or atmospheric parameter retrieval [27].
Despite being successful in many different applications, standard GPs have two important shortcomings we want to highlight:

• Computational cost. A standard GP, which stores and uses all the data at once, exhibits a high computational cost: it scales cubically with the number of data points when training, and quadratically when predicting. This hampers its adoption in applications involving more than a few thousand input points.

• Expressiveness. GPs are shallow models, so while accurate and flexible, their expressive power is limited when dealing with hierarchical structures. This is aggravated by the (ab)use of standard kernel functions like the exponentiated quadratic family (e.g. the RBF kernel is infinitely differentiable and tends to oversmooth functions).
The first limitation is typically addressed through sparse GPs [29], which have already been used in remote sensing applications [30]. In order to additionally tackle the second limitation, in this paper we introduce the use of Deep Gaussian Processes (DGPs) [31] to the field of remote sensing for the first time. A DGP is a cascaded, hierarchical model that captures more complex data structures, while still scaling well to millions of points. Our proposal is not incidental: the complexity of the processes involved in geosciences and remote sensing leads to highly hierarchical and modular models to be inverted. This calls for the application of the most innovative available techniques, as shown in the following example. Fig. 1 compares a standard GP and a deep GP when modeling a hurricane structure. It becomes clear that, unlike the GP, the DGP can cope with the whirl structure efficiently by combining different latent functions hierarchically.

Figure 1: Example of shallow versus deep GPs. Modeling a hurricane field from the coordinates using 1000 randomly selected training points in both cases.
In [35], we outlined the potential use of DGPs for surface level dew point temperature retrieval from sounding data. In this paper we extend that work in several ways: 1) we focus the analysis on large scale remote sensing problems, aiming for a complete treatment of the two aforementioned standard GP shortcomings; 2) we provide a deeper formalization and more intuitive insight on the model for the practitioner; 3) we give more empirical evidence of performance in ocean and land parameter retrieval applications, using different sensor data (optical sensors and microwave sounders); and 4) we assess accuracy and robustness to sample size and problem dimensionality versus both standard and sparse implementations of GPs.
In short, this work exposes the DGP methodology to the remote sensing community for the first time and for a wide range of applications. The proposed DGP appears to be an excellent approach for model inversion.
Moreover, sticking to the GP framework is very convenient. GPs are based on a solid Bayesian formalism and inherit all properties of a probabilistic treatment: possibility to derive not only point-wise predictions but also confidence intervals, perform error quantification and uncertainty propagation easily, and optimize hyperparameters by log-likelihood maximization.
The remainder of the paper is organized as follows. Section 2 establishes notation, reviews the probabilistic modeling and inference of GPs and sparse GPs, and presents the deep GP model; mathematical details on modeling, inference, and prediction are provided in Appendices A, B, and C. Section 3 provides the experimental results. We illustrate performance in the prediction of surface temperature and dew point temperature (related to relative humidity) from superspectral infrared sounding data [36,37,38], as well as in the prediction of chlorophyll content, inorganic suspended matter, and coloured dissolved organic matter from simulated multispectral data acquired by the Sentinel-3 OLCI sensor. Finally, Section 4 concludes the paper with summarizing remarks.

Probabilistic Model and Inference
In this section we provide a brief and graphical introduction to modeling and inference for Gaussian Processes (GPs) and Deep Gaussian Processes (DGPs) in supervised regression problems. We explain and graphically show the hierarchical structure of DGPs, and also explain how both GPs and DGPs make use of sparse approximations to perform inference. Mathematical details are deferred to the appendices.
Gaussian Processes are non-parametric probabilistic state-of-the-art models for functions, successfully used in supervised learning. In geostatistics, GP regression is usually referred to as kriging. The main strength of GPs is their accurate uncertainty quantification, which is a consequence of their sound Bayesian formulation, yielding well-calibrated predictions and confidence intervals [39,34].
More specifically, for input-output data $\{(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \mathbb{R}\}_{i=1}^n$, a GP models the underlying dependence with latent variables $\{f_i = f(\mathbf{x}_i) \in \mathbb{R}\}_{i=1}^n$ that jointly follow a Gaussian distribution $p(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{0}, \mathbf{K})$. The kernel matrix $\mathbf{K} = (k(\mathbf{x}_i, \mathbf{x}_j))_{i,j}$ encodes the properties (e.g. smoothness) of the modeled functions. The most popular standard kernel is the squared exponential (or RBF) one, given by $k(\mathbf{x}, \mathbf{y}) = \gamma \cdot \exp\left(-\|\mathbf{x} - \mathbf{y}\|^2 / (2\sigma^2)\right)$, with $\gamma$ (variance) and $\sigma$ (length-scale) called the kernel hyperparameters. Finally, in regression problems, the observation model of the outputs $y_i$ given the latent variables $f_i$ is usually defined by the Gaussian $p(y_i|f_i, \rho^2) = \mathcal{N}(y_i|f_i, \rho^2)$. The variance $\rho^2$ is estimated during the training step, along with the kernel hyperparameters, by maximizing the marginal likelihood of the observed data.

Since the Gaussian prior $p(\mathbf{f})$ is conjugate to the Gaussian observation model, one can integrate out $\mathbf{f}$ and compute the marginal likelihood $p(\mathbf{y})$ and the posterior $p(\mathbf{f}|\mathbf{y})$ in closed form (parameters are omitted for simplicity) [39]. However, this requires inverting the $n \times n$ matrix $(\mathbf{K} + \rho^2 \mathbf{I})$, which scales cubically, $\mathcal{O}(n^3)$, with the number of training points $n$. This constraint makes standard GPs prohibitive for large scale applications, with $n = 10^4$ usually considered the practical limit [40]. Here, sparse GP approximations become the preferred pathway to scale the desirable properties of GPs to larger datasets [29,41,42,40]; they will be reviewed in Section 2.1. Interestingly, we will see that DGPs preserve the scalability of sparse GP approximations (while achieving a higher expressiveness).
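To make the computations above concrete, the following minimal numpy sketch (our own illustration, not the paper's code) implements exact GP regression with the RBF kernel: it builds $\mathbf{K}$, solves the $n \times n$ system via a Cholesky factorization (the $\mathcal{O}(n^3)$ step), and returns the predictive mean and variance at test inputs.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0, sigma=1.0):
    """Squared exponential kernel k(x, y) = gamma * exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return gamma * np.exp(-sq_dists / (2 * sigma**2))

def gp_regression(X, y, X_star, gamma=1.0, sigma=1.0, rho2=0.1):
    """Exact GP posterior mean/variance; the Cholesky step costs O(n^3)."""
    K = rbf_kernel(X, X, gamma, sigma)
    L = np.linalg.cholesky(K + rho2 * np.eye(len(X)))     # O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + rho2 I)^{-1} y
    K_star = rbf_kernel(X_star, X, gamma, sigma)
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = rbf_kernel(X_star, X_star, gamma, sigma).diagonal() - np.sum(v**2, 0) + rho2
    return mean, var

# toy usage: a noisy sine
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (50, 1)); y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
mu, var = gp_regression(X, y, np.linspace(0, 5, 100)[:, None])
```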
Additionally, GPs are limited by the expressiveness of the kernel function. Ideally, complex kernels could be tailored for different applications [39]. However, this is usually unfeasible in practice, as it requires thorough application-specific knowledge. Moreover, it usually comes with a large number of hyperparameters to estimate, which may cause overfitting. As a result, standard general-purpose kernels are normally considered in practice. Alternatively, DGPs allow for modeling very complex data through a hierarchy of GPs that only use simple kernels with few hyperparameters as building blocks (like the aforementioned RBF one, which will be used here).
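As a brief illustration of why hand-tailoring kernels is burdensome, composite kernels are typically built by summing and multiplying simple ones (both operations preserve positive definiteness), with the hyperparameter count growing accordingly. A hedged numpy sketch, using our own toy construction with seven hyperparameters for just two kernel families:

```python
import numpy as np

def rbf(A, B, gamma, sigma):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return gamma * np.exp(-sq / (2 * sigma**2))

def periodic(A, B, gamma, period, ls):
    d = np.abs(A - B.T)                      # assumes 1-D inputs of shape (n, 1)
    return gamma * np.exp(-2 * np.sin(np.pi * d / period)**2 / ls**2)

# a composite kernel: already 7 hyperparameters for two simple components
def composite(A, B):
    return rbf(A, B, 1.0, 0.5) + rbf(A, B, 0.3, 2.0) * periodic(A, B, 1.0, 1.0, 0.7)
```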

Sparse GP approximations
In the last years, many different sparse GP approximations have been introduced in order to cope with increasingly large datasets [29,41,42,40].

Figure 2: Samples from GP/DGP priors. Every layer is endowed with a standard RBF kernel. This produces very smooth functions in the first layer (i.e. a shallow GP, top plot). However, the concatenation of such simple GPs produces increasingly complex functions (middle and bottom plots, 2-layer and 3-layer DGPs respectively). In particular, notice that DGP-3 captures sophisticated patterns that combine flat regions with high-variability ones, which cannot be described by stationary kernels. These ideas are behind the superiority of DGPs in Fig. 1.
Most of them resort to the notion of inducing points, a reduced set of $m \ll n$ latent variables on which the inference is based. More specifically, these inducing points $\mathbf{u} = (u_1, \ldots, u_m)$ are GP realizations at the inducing locations $\mathbf{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_m\} \subset \mathbb{R}^d$, just as $\mathbf{f}$ is at the inputs $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. All these sparse methods fall into two broad categories, depending on where exactly the approximation takes place: in the model definition or in the inference procedure [42]. Both types of sparse GP will be compared against the deep GP in the experiments.
In the first group, the Fully Independent Training Conditional (FITC) [29] is the most popular approach. It uses the inducing points to approximate the GP model, then marginalizes them out and performs exact inference. This yields a reduced $\mathcal{O}(nm^2)$ computational complexity (linear in the dataset size). Mathematical details for FITC are included in Appendix A.
In the second group, the Scalable Variational Gaussian Process (SVGP) [41] is one of the most widespread methods. It maintains the exact GP model, and uses the inducing points to introduce the approximation in the inference process through variational inference [43]. The mathematical details are included in Appendix B (which is devoted to DGPs because, as explained in the next paragraph, SVGP is equivalent to a DGP with one layer). Since SVGP does not modify the original model, it is less prone to overfitting. However, if the posterior distribution is not well approximated within the variational scheme, its performance might become poorer. Therefore, both groups of methods are complementary, and in the machine learning community neither is considered to consistently outperform the other [42]. An advantage of SVGP over FITC is its factorization in mini-batches, which allows for even greater scalability. In this case, the computational cost is $\mathcal{O}(n_b m^2)$, with $n_b$ the mini-batch size.

Table 1: Summary of the main differences between the four GP-based models used in this work. VI = Variational Inference.
Interestingly, the second paradigm (exact model plus approximate inference) has proven to translate well to hierarchical concatenations of GPs, yielding the inference process for DGPs presented in the next section.
Accordingly, SVGP will be equivalently referred to as DGP (L = 1) hereafter. This is also graphically depicted in Fig. 3. Moreover, as explained before, Fig. 3 shows that u is integrated out in FITC after the model approximation, whereas it is maintained in DGP (L = 1), where an (approximate) posterior distribution is calculated for it. As a general summary, Table 1 shows the main differences between the four GP-based methods used in this work (standard GP, sparse GP FITC, sparse GP SVGP, and deep GP), which are also represented in Fig. 3.
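To illustrate the mini-batch training regime described above, the following sketch fits an SVGP model following the documented GPflow 2.x pattern (GPflow is the library used in Section 2.3). The toy data, kernel choice, number of inducing points, batch size, and step count are our own illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
import gpflow

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (100_000, 1)); Y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)

Z = X[rng.choice(len(X), 100, replace=False)].copy()    # m = 100 inducing locations
model = gpflow.models.SVGP(
    kernel=gpflow.kernels.SquaredExponential(),
    likelihood=gpflow.likelihoods.Gaussian(),
    inducing_variable=Z,
    num_data=len(X),                                    # rescales the mini-batch ELBO
)

dataset = tf.data.Dataset.from_tensor_slices((X, Y)).repeat().shuffle(10_000).batch(256)
loss = model.training_loss_closure(iter(dataset))       # stochastic ELBO, O(n_b m^2)
opt = tf.optimizers.Adam(0.01)
for _ in range(2_000):
    opt.minimize(loss, model.trainable_variables)

mean, var = model.predict_y(np.linspace(0, 5, 100)[:, None])
```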

Deep Gaussian Processes
In standard (single-layer) GPs, the output of the GP is directly used to model the observed response y. However, this output could equally be used as the input of another GP, and the concatenation of L such GPs defines a deep GP. Performing inference on this hierarchy requires approximations, and assuming posterior independence between the latent variables of different layers is not realistic in practice. To overcome this problem, we present the recent inference procedure of [31], which keeps a strong conditional dependence in the posterior by marginalizing out the aforementioned set of latent variables.
In exchange, analytic tractability is sacrificed. However, we will see that the structure of the posterior allows one to efficiently sample from it and use Monte Carlo approximations. As will be justified in Appendix B, this approach is called Doubly Stochastic Variational Inference [31].
DGPs can be used for regression by placing a Gaussian likelihood after the last layer. For notational simplicity, in the sequel the dimensions of the hidden layers will be fixed to one (this can be generalized straightforwardly, see [33,31]). However, exact inference in DGPs is intractable (not only computationally expensive, as in GPs), as it involves integrating out latent variables that are used as inputs to the next layer (i.e. they appear inside a complex kernel matrix). To overcome this, $m$ inducing points $\mathbf{u}^l$ at inducing locations $\mathbf{z}^{l-1}$ are introduced at each layer $l$. Interestingly, we will see that this sparse formulation also makes DGPs scale well to large datasets, transferring the scalability of (shallow) sparse GP approximations like SVGP to hierarchical structures.
For observed $\{\mathbf{X}, \mathbf{y}\}$, the joint DGP regression model is

$$p(\mathbf{y}, \{\mathbf{f}^l, \mathbf{u}^l\}_{l=1}^L) = p(\mathbf{y}|\mathbf{f}^L) \prod_{l=1}^{L} p(\mathbf{f}^l|\mathbf{u}^l; \mathbf{f}^{l-1}, \mathbf{z}^{l-1}) \, p(\mathbf{u}^l; \mathbf{z}^{l-1}).$$

Here, $\mathbf{f}^0 = \mathbf{X}$, and each factor in the product is the joint distribution over $(\mathbf{f}^l, \mathbf{u}^l)$ of a GP on the inputs $(\mathbf{f}^{l-1}, \mathbf{z}^{l-1})$, but rewritten as the conditional probability given $\mathbf{u}^l$. Notice that a semicolon is used to specify the inputs of the GP. The rightmost plot in Fig. 3 shows a graphical representation of the described model.
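The hierarchy can be visualized by sampling from the prior of the model above: each layer draws a GP sample evaluated at the previous layer's output. A minimal numpy sketch of this forward sampling (our own illustration; it mirrors the behaviour described in the DGP prior figure of Section 2.1):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0, sigma=0.3):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return gamma * np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 300)[:, None]

f = x                                              # f^0 = X
for layer in range(3):                             # one 3-layer DGP prior sample
    K = rbf_kernel(f, f) + 1e-8 * np.eye(len(f))   # jitter for numerical stability
    f = np.linalg.cholesky(K) @ rng.standard_normal((len(f), 1))
# f now holds a DGP-3 prior sample at x: smooth building blocks composed
# into a function combining flat and highly variable regions.
```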
The Doubly Stochastic Variational Inference for this model is detailed in Appendix B, see also [31]. Basically, assuming that the inducing points suffice to summarize the information contained in the training data, the model log-likelihood can be lower bounded by a quantity (called the Evidence Lower Bound, ELBO) that factorizes across data points. This allows for training in mini-batches, just as in SVGP, which makes DGPs scalable to large datasets. Finally, the prediction of a DGP for a new test data point is described in Appendix C.

Implementation and practicalities
Several implementations of DGPs are currently available. In our experiments, we used the code integrated within GPflow (a GP framework built on top of TensorFlow), which is publicly available at https://github.com/ICL-SML/Doubly-Stochastic-DGP. We also used GPflow to train the standard GP and both sparse GP approaches: FITC and SVGP (equivalently, DGP with L = 1). In addition, for the sake of reproducibility, we provide illustrative code and demos in a Jupyter notebook at http://isp.uv.es/dgp/. The data used are available upon request.
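For orientation, the sketch below shows how a DGP model might be assembled with the Doubly-Stochastic-DGP code linked above (which is built on an earlier GPflow version). The constructor arguments follow the repository's demo notebooks, but signatures may differ between versions; treat this as an illustrative sketch rather than a definitive recipe (the toy data and all parameter values are our own choices).

```python
import numpy as np
import gpflow
from doubly_stochastic_dgp.dgp import DGP  # from the repository linked above

# toy data standing in for a real retrieval dataset
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(5000, 2))
Y = np.sin(X[:, :1]) + 0.1 * rng.randn(5000, 1)

# m inducing locations, e.g. a random subset of the inputs
Z = X[rng.choice(len(X), 100, replace=False)].copy()

# one RBF kernel per layer: here a 2-layer DGP
kernels = [gpflow.kernels.RBF(X.shape[1]), gpflow.kernels.RBF(X.shape[1])]
model = DGP(X, Y, Z, kernels, gpflow.likelihoods.Gaussian(),
            minibatch_size=256, num_samples=1)  # 1 MC sample per training step

gpflow.train.AdamOptimizer(0.01).minimize(model, maxiter=5000)
```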

Experimental results
The problem of translating radiances into state parameters is challenging because of its intrinsic high nonlinearity and underdetermination. We consider two relevant remote sensing problems that together span land and ocean applications, namely 1) predicting surface level temperature and dew point temperature from infrared sounding data, and 2) predicting chlorophyll content, inorganic suspended matter and coloured dissolved organic matter from S3-OLCI data. Both problems involve inverting a model using large datasets of different sample size and dimensionality. In the first problem we compare DGPs with (shallow) standard and sparse GPs, highlighting the benefit of going deep in the GP setting. We also illustrate the predictive power of the DGP as a function of depth and data scale. The second problem aims at comparing the proposed model to another state-of-the-art method in a challenging real application. Specifically, we compare the performance of a DGP architecture with that of a state-of-the-art neural network method described in [44].

Surface temperature and moisture from infrared sounders
Temperature and water vapour are essential meteorological parameters for weather forecasting studies. Observations from high spectral resolution infrared sounding instruments on board satellites provide unprecedented accuracy in temperature and water vapour profiles. However, it is not trivial to retrieve the full information content from radiation measurements. Accordingly, improved retrieval algorithms are desirable to achieve optimal performance for existing and future infrared sounding instrumentation. MetOp observations have an important impact on several Numerical Weather Prediction (NWP) forecasts. The Infrared Atmospheric Sounding Interferometer (IASI) is flown on board the MetOp satellite series.
In particular, IASI collects rich spectral information to derive temperature and moisture [45,46]. EUMETSAT, NOAA, NASA and other operational agencies are continuously developing product processing facilities to obtain L2 products from infrared hyperspectral radiance instruments, such as IASI, AIRS or the upcoming MTG-IRS. Nonlinear statistical retrieval methods, and in particular kernel machines and Gaussian processes, have recently proven useful in the retrieval of temperature and dew point temperature (humidity) [27,47,48]. Here we explore the use of deep Gaussian processes to retrieve surface temperature and moisture from IASI data.

Experimental results
The results of Experiment-1 are summarized in Fig. 5. We immediately see a clear difference in RMSE between the shallow models (GP-10K, FITC, DGP1) and the deep models (DGP2-4). As intuitively expected, the performance of all models improves with additional training data.
In this particular problem, it appears that the majority of the additional complex structure is learned when going from $10^4$ to $5 \times 10^4$ data points. As the DGP1 and FITC models are only approximations of the standard GP, it is to be expected that they perform worse when trained on the same amount of data as the GP-10K. Nevertheless, when allowed to leverage more data, their fit improves and outperforms the GP-10K. It is not clear which of the two approximations is superior, as it varies with the number of training data. This agrees well with the literature, where the outcome has been shown to depend on the data at hand [42]. The fact that single-layer approximations can outperform a standard GP when given enough training data underlines the importance of a model that can handle large-scale data. We can see from the results that the DGP not only handles large datasets but also allows for higher model complexity and thus a better fit of the data. Observing the performance of DGPs with different numbers of layers, we see that DGPs take advantage of their hierarchical structure and achieve lower RMSE with increasing depth. There is a considerable improvement when going from 2 to 3 layers, whereas the effect of going from 3 to 4 layers seems less significant.
We now turn to Experiment-2 for the comparison of the three different GP types, each trained according to what its computational cost allows: we compare the GP-10K with a FITC and a 3-layer DGP model, both trained on 250,000 data points. Comparing the predictions on the ${\sim}10^6$ test points (obtained from the 11 orbits shown in Fig. 4) with the ground truth, we can analyze the quality of the predictive uncertainty provided by the models.
Each model provides, for a given test point $\mathbf{x}_*$, a Gaussian predictive distribution with mean $\mu(\mathbf{x}_*)$ and variance $\sigma^2(\mathbf{x}_*)$; see Table 2. The deep models in general provide much better performance than their shallow counterparts, both due to their ability to leverage large amounts of data and to model more complex structure.
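As a side note on methodology, a minimal way to quantify both accuracy and uncertainty quality from the predictive moments (our own illustrative metrics, not necessarily those reported in Table 2) is to compute the RMSE together with the empirical coverage of the 95% predictive intervals, i.e. the fraction of test points falling within $\mu(\mathbf{x}_*) \pm 1.96\,\sigma(\mathbf{x}_*)$:

```python
import numpy as np

def evaluate(mu, var, y_true):
    """RMSE and empirical 95% coverage of Gaussian predictive intervals."""
    rmse = np.sqrt(np.mean((mu - y_true) ** 2))
    half_width = 1.96 * np.sqrt(var)
    coverage = np.mean(np.abs(y_true - mu) <= half_width)  # ideally close to 0.95
    return rmse, coverage

# mu, var: predictive means/variances on the test orbits; y_true: ground truth
# rmse, cov = evaluate(mu, var, y_true)
```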

Ocean color parameters from optical sensors
Since the first remote sensing images of the ocean were taken, ocean color retrievals have been produced regularly with more or less accuracy, depending on the water type. Recently, a large database of radiative transfer simulations was created for the training and validation of a neural net approach [44]. A subset of this dataset has already been used to test five machine learning approaches, including simple Gaussian processes, for the determination of the three basic ocean colour parameters [56].

Data collection and pre-processing
Within the framework of the Case 2 eXtreme (C2X) project [57], in-water radiative transfer simulations for the Sentinel-3 Ocean and Land Colour Instrument (OLCI) were carried out with the commercial software Hydrolight [58].
For more detail on the source of the simulations see [59,60]. In the C2X project, the results of the simulations were grouped into five subcategories, corresponding to the water types analyzed below (C1, C2A, C2AX, C2S and C2SX). One part of the S3-OLCI simulated dataset is put aside for validation purposes, with more than 4000 spectra per subcategory reserved exclusively for that purpose. The C2X dataset contains simulations in 21 bands, from which a subset of 11 bands is used here for water quality parameter estimation, as in [44]. This large dataset was used for the training and testing of the S3-OLCI Neural Network Swarm (ONNS) in-water processor [44]. ONNS is the result of blending various NN algorithms, each optimized for a specific water type, covering the largest possible variability of water properties, including oligotrophic and extreme waters. Results from the DGP approach will be compared with those achieved by ONNS as part of the validation process.

Experimental setup
In the present experiment we have selected all data available for the five categories included in the C2X dataset. In total we have $10^5$ records that we use to train and test the DGP models. As already mentioned, 11 out of the 21 S3-OLCI wavebands are selected as inputs, from 400 to 885 nm. Subsets of the data had to be used in the aforementioned works, as a standard GP cannot leverage data of the order of magnitude considered in the present paper. The DGP model was trained and validated using the same data split as the ONNS [44]: $8 \times 10^4$ training and $2 \times 10^4$ test data points, respectively.
The ONNS results will serve as the baseline for comparison; that is, we compare our results with a state-of-the-art neural network method widely accepted in the ocean colour (OC) community.

Experimental results
A 3-layer DGP with 500 inducing points and 5 GPs in each hidden layer is trained. Adding more layers was found not to increase performance significantly. This number of inducing points is frequently used in the GP literature (see e.g. [62]), and is set to deal with the higher complexity of the C2X dataset. Table 3 compares the RMSE of DGP and ONNS, dividing the test set by water type, together with the total (bottom row). We visualize the behaviour of measured against predicted values in Fig. 7.
In this figure the actual values (x-axis) vs. the DGP predictions (y-axis) are compared by variable and water type, in a similar fashion as was done in [44] with the ONNS results, with the exception of the non-log scale of our figure. In the following we make reference to model predictions in regions of low numerical value, which are better appreciated in the log-scale version of the figure located in Appendix E. Summarizing the results by water quality parameter:

• CDOM: High uncertainties and distribution dispersion in Case-1 and scattering waters (C2S(X)) for very low CDOM values (<0.2 m−1). Separating CDOM from suspended sediments using the absorption signal does not appear to be easy. The correlation improves for absorbing waters (C2A(X)) across all values, with good uncertainty ranges for high values in C2A waters, and a gradual increase for the CDOM range above 15 m−1 in extreme cases.
• ISM: Shows an almost perfect correlation for C2S and C2SX, the scattering waters where the main components are suspended sediments and non-algal particles. CDOM-dominated waters (C2A and C2AX) are not expected to have high inorganic suspended sediment content, which gives less relevance to the more dispersed and less accurate results in these water types. Some saturation is observed in absorbing waters for very low values (<0.1 g m−3), which is better appreciated in Fig. 8, as well as for C1 waters, where dispersion is in general higher; however, the latter show lower uncertainty values. This result is in line with the ONNS results, in which "the retrieval performance is less skilled if the optical signal of minerals is weak due to low mineral concentrations as is the case in oligotrophic waters (C1)" [44].

Uncertainty estimation was found to be a hard problem, as previously shown in [44]. This is likely due to the fact that the dataset exhibits high variability in regions of both low and high numeric values (spanning orders of magnitude, from $10^{-2}$ to $10^2$).
Nevertheless, the DGP shows some advantages over the ONNS approach: it flags many of the outliers with high predictive uncertainty, and provides more conservative uncertainty estimates than the ONNS, which assigns high uncertainty to predictions with low errors as well as vice versa.

Figure 7: Actual values (x-axis) versus values predicted by the DGP (y-axis) on test data, for each of the different water quality variables. The plots are divided by water type, and coloured according to the predictive uncertainty $2 \cdot \sigma^2_{\mathrm{DGP}}(\mathbf{x})$. We see that the predictive uncertainty is generally quite conservative; however, it does tend to flag the points of high prediction error with higher predictive uncertainty. This behaviour is preferable to the more erratic uncertainty estimates in previous attempts at modeling this data [44].

Conclusions
We introduced the use of deep GPs and the doubly stochastic variational inference procedure for remote sensing applications involving parameter retrieval and model inversion. The applied deep GP model can efficiently handle two of the biggest challenges nowadays: dealing with big data problems while being expressive enough to account for highly nonlinear problems. We successfully illustrated its performance in two scenarios involving simulated optical Sentinel-3 OLCI data and IASI sounding data, for different data sizes, dimensionalities, and distributions of the target biogeophysical parameters.
We showed how the DGP benefits from its hierarchical structure and consistently outperforms both full and sparse GPs on the data at hand. Depth plays a fundamental role, but the main increase in performance is achieved when going from shallow to deep Gaussian Processes, i.e. from 1 to 2 layers. Higher numbers of layers showed little improvement and a certain risk of overfitting because of model over-parameterization.
Importantly, unlike a standard GP, the DGP model is inherently sparse and scales linearly with the training set size.
We would like to stress that the DGPs used here could make a difference in the two applications introduced, now and in the near future. For instance, neural networks revolutionized the estimation of atmospheric variables from infrared sounding data in the last decade [9,10]. Later, in [11] we showed that kernel methods can outperform neural networks in these scenarios of high input and output data dimensionality, but are more computationally costly and memory demanding when bigger datasets are available.
With DGPs these shortcomings are remedied: they are more expressive and accurate than standard kernel ridge regression (i.e. one-layer plain GPs), computationally much more efficient, and additionally provide a principled way to derive confidence intervals for the predictions. The problem of estimating temperature and moisture variables was successfully addressed with DGPs, and results were more accurate over both land and ocean, and for different latitudes, climatic zones and biomes. Furthermore, the experimental results for the prediction of CDOM, ISM and Chl-a showed that a 3-layer DGP can outperform a neural network based algorithm proposed in the literature. Although uncertainty quantification is difficult, as seen in [44], it is an advantage that training a DGP automatically yields uncertainty estimates, avoiding the need to train additional uncertainty neural networks.
The DGP model has demonstrated excellent capabilities in terms of accuracy and scalability, but certainly some future improvements are needed. It does not escape our attention that, as has been shown for deep convolutional neural networks, convolutional models can improve predictions when there is clear spatial structure [63]. Currently there are some efforts in the direction of convolutional GPs [64], but their performance is still not comparable to a convolutional neural network (CNN). As shown here and in the literature, DGPs scale very well to large amounts of data, and have been trained on problems with $10^9$ data points [31]. As of now, however, feed-forward neural networks are still generally faster to train, which is not surprising as the DGP learns a predictive distribution instead of a single point estimate. Lastly, when it comes to dealing with missing data and mixed data modalities, random forest regression has often been found to be more flexible than other methods. There is, however, interesting work addressing the missing data problem for GPs [65].
The more we incorporate machine learning in the pipeline when modeling physical systems, the more important uncertainty estimation and error propagation become. It has been shown that encoding prior knowledge about input noise into a standard GP in a parameter retrieval setting improves uncertainty estimation [17]. The same approach can be envisioned for a DGP model, which in the future could further improve its uncertainty estimates.
Since the chosen family of approximate posteriors does not usually contain the exact posterior, the target of the optimization will be a lower bound on log p(y). This is the so-called Evidence Lower Bound (ELBO) [43].
The proposed family of posterior distributions in [31] is

$$q(\{\mathbf{f}^l, \mathbf{u}^l\}_{l=1}^L) = \prod_{l=1}^{L} p(\mathbf{f}^l | \mathbf{u}^l; \mathbf{f}^{l-1}, \mathbf{z}^{l-1}) \, q(\mathbf{u}^l), \qquad q(\mathbf{u}^l) = \mathcal{N}(\mathbf{u}^l | \mathbf{m}^l, \mathbf{S}^l).$$

Notice that the first factor is the prior conditional appearing in the joint model, and the second one is a free-form Gaussian on the inducing points. The resulting ELBO is

$$\mathrm{ELBO} = \sum_{i=1}^{n} \mathbb{E}_{q(f_i^L)}\left[\log p(y_i | f_i^L)\right] - \sum_{l=1}^{L} \mathrm{KL}\left(q(\mathbf{u}^l) \,\|\, p(\mathbf{u}^l; \mathbf{z}^{l-1})\right). \quad (4)$$

Observe that the second term is tractable, as the KL divergence between Gaussians can be computed in closed form [39]. However, the expectation in the first term involves the marginals of the posterior at the last layer, $q(f_i^L)$. Next we see that, whereas this distribution is analytically intractable, it can be sampled efficiently using univariate Gaussians. Specifically, after integrating out $\mathbf{u}^l$, the marginal $q(f_i^l | f_i^{l-1})$ is a Gaussian with mean $\tilde{\mu}_l(f_i^{l-1})$ and variance $\tilde{\sigma}_l^2(f_i^{l-1})$, and a sample of $q(f_i^L)$ is obtained by recursively drawing, for $l = 1, \ldots, L$,

$$\hat{f}_i^{\,l} = \tilde{\mu}_l(\hat{f}_i^{\,l-1}) + \varepsilon_i^l \cdot \tilde{\sigma}_l(\hat{f}_i^{\,l-1}), \qquad \varepsilon_i^l \sim \mathcal{N}(0, 1). \quad (6)$$
Now, the expectation in the ELBO (recall eq. (4)) can be approximated with a Monte Carlo sample generated with eq. (6). This provides the first source of stochasticity. Since the ELBO factorizes across data points and the samples can be drawn independently for each point i, scalability is achieved through sub-sampling the data in mini-batches. This is the second source of stochasticity, which motivates the naming of this doubly stochastic inference scheme.
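A schematic sketch of how one such Monte Carlo sample is generated and combined with mini-batching is given below. This is our own illustration: `layer_marginal`, `log_lik` and `kl_total` are hypothetical helpers standing for the per-layer q-marginal moments of eq. (6), the Gaussian log-likelihood, and the KL term of eq. (4), respectively.

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate_sample(x_i, layers):
    """Draw one posterior sample of f_i^L via the reparameterization trick (eq. (6))."""
    f = x_i
    for layer_marginal in layers:                       # hypothetical per-layer marginals
        mu, var = layer_marginal(f)                     # moments of q(f^l | f^{l-1})
        f = mu + np.sqrt(var) * rng.standard_normal()   # univariate Gaussian sample
    return f

def elbo_minibatch(batch, layers, log_lik, kl_total, n, num_samples=10):
    """Doubly stochastic ELBO estimate: mini-batching (source 1) + MC sampling (source 2)."""
    exp_term = 0.0
    for x_i, y_i in batch:
        samples = [propagate_sample(x_i, layers) for _ in range(num_samples)]
        exp_term += np.mean([log_lik(y_i, f) for f in samples])
    return (n / len(batch)) * exp_term - kl_total       # rescale to the full dataset
```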
The ELBO is maximized with respect to the variational parameters $\{\mathbf{m}^l, \mathbf{S}^l\}$, the inducing locations $\mathbf{z}^l$, and the kernel and likelihood hyperparameters $\theta^l, \rho^2$ (which, to alleviate the notation, have not been included in the equations). Notice that the complexity of evaluating the ELBO and its gradients is $\mathcal{O}(n_b m^2 (D_1 + \cdots + D_L))$, where $n_b$ is the size of the mini-batch and $D_l$ is the number of hidden units in each layer (which were set to one in this section). As mentioned before, this extends the scalability of the (shallow) sparse GP approximation SVGP [41] to hierarchical models, including the mini-batching capacity.

C. Predictions
To predict at a new input $\mathbf{x}_*$ with a DGP, eq. (6) is applied to propagate samples through the layers; the predictive distribution is then approximated by the resulting equally-weighted mixture of Gaussians at the last layer [31].
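A sketch of this moment-matched prediction, reusing hypothetical helpers in the spirit of Appendix B (`propagate_sample_partial` propagates through layers 1 to L−1 via eq. (6); `last_layer_moments` returns the q-marginal moments at layer L; `rho2` is the likelihood noise). The mixture mean and variance follow from the law of total variance:

```python
import numpy as np

def predict(x_star, layers, last_layer_moments, rho2, num_samples=100):
    """Approximate DGP predictive mean/variance as a mixture of Gaussians."""
    means, variances = [], []
    for _ in range(num_samples):
        f = propagate_sample_partial(x_star, layers)   # hypothetical: layers 1..L-1
        m, v = last_layer_moments(f)                   # hypothetical: q-marginal, layer L
        means.append(m)
        variances.append(v + rho2)                     # add observation noise
    means, variances = np.array(means), np.array(variances)
    pred_mean = means.mean()
    pred_var = variances.mean() + means.var()          # law of total variance
    return pred_mean, pred_var
```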

D. Climate zones and biome classification
The climate zones data were taken from the Köppen-Geiger climate classification maps.