Robust Prediction Interval estimation for Gaussian Processes by Cross-Validation method

Probabilistic regression models typically use Maximum Likelihood Estimation or Cross-Validation to fit their parameters. These methods favor solutions that fit the observations on average, but they pay no attention to the coverage and width of Prediction Intervals. We address the problem of adjusting and calibrating Prediction Intervals for Gaussian Process Regression with a robust two-step approach. First, the covariance hyperparameters are determined by a standard Cross-Validation or Maximum Likelihood Estimation method. A Leave-One-Out Coverage Probability is introduced as a metric to adjust the covariance hyperparameters so that the type II Coverage Probability reaches a nominal level. Then a relaxation method is applied to choose, among the hyperparameters that achieve the desired Coverage Probability, those that minimize the Wasserstein distance between the Gaussian distribution with the initial hyperparameters (obtained by Cross-Validation or Maximum Likelihood Estimation) and the proposed Gaussian distribution. The method yields Prediction Intervals with appropriate coverage probabilities and small widths.


Introduction
Many supervised learning approaches focus on point prediction: they produce a single value for a new point and provide no information about how far that prediction may be from the true response. This may be inadmissible, especially for systems that require risk management. An interval is crucial and offers valuable information that enables better management than a single predicted value.
The Prediction Intervals are well-known tools to provide more information by quantifying and representing the level of uncertainty associated with predictions.
One popular approach for prediction models without a predictive distribution (e.g. Random Forest or Gradient Boosting models) is the bootstrap, from the traditional bootstrap (Efron & Tibshirani, 1994; Heskes, 1997) to the improved bootstrap (Li et al., 2018). It is one of the most widely used methods (Efron & Tibshirani, 1994) for estimating empirical variances and constructing Prediction Intervals, and it is claimed to achieve good performance in some asymptotic frameworks.

Some approaches use the combinational Coverage Width-based Criterion (CWC) as a metric to identify a model's parameters, or define a loss function with a Lagrange multiplier controlling the relative importance of width and coverage. Pang et al. (2018) suggest the Receiver Operating Characteristic curve of Prediction Intervals (ROC-PI), a graphical indicator that serves as a trade-off between interval width and CP for identifying the best parameters.
Unlike Ensemble methods or Neural Networks, there exist several prediction models with a probabilistic framework like the Gaussian Processes (GP) model (Rasmussen & Williams, 2005) which are able to compute an efficient predictor with associated uncertainty. These models are more suitable for uncertainty quantification. They provide a predictive distribution with both point prediction and interval estimation and do not require any empirical approach such as the bootstrap. In most cases, the predictive distributions of GP models are obtained either with a plug-in method that takes the Maximum Likelihood Estimator (MLE) (Mardia & Marshall, 1984;Stein, 1999) of the model's hyperparameters or by using a Full-Bayesian approach that takes into account the posterior distribution of the hyperparameters and propagates it into the predictive distribution. However, both methods suffer from some limitations. Indeed, the MLE approach works well only when the model is well-specified and may fail in case of model misspecification (Bachoc, 2013b). At the same time, the Full-Bayesian approach is very complex to implement, typically with a Markov chain Monte Carlo (MCMC) algorithm and is sensitive to the choice of the prior distribution of the hyperparameters (Filippone et al., 2013;Muré, 2018). On the other hand, the calibration of Prediction Intervals is little studied in the literature. Lawless & Fredette (2005) proposed a frequentist approach to predictive distribution to build and calibrate the Prediction Intervals. However, to the best of our knowledge, this approach has not yet been extended to models with a predictive distribution and we do not have any guarantees that it can work in the case of a misspecified model. Furthermore, improving the modelling of the covariance function seems to be efficient in overcoming the issue of a misspecified model. 
However, it may lead to complex covariance models and, consequently, severe difficulties in estimating the covariance function's hyperparameters, especially in high dimensions. Moreover, sometimes, it is challenging to find proper modelling without further knowledge of the system and the sources of uncertainty. In this work, we propose a method based on Cross-Validation (CV) on the GP model to address the problem of model misspecification and calibrate Prediction Intervals by adjusting the upper and lower bounds to satisfy the desired level of CP. The method gives Prediction Intervals with appropriate coverage probabilities and small widths.
The paper is organized as follows. Section 2 formulates the problem of Prediction Intervals estimation. Section 3 introduces the Gaussian Process regression model and its training methods. In Section 4, we present a method for estimating robust Prediction Intervals supported by theoretical results. We show in Section 5 the application of this method to academic examples and to an industrial example. Finally, we present our conclusions in Section 6.

Problem formulation
We consider n observations of an empirical model or computer code f. The inputs are denoted by X = (x⁽¹⁾, …, x⁽ⁿ⁾), with x⁽ⁱ⁾ = (x⁽ⁱ⁾₁, …, x⁽ⁱ⁾_d) ∈ D ⊂ Rᵈ. The outputs are denoted by y = (y⁽¹⁾, …, y⁽ⁿ⁾) ∈ Rⁿ with y⁽ⁱ⁾ = f(x⁽ⁱ⁾). We seek to estimate the unobserved function x ∈ D ↦ f(x) from the data (X, y) and make accurate predictions with the associated uncertainty.
Formally, let us assume that f is a realization of a random process Y, let Y(x) be the value of the model output at a point x ∈ D, and let α ∈ (0, 1) be such that 1 − α is the nominal confidence level. We wish to estimate the interval PI_{1−α} with respect to the type II CP (the conditional Coverage Probability given the training set) such that this probability is as close as possible to 1 − α. In most cases, PI_{1−α} is a two-sided interval delimited by two bounds at x ∈ D: the lower bound y_{α/2}(x) = ỹ(x) + z_{α/2} σ̃(x) and the upper bound y_{1−α/2}(x) = ỹ(x) + z_{1−α/2} σ̃(x), where ỹ(x) and σ̃²(x) are the predictive mean and variance and z_{α/2}, z_{1−α/2} are the corresponding quantiles of the standardized predictive distribution.
In the framework of kriging, the prior distribution of the process Y is Gaussian, characterized by its mean and covariance. The Cumulative Distribution Function (CDF) of the predictive variable Y(x) given X and y is well-defined and continuous, since the predictive distribution is Gaussian. The quantile function is then defined as the inverse of the CDF, and the quantiles z_{α/2} and z_{1−α/2} are fully characterized. Thus, estimating the interval PI_{1−α} in equation (2) amounts to estimating its two bounds.

Modelling with Gaussian Processes
We use the GP model to learn the unobserved function f. It is a Bayesian non-parametric regression (see Tipping (2004) for Bayesian inference) which places a GP prior over the regression functions; this prior is converted into a posterior over functions once some data has been observed. In the kriging framework (Rasmussen & Williams, 2005; Stein, 1999), Y is assumed a priori to be a GP with mean µ(x) and covariance function k(x, x′) + σ² 1{x = x′} for all x, x′ ∈ D, where σ² ≥ 0 is the variance of the measurement error, also called the nugget effect.

The mean and covariance functions
The assumption made on the existing knowledge of the model Y and of the mean function µ defines three sub-cases of kriging:
• The Simple Kriging: µ is assumed to be known, usually null (µ = 0).
• The Ordinary Kriging : µ is assumed to be constant but unknown.
• The Universal Kriging: µ is assumed to be of the form µ(x) = Σ_{j=0}^{p−1} β_j f_j(x), with known basis functions f_j (j = 0, …, p − 1) and unknown scalar coefficients β_j.
The covariance function k is a symmetric positive semi-definite map, usually stationary: k(x, x′) = r(x − x′). The most commonly used kernel class in R is the Matérn class, given by

k(x, y) = σ² (2^{1−ν} / Γ(ν)) (√(2ν) |x − y| / θ)^ν K_ν(√(2ν) |x − y| / θ)   for x, y ∈ R.

Here σ² > 0 is the amplitude, θ > 0 is the length-scale, Γ is the complete Gamma function and K_ν is the modified Bessel function of the second kind. (σ², θ) are called hyperparameters. Particular cases of the Matérn kernel are ν = 1/2 (Exponential), ν = 3/2 (Matérn 3/2), ν = 5/2 (Matérn 5/2) and ν → ∞ (Gaussian or Squared-Exponential).
The choice of kernel is important in the kriging scheme and requires prior knowledge of the smoothness of the function f. For example, the Gaussian kernel assumes that the function is very smooth, of class C^∞ (infinitely differentiable), which is often too strict a condition. A common alternative is the Matérn 5/2 or Matérn 3/2 kernel. It is possible to build high-dimensional covariance models in Rᵈ from classical kernels in R. In particular, the Matérn anisotropic geometric model (radial model), which we consider in the rest of this paper, is defined as

k_{σ²,θ}(x, x′) = σ² r( ( Σ_{j=1}^{d} ((x_j − x′_j)/θ_j)² )^{1/2} ),

where r is a Matérn correlation kernel in R as defined in (3) and θ = (θ₁, …, θ_d) is the length-scale vector. The described method can be applied to other forms of covariance models, like the tensorized product model with d-dimensional kernels (as a product of kernels is also a kernel) or the Power-Exponential model. In the following sections, we simply write k instead of k_{σ²,θ} when there is no possible confusion.
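As an illustration, the Matérn class and the anisotropic geometric (radial) construction can be sketched in Python. This is a minimal sketch of ours (function names and signatures are not from the paper), using SciPy's Gamma and modified Bessel functions:

```python
import numpy as np
from scipy.special import gamma, kv

def matern(h, sigma2=1.0, theta=1.0, nu=1.5):
    """1-D Matérn kernel: sigma2 * 2^(1-nu)/Gamma(nu) * t^nu * K_nu(t),
    with t = sqrt(2 nu) |h| / theta; the limit value at h = 0 is sigma2."""
    h = np.abs(np.asarray(h, dtype=float))
    out = np.full(h.shape, sigma2)
    nz = h > 0
    t = np.sqrt(2.0 * nu) * h[nz] / theta
    out[nz] = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * t**nu * kv(nu, t)
    return out

def matern_aniso(x, xp, sigma2=1.0, theta=None, nu=1.5):
    """Anisotropic geometric (radial) model: the 1-D correlation applied to
    the length-scale-weighted Euclidean distance between x and xp."""
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    theta = np.ones(x.shape[-1]) if theta is None else np.asarray(theta, float)
    dist = np.sqrt(np.sum(((x - xp) / theta) ** 2, axis=-1))
    return matern(dist, sigma2=sigma2, theta=1.0, nu=nu)
```

A quick sanity check of such an implementation is the ν = 1/2 special case, where the kernel must reduce to the exponential kernel σ² exp(−|h|/θ).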

Gaussian Process Regression Model
The prior distribution of Y on the learning experimental design X is multivariate Gaussian, Y(X) ∼ N(Fβ, K), where:
• β = (β₁, …, β_p) ∈ Rᵖ are the regression coefficients.
• F ∈ R^{n×p} is the matrix of regression functions evaluated at the design points.
• K = (k(x⁽ⁱ⁾, x⁽ʲ⁾))_{1≤i,j≤n} + σ² I_n ∈ R^{n×n} is the covariance matrix of the learning design X.
Hypothesis H1: In the case of Ordinary or Universal Kriging, we assume that n ≥ p, F is a full-rank matrix, and e ∈ Im F, where e = (1, …, 1)ᵀ.

Remark 1.
In Ordinary Kriging, the hypothesis H1 is always satisfied. In Universal Kriging, the hypothesis e ∈ Im F is satisfied as soon as the constant function f₀(x) = C is included in the chosen family of functions f_j.

Prediction
The Gaussian conditioning theorem is used to deduce the posterior distribution. Considering a new point x_new, it can be shown that the predictive distribution of Y(x_new) conditioned on the learning sample (X, y) is also Gaussian. In the case of Ordinary or Universal Kriging, denoting f_trend(x_new) = (f_j(x_new))_{j=0}^{p−1}, the predictive mean ỹ(x_new) and variance σ̃²(x_new) are given by the Best Linear Unbiased Predictor (BLUP). We refer to Santner et al. (2003), Chapter 4, for a detailed proof of Equations (6-9). In particular, we note that the additional term in the predictive variance (8) is due to the propagation of the non-informative improper prior distribution on the estimation of β.
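The conditioning step is easiest to see in the simple-kriging special case (known zero mean, no trend term), which drops the β-related term of the variance. A minimal sketch of ours, with the kernel passed as a callable and the nugget added on the diagonal:

```python
import numpy as np

def gp_posterior(X, y, Xnew, kern, nugget=0.0):
    """Simple-kriging posterior: mean k*^T K^{-1} y and
    variance k(x, x) - k*^T K^{-1} k* at each new point."""
    K = kern(X, X) + nugget * np.eye(len(X))   # covariance of the design
    Ks = kern(X, Xnew)                         # cross-covariances
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    mean = Ks.T @ alpha
    var = np.diag(kern(Xnew, Xnew)) - np.sum(v**2, axis=0)
    return mean, var
```

Without a nugget the posterior mean interpolates the observations and the posterior variance vanishes at the design points, which is a convenient check of any implementation.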
The most outstanding advantage of the GP model compared to other models lies in the previous equations (7) and (8). The predictive distribution can be used for sensitivity analysis (Oakley & O'Hagan, 2004) and uncertainty quantification instead of costly methods based on Monte Carlo algorithms. Other possible considerations and extensions of GP modelling are described in (Currin et al., 1991; Rasmussen & Williams, 2005).
Given a GP regression model and a point x_new ∈ D, the posterior distribution of the prediction in (6) can be standardized into

Z̃(x_new) = (Y(x_new) − ỹ(x_new)) / σ̃(x_new) | X, y ∼ N(0, 1).

For the standardized variable Z̃(x_new), the α-quantiles are those of the standard normal distribution, q_{1−α/2} = Φ^{−1}(1 − α/2) and q_{α/2} = Φ^{−1}(α/2), such that the Prediction Interval PI_{1−α} in (2) can be written as

PI_{1−α}(x_new) = [ỹ(x_new) + q_{α/2} σ̃(x_new), ỹ(x_new) + q_{1−α/2} σ̃(x_new)],

which gives a natural definition for y_{α/2} and y_{1−α/2}.
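Computing the interval from the predictive moments is then immediate; a small helper of ours using SciPy's normal quantile function:

```python
import numpy as np
from scipy.stats import norm

def prediction_interval(mean, var, alpha=0.10):
    """Central (1 - alpha) PI: [mean + q_{alpha/2} sd, mean + q_{1-alpha/2} sd]."""
    sd = np.sqrt(np.asarray(var, dtype=float))
    lower = mean + norm.ppf(alpha / 2.0) * sd
    upper = mean + norm.ppf(1.0 - alpha / 2.0) * sd
    return lower, upper
```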

Training model with Maximum Likelihood method
Constructing a GP model and computing the kriging mean and variance as shown in (7) and (8) requires estimating the nugget effect and the covariance hyperparameters (σ², θ). Here, we assume that the nugget variance is known or has been estimated, for instance by the method proposed in Iooss & Marrel (2017).
The Maximum Likelihood Estimators σ̂²_ML and θ̂_ML of σ² and θ are obtained by maximizing the likelihood of the observations (Santner et al., 2003):

(σ̂²_ML, θ̂_ML) ∈ argmax_{(σ², θ)} p(y | σ², θ).

The MLE method is optimal when the covariance function is well-specified (Bachoc, 2013a), i.e. when the data y come from a function f that is a realization of a GP whose covariance function belongs to the family of covariance functions of Section 3.1.
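A minimal illustration of this plug-in step for a 1-D, zero-mean GP with an exponential kernel: the negative log marginal likelihood is minimized in log-parameter space to keep (σ², θ) positive. This sketch is ours; the paper's setting (trend, anisotropy) is more general:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(log_params, X, y, nugget=1e-6):
    """Negative log marginal likelihood for a zero-mean GP with
    exponential kernel sigma2 * exp(-|x - x'| / theta)."""
    sigma2, theta = np.exp(log_params)
    K = sigma2 * np.exp(-np.abs(X[:, None] - X[None, :]) / theta)
    K += nugget * np.eye(len(X))                  # jitter for stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

def fit_mle(X, y):
    res = minimize(neg_log_lik, x0=np.zeros(2), args=(X, y), method="Nelder-Mead")
    return np.exp(res.x)   # (sigma2_ML, theta_ML)
```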
However, there is no guarantee that the MLE method performs optimally in general, as it is poorly robust with respect to model misspecification. Besides, training and assessing the quality of a predictor should not be done on the same data (Hastie et al. (2009), Chapter 7). In particular, the MLE method does not show how well the model will do when asked to make new predictions for data it has not already seen. The CV method represents an alternative for estimating the covariance hyperparameters (σ², θ) for prediction purposes (Zhang & Wang, 2010; Bachoc, 2013a) and has the advantage of being more efficient and robust when the covariance function is misspecified (Bachoc, 2013a).

Training model with Cross-Validation method for point-wise prediction
We consider the same learning set of n observations D_learn = (X, y) = {(x⁽ⁱ⁾, y⁽ⁱ⁾), i ∈ {1, …, n}} and we assume that the value of the nugget σ² ∈ [0, +∞) is known. The Leave-One-Out method (i.e. n-fold Cross-Validation) consists in predicting y⁽ⁱ⁾ by building a GP model, denoted GP₋ᵢ, trained on D₋ᵢ = {(x⁽ʲ⁾, y⁽ʲ⁾)}_{j ∈ {1,…,n}\{i}}. The obtained predictive mean and variance are functions of the hyperparameters (σ², θ), as shown in (7) and (8), and are used to assess the predictive capability of the global GP model.
In the Leave-One-Out method, the Mean Squared prediction Error (MSE) is used to assess the quality of the point-wise prediction of the GP model (see Wallach & Goffinet (1989) for more details about this metric); it can be expressed as

MSE_LOO(σ², θ) = (1/n) Σ_{i=1}^{n} (y⁽ⁱ⁾ − ỹᵢ)²,   (14)

where ỹᵢ and σ̃²ᵢ are the Leave-One-Out predictive mean and variance of f(x⁽ⁱ⁾) by a GP model trained on D₋ᵢ with the hyperparameters (σ², θ).
Hypothesis H2: Let (eᵢ)_{i=1}^{n} be the canonical basis of Rⁿ. We assume that eᵢ ∉ Im F for all i ∈ {1, …, n}.
Let K̄ be the matrix defined by

K̄ = K⁻¹ − K⁻¹ F (Fᵀ K⁻¹ F)⁻¹ Fᵀ K⁻¹.

For all i ∈ {1, …, n}, we have K̄_{i,i} > 0 by Lemma 3 (see Appendix A) and, in the case of Ordinary or Universal Kriging, the Virtual Cross-Validation formulas (Dubrule, 1983) of the predictive mean ỹᵢ and variance σ̃²ᵢ are given by

y⁽ⁱ⁾ − ỹᵢ = (K̄y)ᵢ / K̄_{i,i},   σ̃²ᵢ = 1 / K̄_{i,i}.

In the presence of the nugget effect, the GP regressor does not interpolate the training data y but approximates them as well as possible. The Leave-One-Out method looks for the best approximation by minimizing the MSE_LOO criterion. The criterion (14) can be written with explicit quadratic forms in y:

MSE_LOO(σ², θ) = (1/n) yᵀ K̄ diag(K̄)⁻² K̄ y,

where diag(K̄) denotes the diagonal matrix with the same diagonal as K̄. Note that, in the absence of the nugget effect, K̄ is of the form σ⁻² R̄_θ, where R̄_θ does not depend on σ². The predictive variance σ̂²_MSE can then be computed by the following explicit quadratic form (Bachoc, 2013a):

σ̂²_MSE(θ) = (1/n) yᵀ R̄_θ diag(R̄_θ)⁻¹ R̄_θ y,

and the optimal length-scale vector θ̂_MSE is obtained by solving

θ̂_MSE ∈ argmin_θ MSE_LOO(σ̂²_MSE(θ), θ).
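The Virtual Cross-Validation formulas quoted above from Dubrule (1983) avoid refitting n models. A direct sketch of ours (the helper names are hypothetical), which can be checked against an explicit ordinary-kriging leave-one-out computation:

```python
import numpy as np

def virtual_loo(K, F, y):
    """Virtual Cross-Validation (Dubrule, 1983):
    Kbar = K^{-1} - K^{-1} F (F^T K^{-1} F)^{-1} F^T K^{-1};
    LOO residual y_i - ytilde_i = (Kbar y)_i / Kbar_ii, variance 1 / Kbar_ii."""
    Ki = np.linalg.inv(K)
    KiF = Ki @ F
    Kbar = Ki - KiF @ np.linalg.solve(F.T @ KiF, KiF.T)
    d = np.diag(Kbar)
    resid = (Kbar @ y) / d
    return y - resid, 1.0 / d          # LOO means and variances

def loo_mse(K, F, y):
    """LOO MSE, equal to the quadratic form (1/n) y^T Kbar diag(Kbar)^{-2} Kbar y."""
    mean, _ = virtual_loo(K, F, y)
    return np.mean((y - mean) ** 2)
```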

Full-Bayesian approach
In this subsection, we consider the full-Bayesian treatment of GP models (Williams & Barber, 1998). The full-Bayesian approach integrates over the uncertainty about the unknown hyperparameters by assuming a prior (σ², θ) ∼ p(σ², θ). Consequently, the probability density of Y(x_new) can be expressed as an integral over the hyperparameters (we omit the conditioning on the inputs X and x_new):

p(y_new | y) = ∫∫ p(y_new | y, σ², θ) p(σ², θ | y) dσ² dθ,

where p(y_new | y, σ², θ) is the pdf of Y(x_new) given y, σ² and θ, and p(σ², θ | y) ∝ p(y | σ², θ) p(σ², θ) is the hyperparameters' posterior distribution.
The implementation of the full-Bayesian approach requires the evaluation of the previous integral and the posterior p(σ², θ | y), which cannot be computed directly. It is common to use Markov chain Monte Carlo (MCMC) methods for sampling and inference from the posterior distribution of the hyperparameters to overcome this issue, using, in particular, the Metropolis-Hastings (MH) algorithm (Robert & Casella, 2004) or Hamiltonian Monte Carlo (HMC) (Neal, 1993, 1996).
Therefore, the predictive distribution is obtained by Monte Carlo:

p(y_new | y) ≈ (1/N) Σ_{i=1}^{N} p(y_new | y, σ²ᵢ, θᵢ),

where N denotes the MCMC sample size and (σ²ᵢ, θᵢ) is the i-th sample drawn from the posterior distribution p(σ², θ | y). We compute the predictive distribution (6) for each i = 1, …, N and build the Prediction Intervals PI_{1−α} by taking the empirical quantiles of order α/2 and 1 − α/2. Note that the plug-in approaches (e.g. the MLE method in 3.4) consider (21) and replace p(σ², θ | y) by a Dirac distribution centered on a value such as (σ̂²_ML, θ̂_ML) that maximizes the likelihood function.
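Once the per-draw Gaussian predictive moments are available, the mixture quantiles can be approximated by sampling. A minimal sketch of ours (the function name and inputs are hypothetical; the per-draw moments would come from (6) evaluated at each MCMC sample):

```python
import numpy as np

def full_bayes_pi(pred_means, pred_vars, alpha=0.10, n_draw=1000, rng=None):
    """PI from the Monte Carlo mixture: pred_means[i], pred_vars[i] are the
    Gaussian predictive moments under the i-th posterior draw (sigma2_i, theta_i)."""
    rng = np.random.default_rng(0) if rng is None else rng
    pred_means = np.asarray(pred_means, float)
    sd = np.sqrt(np.asarray(pred_vars, float))
    # draw n_draw values from each mixture component, then pool them
    draws = rng.normal(pred_means[:, None], sd[:, None], size=(len(pred_means), n_draw))
    return np.quantile(draws, [alpha / 2.0, 1.0 - alpha / 2.0])
```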

Prediction Intervals estimation for Gaussian Processes
Using the Cross-Validation method, the MSE hyperparameters (σ̂²_MSE, θ̂_MSE) are obtained from a point-wise prediction metric and focus neither on Prediction Intervals nor on quantifying the uncertainty of the model. For these purposes, using the CP is more appropriate.
The Coverage Probability (CP) is defined as the probability that the Prediction Interval procedure will produce an interval that captures what it is intended to capture (Hong et al., 2009). In the Leave-One-Out framework, we keep the notations ỹᵢ and σ̃²ᵢ for the predictive mean and variance at x⁽ⁱ⁾ ∈ X using the learning set D₋ᵢ = {(x⁽ʲ⁾, y⁽ʲ⁾)}_{j ∈ {1,…,n}\{i}}. We then define the Leave-One-Out CP P̂_{1−α} as the proportion of observed values y⁽ⁱ⁾ belonging to the Prediction Intervals:

P̂_{1−α} = (1/n) Σ_{i=1}^{n} 1{ỹᵢ + q_{α/2} σ̃ᵢ ≤ y⁽ⁱ⁾ ≤ ỹᵢ + q_{1−α/2} σ̃ᵢ},

where q_a is the a-quantile of the standard normal distribution and 1{A} is the indicator function of A. We introduce the Heaviside step function h(x) = 1{x ≥ 0}, so that the Leave-One-Out CP P̂_{1−α} can be written as

P̂_{1−α} = (1/n) Σ_{i=1}^{n} h(q_{1−α/2} − (y⁽ⁱ⁾ − ỹᵢ)/σ̃ᵢ) − (1/n) Σ_{i=1}^{n} h(q_{α/2} − (y⁽ⁱ⁾ − ỹᵢ)/σ̃ᵢ).   (25)

When the model is well-specified, the coverage of the Prediction Intervals PI_{1−α} is optimal: the predictive distribution is fully characterized by the Gaussian distribution (see Section 3.1), and each term on the right-hand side of (25) is an unbiased estimator of the corresponding probability. Conversely, if the model is misspecified, each predictive quantile needs to be quantified properly with respect to the normal distribution quantile so that the CP described in Section 2 achieves the desired level.
Let a ∈ (0, 1/2) ∪ (1/2, 1) describe a nominal quantile level. We define the quasi-Gaussian proportion ψ_a as a map from [0, +∞) × (0, +∞)ᵈ to [0, 1]:

ψ_a(σ², θ) = (1/n) Σ_{i=1}^{n} h(q_a − (y⁽ⁱ⁾ − ỹᵢ)/σ̃ᵢ),

where ỹᵢ and σ̃ᵢ are the predictive mean and standard deviation at x⁽ⁱ⁾ using the learning set D₋ᵢ and the hyperparameters (σ², θ). Given the Virtual Cross-Validation formulas (Dubrule, 1983), ψ_a can be written in terms of the covariance matrix. The quasi-Gaussian proportion ψ_a describes how close the proportion of standardized residuals below the a-quantile q_a is to the level a (ideally, it should equal a). Therefore, the objective is to fit the hyperparameters (σ², θ) according to the quasi-Gaussian proportions, that is, to find two pairs (σ²_u, θ_u) and (σ²_l, θ_l) such that ψ_{1−α/2}(σ²_u, θ_u) = 1 − α/2 and ψ_{α/2}(σ²_l, θ_l) = α/2. This allows us to reach the optimal Leave-One-Out CP by respecting the nominal confidence level 1 − α.
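Both the quasi-Gaussian proportion and the Leave-One-Out coverage are simple functions of the standardized LOO residuals; a short sketch of ours (function names are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def quasi_gaussian_proportion(y, loo_mean, loo_sd, a):
    """psi_a: fraction of standardized LOO residuals at or below q_a
    (it should be close to a when the model is well-specified)."""
    return np.mean((y - loo_mean) / loo_sd <= norm.ppf(a))

def loo_coverage(y, loo_mean, loo_sd, alpha=0.10):
    """Leave-One-Out coverage: fraction of y inside the central (1 - alpha) PI."""
    z = (y - loo_mean) / loo_sd
    return np.mean((z >= norm.ppf(alpha / 2)) & (z <= norm.ppf(1 - alpha / 2)))
```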

Presence of nugget effect
In this section, we assume σ² > 0. The quasi-Gaussian proportion ψ_a is, however, piecewise constant and can take values only in the finite set {k/n, k ∈ {0, …, n}}. We first need to regularize the problem ψ_a(σ², θ) = a: we replace the Heaviside function h by a continuous approximation, h⁺_δ if a > 1/2 and h⁻_δ if a < 1/2. Let δ > 0 be small enough so that δ < q_a if a > 1/2 (respectively, δ < q_{1−a} if a < 1/2), in such a way that h⁺_δ(q_a) = 1 (respectively, h⁻_δ(q_a) = 0). We consider the regularized problem ψ⁽ᵟ⁾_a(σ², θ) = a and we denote by A_{a,δ} the solution set of the problem (33). Hypothesis H3: we assume that k < na if a > 1/2 and k > na if a < 1/2.

Remark 2. The hypothesis H3 is typically satisfied in Ordinary and Universal Kriging. Indeed, Π is the projection on the space (Im F)⊥ and is expected to remove the trend of the model, so it is reasonable to think that Πy is centered and that the condition continues to hold for smaller values of σ².

Proof. In Appendix A.
The challenge now is to identify and wisely choose the optimal solutions (σ²_opt, θ_opt) ∈ A_{a,δ}. Some authors (Khosravi et al., 2010) suggest the Mean Prediction Interval Width (MPIW) of the Prediction Intervals PI_{1−α} as an additional constraint to reduce the set of solutions. However, this constraint may not work when dealing with quantile estimation, because the lower bound of the corresponding interval may be infinite.
Instead, we will compare these solutions with the MLE solution (σ̂²_ML, θ̂_ML) (subsection 3.4) or the MSE-CV solution (σ̂²_MSE, θ̂_MSE) (subsection 3.5) and take the closest pair (σ²_opt, θ_opt) by using an appropriate notion of similarity between multivariate Gaussian distributions. Ideally, we aim to solve the following problem (37), where d is a continuous similarity measure of the hyperparameters (σ², θ), operating on the mean vector m and covariance matrix K, and (σ²₀, θ₀) is a solution of (13) or (18). The resolution of the problem (37) may be too costly and heavy when the dimension is high, say d ≥ 10. An alternative is to apply a relaxation method, where we reduce this optimization problem over θ from (0, +∞)ᵈ to (0, +∞) by rescaling the length-scale vector θ₀ by a parameter λ ∈ (0, +∞).
Let θ₀ denote a solution of the problems (13) or (18) and, for λ ∈ (0, +∞), let λθ₀ be the rescaled length-scale vector.

Hypothesis H4: The set-valued mapping (the so-called correspondence function) H_δ : (0, +∞) → P((0, +∞)), where P(S) denotes the power set of a set S, defined by H_δ(λ) = {σ² ∈ (0, +∞) : ψ⁽ᵟ⁾_a(σ², λθ₀) = a}, is non-empty valued.

In the kriging framework, σ² should be as small as possible to reduce the uncertainty of the model, so a natural choice of σ²_opt is

σ²_opt(λ) = min H_δ(λ).

Concerning the choice of d, one known similarity measure between probability distributions is the Wasserstein distance, widely used in optimal transportation problems (see Chapter 6 of Villani (2009) for more details). In the case of two Gaussian distributions N(m₁, K₁) and N(m₂, K₂), the second Wasserstein distance is equal to

d₂(N(m₁, K₁), N(m₂, K₂))² = ‖m₁ − m₂‖² + Tr(K₁ + K₂ − 2 (K₁^{1/2} K₂ K₁^{1/2})^{1/2}),

where, in our setting, (m₁, K₁) and (m₂, K₂) are the mean vectors and covariance matrices of the GP models with hyperparameters (σ²₀, θ₀) and (σ², λθ₀), and we define the similarity measure d as this second Wasserstein distance. The choice of the second Wasserstein distance d₂ and of σ²_opt makes the Prediction Intervals PI_{1−α} shorter without the need for an additional metric like the MPIW and without significantly modifying the distribution of the obtained model. We will see in Section 5.2 that, empirically, the barycenters of the Prediction Intervals are not far from the predictive means obtained by the MLE or MSE-CV methods.
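The closed form of the second Wasserstein distance between Gaussians is easy to implement; a sketch of ours using SciPy's matrix square root:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared(m1, K1, m2, K2):
    """Squared 2-Wasserstein distance between N(m1, K1) and N(m2, K2):
    ||m1 - m2||^2 + Tr(K1 + K2 - 2 (K1^{1/2} K2 K1^{1/2})^{1/2})."""
    m1, K1 = np.asarray(m1, float), np.asarray(K1, float)
    m2, K2 = np.asarray(m2, float), np.asarray(K2, float)
    s1 = sqrtm(K1)
    cross = np.real(sqrtm(s1 @ K2 @ s1))   # real part guards against round-off
    delta = m1 - m2
    return float(delta @ delta + np.trace(K1 + K2 - 2.0 * cross))
```

In one dimension this reduces to (m₁ − m₂)² + (s₁ − s₂)², a convenient check.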
The relaxed optimisation problem in (33) for the quantile estimation is given by the problem P_λ:

min_{λ ∈ (0, +∞)} L(λ), with L(λ) = d((σ²_opt(λ), λθ₀), (σ²₀, θ₀)).

Proposition 3. Under hypotheses H1 to H4, the function L : (0, +∞) → R⁺ is continuous and coercive on (0, +∞). The problem P_λ admits at least one global minimizer λ* in (0, +∞).

Proof. See Appendix A.
Remark 3. The coercivity of the function L is guaranteed by the hypotheses H1 to H3 (see Appendix A). The function L is also upper semi-continuous (Zhao, 1997).

Let β_opt denote the corresponding regression parameter. The purpose of this resolution is to create a GP model with hyperparameters (β_opt, σ²_opt(λ*), λ*θ₀) able to predict the quantile ỹ_a such that a proportion a of true values lies below ỹ_a, with respect to the constraint of quasi-Gaussian proportion ψ_a (see Figure 1). Finally, the Prediction Intervals PI_{1−α} will be obtained using two GP models built with the same method: one for the upper quantile 1 − α/2 with its optimal relaxation parameter λ*, and the other for the lower quantile α/2 with its own parameter λ*. The CP of PI_{1−α} is optimal and ensured by respecting the coverage of each quantile as shown in (25). In the following, we call this method Robust Prediction Intervals Estimation (RPIE).
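Since P_λ is one-dimensional, it can be solved in practice by a grid search over λ. The skeleton below is our sketch; the two problem-specific ingredients are passed as callables (a routine returning the smallest feasible σ² for a given λ, or None when the constraint set H_δ(λ) is empty, and the Wasserstein-based loss):

```python
import numpy as np

def rpie_search(lambdas, sigma2_opt, loss):
    """Grid search for the relaxed problem P_lambda (sketch).
    sigma2_opt(lam): smallest feasible sigma2 for length-scale lam * theta0,
                     or None when H_delta(lam) is empty;
    loss(sigma2, lam): distance to the initial (MLE or MSE-CV) model."""
    best = None
    for lam in lambdas:
        s2 = sigma2_opt(lam)
        if s2 is None:
            continue                    # lam outside the domain of L
        value = loss(s2, lam)
        if best is None or value < best[0]:
            best = (value, lam, s2)
    return best                         # (L(lam*), lam*, sigma2_opt(lam*))
```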

Absence of nugget effect
When the nugget effect is null (σ² = 0), the set of solutions A_{a,δ} is still non-empty because one can show that, for θ in the neighborhood of 0 ∈ Rᵈ, the problem ψ⁽ᵟ⁾_a(σ², θ) = a has a solution σ² ∈ (0, +∞) (see Appendix B). In particular, the correspondence function H_δ is non-empty valued for λ > 0 small enough, and it may be empty-valued for some large λ ∈ (0, +∞). We may expect, however, that H_δ is non-empty valued and that σ²_opt(λ) exists for λ close to one. Indeed, assume for a while that the model is well-specified, that is, there exist hyperparameters (β*, σ²*, θ*) such that y corresponds to a realization of a random vector Y ∼ N(Fβ*, σ²* R_{θ*}). The existence of H_δ(λ) and σ²_opt(λ) depends on the condition k_λ ≤ na, where k_λ is the integer appearing in Hypothesis H3 for the length-scale λθ₀. Since the projected vector is centered under this model, we can anticipate that k_λ is close to n/2. Hence, the condition n/2 < k_λ ≤ na should be satisfied in a neighborhood of λ = 1, since θ₀ should be close to θ*. Finally, even though the function L is not defined on the whole of (0, +∞), we can solve (42) by a grid search method on its domain.

Test cases with analytical functions
In this section, we give three numerical examples to illustrate Prediction Interval estimation by the RPIE method. We show that for the Wing-Weight function, the model is well-specified, as the CP is optimal for different levels; hence no robust calibration of Prediction Intervals is required. However, for the Zhou (1998) and Morokoff & Caflisch (1995) functions, where the model is misspecified, and for a given confidence level α, we apply the RPIE method as described in Section 4 to estimate both the upper and lower bounds of the Prediction Intervals. The following metrics are reported for the MLE, MSE-CV, Full-Bayesian and RPIE methods: the Leave-One-Out CP P̂_{1−α} defined in (25), the accuracy Q², and metrics quantifying the goodness of the Prediction Intervals (mean and standard deviation of the interval widths). Q² is used for point-wise prediction comparison; it will be given in some cases for information and does not represent the main metric of this section.

Example 1: Well-specified model - Wing-Weight function -
The components xᵢ are assumed to vary over the ranges given in Table 1 (see Forrester et al. (2008) and Moon (2010) for details). We create an experimental design X of n = 600 observations and d = 10 input variables. We generate the response y = (y⁽¹⁾, …, y⁽ⁿ⁾) such that y⁽ⁱ⁾ = f(x⁽ⁱ⁾) + ε⁽ⁱ⁾, with f defined in (51) and ε⁽ⁱ⁾ sampled i.i.d. with the distribution N(0, σ² = 25). Here the nugget effect is estimated with the methodology described in Iooss & Marrel (2017) and the covariance kernel is the Matérn 3/2.
The GP model is trained on 75% of the data (25% of data is left for testing).
The diagnostics of the model are presented in Table 2 with the metrics described above. The accuracy Q² is moderate for the MLE and Full-Bayesian methods. The MSE-CV does much better, an expected result since the MSE-CV method is more adapted to a point-wise prediction criterion. However, the Leave-One-Out CP P̂_{1−α} for two different levels α = 5%, 10% is far from the required level, which means that the Prediction Intervals were poorly estimated with a point-wise prediction criterion.
In addition, Table 2 shows in particular that the model is well-specified for the Matérn 3/2 correlation kernel with the MLE method, since the CPs are optimal and close to the required level. This claim is empirical and can be verified either by comparing the standardized predictive distribution with the standard normal distribution, as in Figure 1, or by using the Shapiro & Wilk (1965) normality test.

Example 2: Misspecified model with noise - Morokoff & Caflisch function -
In Example 2, we consider an experimental design X of n = 600 observations and d = 10 correlated inputs. Each observation has the form x⁽ⁱ⁾ = Φ(z⁽ⁱ⁾), where Φ is the CDF of the standard normal distribution applied componentwise, z⁽ⁱ⁾ are sampled from the multivariate distribution N(0, C), and C ∈ R^{d×d} is a prescribed covariance matrix. The response vector y is generated according to (52). The model is not well-specified, unlike Example 1, and the Shapiro & Wilk (1965) test gives a p-value of 1.253 × 10⁻⁷. Table 3 summarizes the results of the MLE and MSE-CV estimations before and after applying the RPIE, compared with the Full-Bayesian approach. The accuracy Q² of both models is satisfactory and slightly improved when using the MSE-CV method. However, before applying the RPIE, the Prediction Intervals are overestimated for both models: the CP does not correspond to the required level of 90%, and the MSE-CV model performs even worse. We note that the Full-Bayesian approach does not improve the quality of the estimated Prediction Intervals, for the same reason as explained before: the hyperparameters' posterior distribution p(σ², θ | y) is concentrated around the MLE estimator, so the performances of both approaches are similar.
We will see that this claim is also valid in Example 3.
We now address the problem of Prediction Interval estimation for each quantile and consider the Prediction Intervals built according to the RPIE method. The CP obtained on the testing set is very close to the required level. Concerning the computational time, applying the RPIE method to the MLE or MSE-CV solutions only takes a few minutes in this example, whereas the Full-Bayesian approach is still computationally heavy, as already discussed in the previous example and in Section 1.
Example 2 is a case of a misspecified model with noise, in which the CPs obtained by the MLE, MSE-CV and Full-Bayesian methods are not good. The RPIE method fulfills its purpose: it reduces the Prediction Interval width and improves the robustness of the Prediction Intervals in such a way that they achieve the optimal coverage rate.

Example 3: Misspecified model without noise -Zhou function -
The Zhou (1998) function, considered initially for the numerical integration of spiky functions, is defined on [0, 1]ᵈ by

f(x) = (10ᵈ / 2) [φ(10 (x − 1/3)) + φ(10 (x − 2/3))],

where φ(t) = (2π)^{−d/2} exp(−tᵀt / 2) is the standard multivariate normal density. In Example 3, we create an experimental design X similar to Example 1, containing n = 600 observations and d = 10 variables, where observations are sampled independently with uniform distribution over [0, 1]ᵈ. As the Zhou function in (53) takes some high values, we generate the response y by applying a logarithmic transformation, y⁽ⁱ⁾ = log f(x⁽ⁱ⁾). Note that there is no measurement noise here. We will address two situations. In the first setting, we assume that we know there is no measurement noise; we impose that there is no nugget effect in the model (σ² = 0) and we consider the Exponential anisotropic geometric correlation model (ν = 1/2) as the covariance model. In the second setting, we assume that we do not know whether there is measurement noise, and we estimate the nugget effect of the model. We consequently consider the Matérn 3/2 anisotropic geometric correlation model (ν = 3/2), a reasonable choice for a smooth covariance model when assuming a nugget effect (see Appendix B for further discussion).
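The Zhou function and the log-transformed response are straightforward to implement. A sketch, assuming the usual form of this test function, f(x) = 10ᵈ/2 [φ(10(x − 1/3)) + φ(10(x − 2/3))], with φ the standard multivariate normal density (the function names are ours):

```python
import numpy as np

def zhou(x):
    """Zhou (1998) spiky test function on [0, 1]^d:
    f(x) = 10^d / 2 * (phi(10 (x - 1/3)) + phi(10 (x - 2/3))),
    with phi the standard d-variate normal density."""
    x = np.atleast_2d(np.asarray(x, float))
    d = x.shape[1]

    def phi(t):
        return (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * np.sum(t * t, axis=1))

    return 10.0**d / 2.0 * (phi(10.0 * (x - 1.0 / 3.0)) + phi(10.0 * (x - 2.0 / 3.0)))

def log_response(x):
    """Log-transformed regression response, y = log f(x)."""
    return np.log(zhou(x))
```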
In Table 4, the models are good in terms of accuracy Q², with a small advantage for the Full-Bayesian approach, but none of them satisfies the required level of CP; the MSE-CV model in particular has an extremely low CP. As we do not estimate the nugget effect in this setting, the computational time of the MLE method is low (a few seconds), whereas the RPIE still takes a couple of minutes, as in Example 2. We will notice (also in the industrial application) that the computational time of the RPIE method is generally two to three times that of the MLE method when there is a nugget effect.
When proceeding similarly to Example 2 to build robust Prediction Intervals by the RPIE method, the result is striking. Table 4 also shows that the CPs for the testing set are close to their desired value 1 − α = 90%.
In the second setting, the nugget effect is estimated to σ̂² = 1.71 × 10⁻², and the results are reported in Table 5. The accuracy is still satisfying and similar to the previous setting, but the CP is close to 100%, meaning that the Prediction Intervals of all three methods are overestimated. Table 5 shows that, with the RPIE method, we reduce the Prediction Interval width: the intervals are five times shorter than the Prediction Intervals of the MSE-CV solution and three times shorter than those of the MLE solution. The variances of the obtained Prediction Intervals are between the MLE and MSE-CV Prediction Interval variances. One can also notice a decrease of 50% of the MPIW compared to the first setting, while maintaining an optimal coverage of 1 − α = 90%.
Example 3 illustrates a case of a misspecified model without noise, where the RPIE method adjusts the Prediction Interval widths and improves the robustness of the Prediction Intervals so that the CP is respected. One can also conclude that it is preferable to consider a nugget effect to obtain shorter Prediction Intervals and optimal coverage.

Application to Gas production for future wells
In this section, we illustrate the interest of the RPIE method in energy production forecasting. It covers many industrial applications such as battery capacity, wind turbines, solar panel performance or, more specifically, unconventional gas wells where a decline in production may be observed. We show that the RPIE can estimate robust Prediction Intervals, covering the lower bounds of level α/2 = 10% (pessimistic scenario) and the upper bounds of level 1 − α/2 = 90% (optimistic scenario).
Indeed, a fundamental challenge for Oil and Gas companies is to predict their assets and their future production capacities, as this drives both their exploration and development strategy. However, forecasting a well's future production is challenging because subsurface reservoir properties are never fully known.
This makes estimating well production together with its associated uncertainty a crucial task. The PRMS and SEC frameworks (Society of Petroleum Engineers, 2022; Securities & Exchange Commission, 2010) define specific 1P/2P/3P rules for reserves estimates based on quantile estimates:
• 1P: 90% of wells produce more than the 1P prediction (proven);
• 2P: 50% of wells produce more than the 2P prediction (proven + probable);
• 3P: 10% of wells produce more than the 3P prediction (proven + probable + possible).
These rules are to be disclosed to security investors by publicly traded Oil and Gas companies and aim to provide investors with consistent information and associated value assessments. Many Machine Learning algorithms have shown their efficiency in estimating the median 2P (e.g. using GP with the MLE method, or MSE-CV when more interested in point-wise predictions) but fail to estimate 1P and 3P. Thus, the objective of this study is to build a proper estimation of the quantiles p90% and p10% by applying the RPIE method described in Section 4.
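Under a Gaussian predictive distribution, the 1P/2P/3P estimates reduce to plain quantiles, and the rules above can be checked empirically by exceedance frequencies. A minimal sketch, assuming hypothetical predictive means `mu` and standard deviations `sigma` (function names are ours):

```python
import numpy as np
from statistics import NormalDist

def reserve_quantiles(mu, sigma):
    """1P/2P/3P estimates from a Gaussian predictive distribution.

    1P is exceeded by 90% of wells (the 10% quantile), 2P is the median,
    and 3P is exceeded by 10% of wells (the 90% quantile).
    """
    z10 = NormalDist().inv_cdf(0.10)   # standard normal 10% quantile, about -1.2816
    p1 = mu + z10 * sigma
    p2 = mu                            # median of a Gaussian = its mean
    p3 = mu - z10 * sigma              # inv_cdf(0.90) = -inv_cdf(0.10) by symmetry
    return p1, p2, p3

def exceedance_rate(y, q):
    """Fraction of wells whose observed production exceeds the estimate q."""
    return float(np.mean(np.asarray(y) > np.asarray(q)))
```

On a well-calibrated model, the empirical exceedance rates of the 1P, 2P and 3P estimates should be close to 90%, 50% and 10% respectively.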
Our dataset, field data, is derived from unconventional wells. We standardized the data (X, y) and divided it into a 60% − 20% − 20% partition of three datasets: a training set and two validation sets. The response y (Cumulative Production over 12 months in MCFE) is noisy due to the uncertainty of the reservoir parameters in the field. The nugget effect σ² is unknown but estimated to σ̂² = 0.16 using the method of Iooss & Marrel (2017).
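A minimal sketch of this preprocessing step (standardization and the 60% − 20% − 20% partition); the function name and the random shuffling are our assumptions, since the paper does not specify how the partition was drawn:

```python
import numpy as np

def standardize_split(X, y, seed=0):
    """Standardize (X, y) and split into 60% train / 20% / 20% validation sets."""
    rng = np.random.default_rng(seed)
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # center and scale each input variable
    y = (y - y.mean()) / y.std()               # standardize the response
    n = len(y)
    idx = rng.permutation(n)                   # random shuffle before partitioning
    n_tr, n_v1 = int(0.6 * n), int(0.8 * n)
    train, val1, val2 = idx[:n_tr], idx[n_tr:n_v1], idx[n_v1:]
    return (X[train], y[train]), (X[val1], y[val1]), (X[val2], y[val2])
```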
Based on the results drawn from the previous subsection, and for practical reasons (particularly the computational cost of the methods), we will present only the application of the RPIE method to the MLE solution. Indeed, the GP model already requires substantial computing resources to be built and to have its hyperparameters estimated by the MLE method.
In the following, we take the MLE solution as reference, θ₀ = θ̂_ML, in the optimization problem (42) for the quantiles α/2 = 10% and 1 − α/2 = 90%, and we build robust Prediction Intervals at confidence level 1 − α = 80% with the RPIE method. The results are presented in Table 7. When considering the Prediction Intervals estimated by the RPIE method, we can see that the CP is optimal for the training set and close to 1 − α = 80% for both validation sets. Therefore, we fulfil the objective of estimating the upper and lower bounds: the obtained quantiles p90% and p10% respect the 1P and 3P rules mentioned above. Finally, in Figure 3a, we present the estimated Prediction Intervals, defined by the upper bounds P90 and the lower bounds P10, against the true values of y on Validation set I. The x-axis displays well indices ordered with respect to the barycenters of the Prediction Intervals (engineers choose this representation for interpretation purposes). We can see that the Prediction Intervals estimated by the MLE method are not homogeneous, and some of them are long. The RPIE method makes them shorter and more homogeneous, as can be seen in Figure 3b and in the evolution of the standard deviation of widths SdPIW in Table 7.
In a second attempt, and following the engineers' recommendation, we apply a logarithmic transformation to the raw response y, both to avoid non-positive lower bounds and to integrate the heterogeneity between performant and less performant wells. The accuracy of the MLE method now decreases to Q² = 0.615, and the MLE method still overestimates the Prediction Intervals, as can be seen in Table 8.
Most claims of the previous analysis remain true; in particular, we can clearly see (also in Figures 3c and 3d) that the Prediction Intervals obtained by the RPIE are shorter and have reduced standard deviations.

Conclusion
In this paper, we have introduced a new approach for Prediction Interval estimation based on the Cross-Validation method. We use the Gaussian Process model because the predictive distribution at a new point is completely characterized by a Gaussian distribution. We address an optimization problem for the estimation of the model's hyperparameters by considering the notion of Coverage Probability.
The optimal hyperparameters are identified by minimizing the Wasserstein distance between the Gaussian distribution with the hyperparameters obtained by Maximum Likelihood Estimation or Cross-Validation, and the Gaussian distribution with the hyperparameters achieving the desired Coverage Probability. This method is relevant when the model is misspecified. It ensures an optimal Leave-One-Out Coverage Probability for the training set. It also achieves a reasonable Coverage Probability for the validation set when one is available. It can also be extended to other statistical models with a predictive distribution, but more detailed work is needed to consider the influence of the hyperparameters on the Prediction Intervals' coverage and to solve the optimization problem more efficiently in these cases. Finally, it should be possible to include categorical inputs in the covariance function by using group kernels (Roustant et al., 2020), which would extend the application range of the RPIE method.
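For univariate Gaussians, the 2-Wasserstein distance used in this criterion has a simple closed form; a minimal illustration (the function name is ours):

```python
import numpy as np

def wasserstein2_sq_gaussian(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2):
    W2^2 = (m1 - m2)^2 + (s1 - s2)^2."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

# identical distributions are at distance zero; the distance grows with
# both the mean discrepancy and the scale discrepancy
```

For multivariate Gaussians N(m₁, Σ₁) and N(m₂, Σ₂), the analogous closed form is ‖m₁ − m₂‖² + tr(Σ₁ + Σ₂ − 2(Σ₂^{1/2} Σ₁ Σ₂^{1/2})^{1/2}).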

Appendix A. Proofs of Propositions 1 -3.
Appendix A.1. Preliminary lemmas

Lemma 1. Let F be a full rank matrix (hypothesis H1), let K be a positive definite matrix, and let K̃ be defined by K̃ = K⁻¹ (Iₙ − F (Fᵀ K⁻¹ F)⁻¹ Fᵀ K⁻¹). Then Ker K̃ = Im F and K̃ is singular.
Proof. Let K̃ be the matrix defined above. Suppose that x ∈ Im F; then there exists y such that x = Fy, and K̃ x = K⁻¹ (Fy − F (Fᵀ K⁻¹ F)⁻¹ Fᵀ K⁻¹ F y) = K⁻¹ (Fy − Fy) = 0. In the case of Ordinary or Universal kriging, p = rank(F) = dim(Ker K̃) ≥ 1, which means that K̃ is not invertible.
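Lemma 1 can be checked numerically; a small sketch with random matrices (the dimensions and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 2
F = rng.normal(size=(n, p))                 # full-rank trend matrix (hypothesis H1)
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)                 # positive definite covariance matrix

Kinv = np.linalg.inv(K)
# K_tilde = K^{-1} (I_n - F (F^T K^{-1} F)^{-1} F^T K^{-1})
Ktil = Kinv @ (np.eye(n) - F @ np.linalg.inv(F.T @ Kinv @ F) @ F.T @ Kinv)

# Ker(K_tilde) = Im(F): K_tilde annihilates every column of F ...
print(np.allclose(Ktil @ F, 0))              # True
# ... so rank(K_tilde) = n - p and K_tilde is singular
print(np.linalg.matrix_rank(Ktil) == n - p)  # True
```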
Lemma 2 (De Oliveira (2007)) states that K̃ is positive semi-definite.

Proof of Lemma 3. K̃ is a positive semi-definite matrix by Lemma 2, so we can write K̃ = Σⱼ λⱼ uⱼ uⱼᵀ, with λⱼ ≥ 0 the eigenvalues of K̃ and (uⱼ)ⁿⱼ₌₁ an orthonormal basis of eigenvectors (A.5). If K̃ᵢᵢ = 0, then uⱼᵀ eᵢ = 0 for all j such that λⱼ > 0. Therefore K̃ eᵢ = 0, which shows that eᵢ ∈ Ker K̃, that is, eᵢ ∈ Im F by Lemma 1.

Proof of Lemma 4. This lemma is a direct application of Lemma 3 by choosing K = Iₙ.

Appendix A.2. Proof of Proposition 1
From the preliminary lemmas, we now show a result stronger than Proposition 1:

Lemma 5. Under the hypotheses H1 − H3, for any θ ∈ (0, +∞)^d, there exists a suitable σ²_δ ∈ (0, +∞).

Proof. Here σ² > 0. Let us assume that a > 1/2 (i.e. q_a > 0). Then, for θ fixed in (0, +∞)^d, the limit of K when σ² → 0 is well defined. By hypothesis H2 and from Lemma 4, we can write the corresponding Leave-One-Out quantities for all i ∈ {1, . . . , n}, and we obtain their limit when σ² → +∞. By Lemma 3, we have (R_θ)ᵢᵢ > 0 for all i ∈ {1, . . . , n}. With δ small enough, satisfying δ < q_a, we obtain the two limits of ψ_a^(δ). Since k < an < n by hypothesis H3 and since ψ_a^(δ) is continuous, the Intermediate Value Theorem gives the existence of σ²_δ ∈ (0, +∞) attaining the required value, which gives the desired result.
Similarly, if a < 1/2, then q_a < 0 and (A.15) holds. When δ < q_{1−a}, one obtains the analogous limits. By hypothesis H3, one then has the existence of σ²_δ ∈ (0, +∞) satisfying the required equality, which completes the proof of the lemma.

Appendix A.3. Proof of Proposition 2
The existence of σ²_opt(λ) for all λ ∈ (0, +∞) results directly from the following Lemma 6:

Lemma 6. For all λ ∈ (0, +∞), H_δ(λ) is a non-empty and compact subset of (0, +∞).

Proof. By the Bolzano-Weierstrass theorem, we extract a convergent sub-sequence. When there is a nugget effect σ² > 0, the limit of K_m := K_{σ²_m} when m → +∞ exists because the matrix K_∞ is nonsingular by Auxiliary Fact 1 of De Oliveira et al. (2001). From hypothesis H1, e is a column of F. By hypothesis H2, the Leave-One-Out formulas (16-17) apply for all i ∈ {1, . . . , n}. If a > 1/2, for example, then by definition of σ²_opt(λ_φ(m)) one obtains a contradiction. Therefore lim_{λ→+∞} σ²_opt(λ) = +∞ and L is coercive. The case a < 1/2 can be addressed in the same way.
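The Leave-One-Out formulas (16-17) invoked here are the classical virtual cross-validation identities. The sketch below verifies them numerically for a zero-mean GP (simple kriging), which is a simplifying assumption on our part: for a zero-mean GP with covariance K, the LOO predictive mean and variance at point i are y_i − [K⁻¹y]_i / [K⁻¹]_ii and 1 / [K⁻¹]_ii.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.sort(rng.uniform(0, 1, n))
# Exponential covariance (Matern nu = 1/2), unit variance, length-scale 0.3, small jitter
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.3) + 1e-8 * np.eye(n)
y = rng.multivariate_normal(np.zeros(n), K)

Kinv = np.linalg.inv(K)
alpha = Kinv @ y
# Virtual LOO identities (Dubrule-type formulas) for a zero-mean GP:
mu_loo = y - alpha / np.diag(Kinv)          # LOO predictive means
var_loo = 1.0 / np.diag(Kinv)               # LOO predictive variances

# Brute-force check for one point: condition the GP on all other observations
i = 7
mask = np.arange(n) != i
K_mm = K[np.ix_(mask, mask)]
k_m = K[mask, i]
mu_bf = k_m @ np.linalg.solve(K_mm, y[mask])
var_bf = K[i, i] - k_m @ np.linalg.solve(K_mm, k_m)
print(np.isclose(mu_loo[i], mu_bf), np.isclose(var_loo[i], var_bf))  # True True
```

The universal-kriging version of these identities, as used in the paper, replaces K⁻¹ with the matrix K̃ of Lemma 1 to account for the trend.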

Appendix B.2. Proof of the Coercivity
Let us assume that, under some conditions on y, λ → σ²_opt(λ) is well-defined for all λ ∈ (0, +∞). In the absence of a nugget effect (σ² = 0), the limit of R_{λθ₀} does not exist when λ → +∞. Still, we can assume that the correlation matrix admits the expansion R_{λθ₀} = J + g_λ D₀ (1 + o(1)), where λ → g_λ is a continuous function such that lim_{λ→+∞} g_λ = 0.
- D₀ and J = eeᵀ are fixed symmetric matrices.
D₀ can be singular or nonsingular depending on the chosen kernel k. A review of Yaglom's book (Rosenblatt, 1989) shows that D₀ is nonsingular only for Power-Exponential kernels (q < 2) and Matérn kernels with smoothness parameter ν < 1, like the Exponential kernel (ν = 1/2 in (3)). For the remaining Matérn kernels, with smoothness parameter ν ≥ 1, D₀ becomes singular.
In this case, let D_λ = g_λ D₀ (1 + o(1)), and consider the matrix R_{λθ₀} in K = σ⁻² R_{λθ₀}. Analogously to the proof of Proposition 3, if we assume that lim_{λ→+∞} σ²_opt(λ) = +∞, then by taking a sub-sequence (σ²_opt(λ_ψ(m)))_{m∈N} converging to σ²_∞, we obtain a limit which is contradictory, and this completes the proof.
In this case, one needs to go further in the Taylor expansion of R_{λθ₀}. We consider the matrix W in Lemma 3, given by Lemma 6 of Ren et al. (2012). By setting Σ_λ = Wᵀ R_{λθ₀} W, the asymptotic study of R_{λθ₀} is equivalent to the asymptotic study of Σ_λ. In the case of a Matérn kernel with noninteger smoothness ν ≥ 1, the matrix Σ_λ can be written as (Muré, 2021)

Σ_λ = g_λ Wᵀ D₁ W + g*_λ Wᵀ D*₁ W + R_g(λ), (B.18)

where
- either g_λ = cλ^{−2k₁} with k₁ a nonnegative integer, or g_λ = cλ^{−2ν};
- D₁ and D*₁ are both fixed symmetric matrices, with elements indexed by the exponent k, where k ∈ {k₁, ν} for D₁ and k = l for D*₁.
The matrix Wᵀ D₁ W + g*_λ Wᵀ D*₁ W is nonsingular when λ → +∞, whether Wᵀ D₁ W is nonsingular or singular.
The case where Wᵀ D₁ W is nonsingular occurs for Matérn kernels with smoothness 1 ≤ ν < 2 (Muré, 2021), whereas the other case occurs for regular and smooth Matérn kernels with ν ≥ 2. The latter kernels are, however, less robust in uncertainty quantification, so we will only give the proof for the less smooth kernels with 1 ≤ ν < 2, in particular the Matérn 3/2 kernel.
In this case, we write Σ_λ in (B.18) as

Σ_λ = g_λ Wᵀ D₁ W (Iₙ + g*_λ (Wᵀ D₁ W)⁻¹ Wᵀ D*₁ W) + R_g(λ). (B.19)

As W is a full rank matrix, Σ_λ is non-singular (B.20). Let M_λ = g*_λ (Wᵀ D₁ W)⁻¹ Wᵀ D*₁ W + R_g(λ). Since M_λ → 0 as λ → +∞, we can assume that ‖M_λ‖ < 1 when λ is large enough and apply the Taylor series expansion at order 1. Plugging this quantity into equation (B.20), we can finally write the matrix R_{λθ₀} accordingly, and further simplify the resulting expression. With Lemma 6 and hypothesis H6, the proof of the divergence of σ²_opt(λ) when λ → +∞ is similar to the previous case where D₀ is nonsingular.