Jointly Robust Prior for Gaussian Stochastic Process in Emulation, Calibration and Variable Selection

Gaussian stochastic process (GaSP) has been widely used in two fundamental problems in uncertainty quantification, namely the emulation and calibration of mathematical models. Some objective priors, such as the reference prior, are studied in the context of emulating (approximating) computationally expensive mathematical models. In this work, we introduce a new class of priors, called the jointly robust prior, for both the emulation and calibration. This prior is designed to maintain various advantages from the reference prior. In emulation, the jointly robust prior has an appropriate tail decay rate as the reference prior, and is computationally simpler than the reference prior in parameter estimation. Moreover, the marginal posterior mode estimation with the jointly robust prior can separate the influential and inert inputs in mathematical models, while the reference prior does not have this property. We establish the posterior propriety for a large class of priors in calibration, including the reference prior and jointly robust prior in general scenarios, but the jointly robust prior is preferred because the calibrated mathematical model typically predicts the reality well. The jointly robust prior is used as the default prior in two new R packages, called"RobustGaSP"and"RobustCalibration", available on CRAN for emulation and calibration, respectively.


Introduction
A central part of the modern uncertainty quantification (UQ) is to describe the natural and social phenomena by a system of mathematical models or equations. Some mathematical models are implemented as computer code in an effort to reproduce the behavior of complicated processes in science and engineering. These mathematical models are called computer models or simulators, which map a set of inputs such as initial conditions and model parameters to a real valued output.
Many computer models are prohibitively slow, and it is thus vital to develop a fast statistical surrogate to emulate (approximate) the outcomes of the computer models, based on the runs at a set of pre-specified design inputs. This problem is often referred as the emulation problem. Another fundamental problem in UQ is called the inverse problem or calibration, where the field data are used to estimate the unobservable calibration parameters in the mathematical model. As the mathematical model can be imprecise to describe the reality, it is usual to address the misspecification by a discrepancy function. Emulation and calibration are the main focus in many recent studies in UQ (Bayarri et al., 2007;Higdon et al., 2008;Bayarri et al., 2009;Liu et al., 2009;Conti and O'Hagan, 2010).
A Gaussian stochastic process (GaSP) is prevalent for emulating expensive computer model (Sacks et al., 1989;Bastos and O'Hagan, 2009) for several reasons. First of all, many computer models are deterministic, or close to being deterministic, and thus the emulator is often required to be an interpolator, meaning that the predictions by the emulator are equal to the outputs at the design inputs. The GaSP emulator is an interpolator, and can easily be adapted to emulate the stochastic computer model outputs by adding a noise term. Second, the number of runs of the computer model used to construct a GaSP emulator is often relatively small, which is roughly n ≈ 10p x by the "folklore" notion, where p x is the dimension of the inputs. Third, the GaSP emulator has an internal assessment of the accuracy in prediction, which allows the uncertainty to propagate through the emulator. The GaSP is also widely used to model the discrepancy function in calibration (Kennedy and O'Hagan, 2001;Bayarri et al., 2007), as combining the calibrated computer model and discrepancy function usually improves the predictive accuracy than the prediction using the computer model alone.
The GaSP model used in emulation and calibration is rather different than the one in modeling spatially correlated data. The key difference is that the input space of the computer model usually has multiple dimensions and completely different scales. The isotropic assumption is thus too restrictive. Instead, for any x a , x b ∈ X with p x dimensions, one often assumes a product correlation (Sacks et al. (1989)) where each c l is a one-dimensional isotropic correlation function for the lth coordinate of the input, each typically having an unknown range parameter γ l and fixed roughness parameter α l , l = 1, . . . , p x . This choice of the correlation will be used herein due to its flexibility in modeling correlation and tractability in computation.
The performance of a GaSP model in emulation and calibration depends critically on the parameter estimation of the GaSP model. For the emulation problem, it's been recognized in many studies that some routinely used methods, such as the maximum likelihood estimator (MLE), produce unstable estimates of the correlation parameters (Oakley, 1999;Lopes, 2011). The instability in parameter estimation results in a great loss of the predictive accuracy, as the covariance matrix is estimated to be near-singular or near-diagonal. This problem is partly overcome by the use of the reference prior (Berger et al., 2001;Paulo, 2005), where the marginal posterior mode estimation under certain parameterizations eliminates these two unwelcome scenarios (Gu et al. (2018b)).
Other than the reference prior, many proper and improper priors were previously studied for the GaSP model in emulation and calibration, often with a product form with various parameterizations, including the inverse range parameter β l = 1/γ l , natural logarithm of the inverse range parameter ξ l = log(β l ), and correlation parameter ρ l = 1/ exp(β l ), l = 1, . . . , p x . For example, π(β l ) ∝ 1/β l was utilized in Kennedy and O'Hagan (2001), and π(β l ) ∝ 1/(1 + β 2 l ) was assumed in Conti and O'Hagan (2010). An independent beta prior for ρ l is utilized in Higdon et al. (2008), and the spike and slab prior for the same parameterization is used in Savitsky et al. (2011). Though eliciting the prior information has been discussed in the literature (Oakley (2002)), it is rather hard to faithfully transform subjective prior knowledge to the GaSP model with the product correlation function in (1).
In this work, we propose a new class of priors, called the jointly robust (JR) prior, for both the emulation problem and calibration problem. This prior maintains most of the advantages of the reference prior in emulation, and it has a closed-form normalizing constant, moments and derivatives. In comparison, although the computational operations of the reference prior is normally acceptable, the derivative of the reference prior is more computationally expensive. In practice, the numerical derivatives of the reference prior are often used for the marginal posterior mode estimation, which makes the computation slow. Moreover, the prior moments of the reference prior are unknown and even hard to compute, because of the near-singular correlation matrix when all range parameters are large.
In the calibration problem, we establish the posterior propriety for the calibration problem of a wide class of priors, including the reference prior and JR prior in general scenarios. The identifiability problem of the calibration parameters was found in many previous studies (Arendt et al. (2012); Tuo and Wu (2015)), partly due to the large correlation estimated by the data (Gu and Wang (2017)). Though the posteriors of the reference prior and JR prior are shown to be proper for the calibration problem in this work, the density of the JR prior has slightly larger slope than the density of the reference prior when the range parameters in the covariance function get large, preventing the correlation from being estimated to be too large. Numerical results of the advantages of using the JR prior against the reference prior in calibration will be discussed. Two R packages, called "RobustGaSP" and "RobustCalibration", are developed for the emulation and calibration problems, and the jointly robust prior is used as the default choice in both packages (Gu et al. (2018a); Gu (2018)). Furthermore, another advantage of the JR prior is that it can identify the inert inputs efficiently through the marginal posterior mode of a full model, whereas the mode with the reference prior does not have this feature. The inert inputs are the ones that barely affect the outputs of the computer model. Having inert inputs is a fairly common scenario with computer models. E.g. in the TITAN2D computer model used for simulating volcanic eruption , the internal friction angle has a negligible effect on the output. In emulation, having an inert input can sometimes result in worse predictions than simply omitting them and in calibration, one may hope to spend more efforts in calibrating the influential inputs than the inert inputs. The full Bayesian variable selection of the inputs in a computer model is often prohibitively slow, as each evaluation of the likelihood is computationally expensive, whereas the marginal posterior mode by the JR prior is much faster for identifying the inert inputs.
Compared to other frequently used priors other than the reference prior, the new class of priors studied in this work is not a product of marginal priors of the range parameter or its transformation. The advantage is that the marginal posterior mode estimation with the new prior is both robust and useful in identifying the inert inputs in the computer model. The marginal posterior mode estimation of a product of marginal priors of the parameters in the covariance function, however, may not have both properties at the same time, explained in detail in Section 4.1.
The paper is organized as follows. In Section 2, we review the GaSP model in emulation and calibration, exploring the benefit of the reference prior in emulation that were not noticed before. A general theorem about the posterior propriety is also derived in the calibration setting. In Section 3, we introduce the JR prior, and compare with the reference prior in calibration and emulation. The variable selection problem is introduced in Section 4. The numerical studies of using the JR prior for emulation, variable selection and calibration will be discussed in Section 5. We conclude the paper in Section 6.

Gaussian stochastic process model
In this section, we first shortly introduce the GaSP model in Section 2.1. The model will be extended for emulation and calibration in Section 2.2 and Section 2.3, respectively. The posterior propriety will also be studied in the calibration problem in Section 2.3.
The mean function for any input x ∈ X is typically modeled via the regression where h(x) = (h 1 (x), . . . , h q (x)) is a row vector of the mean basis functions and θ mt is the unknown regression parameter of the basis function h t (·), for t = 1, . . . , q, with q being the number of the mean basis functions specified in the model.
The product correlation function in (1) is assumed and thus the correlation matrix is R = R 1 • R 2 • . . . • R px , where • is the Hadamard product. The (i, j) entry of R l is parameterized by c l (·, ·), a one dimensional correlation function for the lth coordinate of the input, l = 1, . . . , p x . We focus on two classes of widely used correlation functions: the power exponential correlation and Matérn correlation. Define d l = |x al − x bl | for any x a , x b ∈ X . The power exponential correlation has the form where γ l is an unknown nonnegative range parameter to be estimated and α l ∈ (0, 2] is a fixed roughness parameter, often chosen to be a value close to 2 to avoid the numerical problem when α l = 2 Gu and Berger, 2016).
The Matérn correlation has the following form where Γ(·) is the gamma function, K α l (·) is the modified Bessel function of the second kind with the range parameter and roughness parameter being γ l and α l , respectively. The Matérn correlation has a closed-form expression when α l = 2k l +1 2 with k l ∈ N, and becomes the exponential correlation and Gaussian correlation, when k l = 0 and k l → ∞, respectively. Though we focus on these two classes of correlation functions, the results are applicable to other correlation functions discussed in Gu et al. (2018b).

GaSP emulator and the reference prior
The goal of emulation is to predict and assess the uncertainty on the real-valued output of a computationally expensive computer model, denoted as f M (·), based on a finite number of chosen inputs, often selected to fill the input domain X , e.g. the Latin Hypercube Design (Sacks et al., 1989;Santner et al., 2003). Let us model the unknown function f M (·) via a GaSP defined in (2). Denote the outputs of the computer model f M = (f M (x 1 ), . . . f M (x n )) T at n chosen inputs {x 1 , . . . , x n }. Conditional on f M , the GaSP emulator is to predict and quantify the uncertainty of the output at x * by the predictive distribution of f M (x * ).
The GaSP emulator typically consists of mean parameters, a variance parameter and range parameters, denoted as (θ m , σ 2 , γ). The reference prior for the GaSP model with the product correlation was developed in Paulo (2005) and has the following expression with π R (γ) ∝ |I * (γ)| 1/2 , where I * (·) is the expected Fisher information matrix as below where W l =Ṙ l Q, for 1 ≤ l ≤ p x , andṘ l is the partial derivative of the correlation matrix R with respect to the lth range parameter, and Q = After marginalizing out (θ m , σ 2 ) by the prior in (6), we obtain the marginal likelihood, denoted as L(γ | y). As each evaluation of the likelihood requires the inversion of the covariance matrix, which is generally at the order of O(n 3 ), the full Bayesian computation through the Markov Chain Monte Carlo (MCMC) is typically prohibitive. It is common to simply estimate γ by the marginal posterior mode in emulation Some routinely used estimators, such as the maximum likelihood estimator (MLE) and maximum marginal likelihood estimator (MMLE) with regard to L(γ | y) have been found to be unstable in estimating the range parameters in various studies (see e.g. Figure 2 in Li and Sudjianto (2005), Figure 2.2 in Lopes (2011) and Figure 1 in Gu et al. (2018b)). The problem is often caused by the estimation of the range parameters, such that the estimated correlation matrix is near-diagonal (R ≈ I n , where I n is the identity matrix of size n) or near-singular (R ≈ 1 n 1 T n ). More specifically, as shown in Lemma 3.3 in Gu et al. (2018b), the profile likelihood function may not decrease when any γ l → 0, l = 1, . . . , p x , sometimes resulting in the near-diagonal correlation matrix, whereas the marginal likelihood may not decrease when any γ l → 0 or all γ l → 0, l = 1, . . . , p x , leading to the near-diagonal correlation matrix or near-singular correlation matrix, respectively. Thus, the robust estimation of the parameters is defined as avoiding these two possible problems, as follows. It is shown in Gu et al. (2018b) that the marginal posterior mode estimation with the reference prior is robust under γ or ξ = log(1/γ) parameterization, while some other alternatives, such as the MLE and MMLE, do not have this property. Note that the near-diagonal estimation (R ≈ I n ) can easily happen for p x > 1 when a product correlation structure is used because, if any of the matrices in the product correlation matrix is near-diagonal, the correlation matrix will be near-diagonal. Thus, using the maximum marginal posterior mode estimation with the reference prior is particularly helpful, when the dimension of the input is larger than 1.
The reference prior has many other advantages in emulation that were not noticed before. First, when the dimension of the inputs increases, the prior mass moves from the smaller values of γ l to the large values of γ l , for each l = 1, . . . , p x . This is an important property since, as any ofγ l ≈ 0,R is near-diagonal, a degenerate case that should be avoided. When p x increases, the chance that at least one γ l is estimated to be small increases, if the prior mass does not change along with p x and, consequently, the chance thatR ≈ I n also increases. The reference prior adapts to the increase of the dimension by concentrating more prior mass at larger γ l , avoidingR ≈ I n , when p x increases.
Second, when a denser design is used in a fixed domain of the input space, the prior mass of the reference prior parameterized by γ l moves to the domain with smaller values. Figure 1: Density of the reference prior of the log inverse range parameter (up to the normalizing constants). The power exponential correlation function in (4) is assumed where α l = 1.9, 1 ≤ l ≤ p x , with p x = 1 (upper panels) and p x = 2 (lower panels). From left to right, the number of design points are n = 20, n = 40 and n = 80, respectively, all generated from a maximin Latin Hypercube (LHD) on [0, 1] px (Santner et al. (2003)). For all the panels, we assume H = 1 n . This is helpful for the inversion of the covariance matrix in practice, because as points fill with a fixed domain of the input space, the covariance matrix becomes singular ifγ l does not change.
Here we provide a numerical justification of first two properties of the reference prior in Figure 1, where the reference prior density of the log inverse range parameter ξ l = log(1/γ l ) is shown, when the power exponential correlation function is assumed. Comparing the figures with different sample sizes, the mode of the prior moves to the region with larger values of log inverse range parameters (or equivalently the smaller range parameters), when the sample size increases. Comparing the first row and the second row of the panels in Figure 1, the prior mass moves to the region with smaller values of the log inverse range parameters (or equivalently the larger range parameters), when the dimension of the input increases.
The third property of interest is that the reference prior is invariant to the locationscale transformation of the inputs, if the mean basis functions contain only the intercept and the linear terms of x. When we apply a location-scale transformation of each coordinate of the inputx l = x l −c 0l c 1l , for l = 1, . . . , p x , the new reference prior isπ R (γ 1 , . . . , γ px ) = π R (γ 1 /c 11 , . . . , γ p /c 1px ). This makes the prior scale naturally to the range of the inputs; as a consequence, we do not need to normalize the inputs. Note when the mean basis functions contain only the intercept and the linear terms of x, the projection matrix is invariant to the local-scale transformation. Thus, only the scale transformation in the covariance function affects the reference prior and that is equivalent to scaling the range parameter by the same factor.
In addition, the reference prior has an appropriate tail decay rate at the limits when R = I n and R = 1 n 1 T n (Gu et al. (2018b)). When γ l → 0 for any l = 1, . . . , p x , the density of the reference prior decreases at an exponential rate approximately; when γ l → ∞ for all l = 1, . . . , p x , the density of the reference prior deceases at a polynomial rate. The first part of the tail rates induces an exponential penalty to the likelihood when the correlation matrix is near-diagonal, prohibiting the undesired situation in emulation. The posterior with the reference prior has slow polynomial decay rates when γ l is large for all l = 1, . . . , p x (or equivalently R ≈ 1 n 1 T n ), allowing the marginal likelihood to come into play at this limit. The larger range parameter was found to make prediction more precise (Zhang (2004)), and thus a small polynomial penalty from the reference prior both reduces the singular estimation of the covariance matrix and maintains high accuracy in prediction.
Despite various benefits in using the reference prior for emulation, the computational challenges still persist with the use of the reference prior, even if the posterior mode estimation is used in lieu of the posterior sampling. The computational order of the reference prior is O(p x n 3 ), which is mainly from computing W l in (7), for l = 1, . . . , p x , and the inversion of the covariance matrix. However, the closed-form derivatives of the reference prior are very computationally intensive, as it requires to compute x n 3 ), because the computational order of each directional derivative is O(p 2 x n 3 ) for the matrix multiplication. Because of these reasons, the author does not find any literature that provides the closed-form derivatives of the reference prior in this scenario, though some frequently used mode search algorithms, such as the low-storage quasi-Newton optimization method (Nocedal (1980)), typically rely on the information of the derivatives. Instead, one typically computes the numerical derivatives, which requires more evaluations of the likelihood, each with O(n 3 ) in computing the inversion of the covariance matrix, and thus it is also very time consuming. In addition, the reference prior could also induce some extra local modes, making the optimization algorithm harder to converge to the global mode (see e.g. the change of the slope of the reference prior density in the upper middle panel in Figure 1 and another example is given in Figure 3.3 in Gu (2016)).
Some inputs of the computer model may have very small effects on the outputs of the computer model. These inputs are called inert inputs and are often omitted in emulation. When the inert inputs are omitted, a noise term is often added to the GaSP emulator, as the emulator should no longer be an exact interpolator at the design points. The GaSP emulator can be extended to include a noise term or nugget,f M (·) = f M (·) + , where f M (·) still follows a GaSP model and ∼ N (0, σ 2 0 ) is an independent Gaussian noise. Define the nugget variance ratio parameter η = σ 2 0 /σ 2 . The reference prior π R (γ, η) has been derived for the GaSP model with a noise term (Ren et al. (2012); Kazianka and Pilz (2012); Gu and Berger (2016)). The advantages of using the reference prior with a nugget are similar to our previous discussion and are thus omitted here.

GaSP for computer model calibration
Some parameters in the computer model are unknown and unobservable in experiments. We denote the mathematical model output by f M (x, θ), where x is a p x -dimensional vector of observable inputs in experiment and θ is a p θ -dimensional vector of unobservable parameters. The calibration problem is to estimate θ by a set of field data y F := (y F (x 1 ), . . . , y F (x n )) T . In practice, a perfect mathematical model to the reality is rarely the case. It is common to address the model misspecification by a discrepancy function, such that the reality can be represented as where y R (·) and δ(·) denote the reality and discrepancy function, respectively. It leads to the following statistical model for calibration where ∼ N (0, σ 2 0 ) is an independent zero-mean Gaussian noise. For simplicity, we assume f M (·, ·) is computationally cheap to evaluate for now.
As we often know very little about the discrepancy function, the GaSP is suggested in Kennedy and O'Hagan (2001) to model the discrepancy function, i.e. δ(·) ∼ GaSP(μ(·), σ 2 c(·, ·)), where the mean and correlation functions are defined in (3) and (1), respectively. It is usual to define η = σ 2 0 /σ 2 , the nugget-variance ratio parameter for the computational reason, as now σ 2 is a scale parameter which has a conjugate prior.
The parameters in (9) consist of calibration parameters, mean parameters, range parameters, a variance parameter and a nugget parameter in the covariance function, denoted as (θ, θ m , γ, σ 2 , η). Consider the following prior for the calibration problem As the calibration parameters normally have scientific meanings, π(θ) is typically chosen by expert knowledge and thus we do not give a specific form herein. To the author's knowledge, the posterior propriety has not been shown for the above prior in the calibration problem, except for the case that f M (x, θ) is linear with regard to θ. We have the following theorem to guarantee the posterior propriety in this scenario. The proof for Theorem 1 generalizes the proof in Berger et al. (1998), which is a special case with a mean parameter, a variance parameter and two independent observations. Theorem 1. Assume the prior follows (10) for the calibration model in (9) with π(γ, η) and π(θ) being proper priors. Let H y = (H, y F ) be an n × (q + 1) matrix. If H y has full rank and n ≥ q + 1, the posterior is proper.
Proof of Theorem 1. Since H y has full rank and n ≥ q + 1, one can select q + 1 linearly independent rows of H y , denoted as H y0 , such that H y0 is invertible. W.l.o.g., we assume the first q + 1 rows of H y are linearly independent.
We first marginalize out the last n − q − 1 field data and the resulting density is denoted as p(y F 1:(q+1) | θ m , σ 2 , θ, γ, η). As π(θ) and π(γ, η) are both proper, we then marginalize out (θ, γ, η) and obtain the proper marginal density p(y F 1:(q+1) | θ m , σ 2 ). Since (θ m , σ 2 ) are the location-scale parameters for the marginal density, one has where the first equation follows from the definition of the location-scale family and the second equation follows from parameter transformation,ỹ i = y(xi)−h(xi)θm σ , for i = 1, . . . , q + 1, with the Jacobian determinant being Note that the reference prior in (6) is proper for many widely used correlation functions, as long as the intercept is contained in the mean basis matrix, i.e. 1 n ∈ C(H), where C(H) denotes the column space of the mean basis matrix (Gu et al. (2018b)). Theorem 1 states that using the reference prior is legitimate in the calibration problem when the mean basis contains an intercept. Empirically, the reference prior changes very little with an intercept added to the column space of the mean basis matrix.
We have assumed the mathematical model is computationally cheap so far. When the computer model is expensive to run, one can combine the GaSP emulator in the calibration model through a full Bayesian approach. In practice, however, since the field data typically contain larger noises and may not provide much information for the emulation purpose, a modular approach is often used, meaning that the GaSP emulator only depends on the outputs of the computer model (Liu et al. (2009)). We refer to Bayarri et al. (2007) for an overview of combining the GaSP emulator in a calibration model. The modular approach is implemented in Gu (2018), where the parameters in the GaSP emulator are estimated based on the outputs of the computer model (Gu et al. (2018a)). In the calibration, we draw a sample from the posterior predictive distribution of the GaSP emulator when we need to evaluate the computer model.

Jointly robust prior
We introduce a new class of priors for calibration and emulation of mathematical models in this section. In Section 3.1 and Section 3.2, we show that the new prior has all the nice features of the reference prior discussed in Section 2.2. The benefits of the new prior in calibration and identifying the inert inputs in mathematical models will be discussed in Section 3.1 and Section 4, respectively.

Calibration
We first introduce the new prior in the calibration setting, where the model in given in (9) with the discrepancy function δ(·) modeled as a GaSP. Define the inverse range parameter β l = 1/γ l , for l = 1, . . . , p x , and the nugget-variance parameter η = σ 2 0 /σ 2 in the covariance function. The overall prior follows The key part is the prior for the range parameters and nugget-variance parameter, where we call it the jointly robust (JR) prior and the form is given by where C is a normalizing constant; a > −(p x +1), b > 0 and C l > 0 are prior parameters. The name "jointly robust" is used to reflect the fact that the prior can't be written as a product of the marginal priors of the range parameter for each coordinate of the input, and it is robust in marginal posterior mode estimation (see Section 3.2 for details). The form px l=1 C l β l + η is inspired by the tail rate of the reference prior at R = 1 n 1 T n shown in Lemma 4.1 in Gu et al. (2018b). Besides, the JR prior is a proper prior. The posterior propriety of using (12) is thus guaranteed by Theorem 1, when π(θ) is proper.
We first show some properties of the JR prior and then discuss the default choice of the prior parameters. First of all, the normalizing constant of the prior is given as follows.

Lemma 1. (Normalizing constant.) The jointly robust prior is proper and has the
, where Γ(·) is the gamma function.
Proof of Lemma 1.
The marginal prior mean and variance are given in the following lemma.
Lemma 2. (Prior mean and variance.) For i = 1, . . . , p x , the prior mean and prior variance are given below.
Proof of Lemma 2. We only show the prior mean for β i , as the proof of the prior mean for η is similar. For any 1 Using the similar method for the prior mean, for any 1 ≤ i ≤ p x , we have The prior parameters of the jointly robust prior in Equation (12) consist of the overall scale parameter a, the rate parameter b and input scale parameters C l , l = 1, . . . , p x . First, we let C l = n −1/px |x max l − x min l |, where x max l and x min l are the maximum and minimum values of the input at the lth coordinate, which makes the reference prior invariant to the location-scale transformation of the input. The factor n −1/px is the average distance between the inputs after scaling the inputs by the range, as the average sample size of each coordinate of the input is n 1/px when we have n inputs from a Lattice design at a p dimensional input space. This choice allows the JR prior to match the behavior of the reference prior to the change of dimensions and number of observations. Second, we let b = 1 to have a large exponential penalty to avoid the estimation of R being near-diagonal.
The choice of a is an open problem and may depend on specific scientific goals. In the calibration setting, when a is close to −1 − p x , the prior density is almost flat when log(β) → 0 and log(η) → 0, resulting in the large estimated correlation in some scenarios, which makes the calibrated computer model without the discrepancy function fit the reality poorly (Gu and Wang (2017)). On the contrary, when a is large, the method is biased to small correlation, which makes the prediction less accurate. In the RobustCalibration package (Gu (2018)), a = 1/2 − p x is the default setting for the calibration problem, which balances between prediction and calibration. a = 1/2 − p x , b = 1 and C l = n −1/px |x max l − x min l |, l = 1, . . . , p x , will be used for all numerical comparisons in calibration.
In Figure 2, the densities of the JR prior and reference prior with p x = 1 are graphed in the upper panels. The JR prior matches the reference prior reasonably well. When the number of observations increases, the mass of the JR prior moves to the domain with the larger values of ξ, preventing overwhelmingly large correlation. The densities of the JR prior are graphed in the lower panels with p x = 2. When the dimension of inputs increases, the mass of the JR prior moves to the domain with the smaller values of ξ, preventing the covariance matrix from being estimated to be diagonal. Both features are important for avoiding the degenerate cases discussed in Section 2.2. Furthermore, with a = 1/2−p x , the tail of the JR prior decreases slightly faster than the reference prior when ξ → −∞ shown in Figure 2. This is helpful for the identification of the calibration parameters, an example of which is given in Section 5.3.

Emulation
In this subsection, we discuss parameter estimation with the JR prior in a GaSP emulator introduced in Section 2.2. As computer models are often deterministic, the JR prior has the following form . The JR prior in (13) is a special case of (12) with η = 0, so the properties of the prior discussed in Section 3.1 can be easily extended to this scenario.
One important feature of the reference prior is that the marginal posterior mode estimation is robust under the γ and ξ parameterization. Here we show a similar result when the JR prior is used to replace the reference prior in the maximum marginal posterior estimation in (8). Note that the marginal posterior mode with the reference prior under the parameterization of the inverse range parameter β is not robust. Surprisingly, the marginal posterior mode with the reference prior will always be atR = 1 n 1 T n under β parameterization, and should clearly be avoided (Gu et al. (2018b)). By Theorem 2, the marginal posterior mode estimation with the JR prior is robust under the β parameterization if a > 0. It has some added advantages for variable selection, as the posterior is positive if any β l = 0 given in the following remark. (13)  (iii.) When β E → 0 and #E < p x , π JR (β 1 , . . . , β px ) is finite and positive.

Remark 1. (Tail rates.) Assume the JR prior in
The first and second parts of the jointly robust prior match the exponential and polynomial tail decay rates of the reference prior discussed in Section 2.2. The third part is an improvement, which allows the identification of inert inputs by the marginal posterior mode with the jointly robust prior, discussed more in the next section. Note that the third part only holds for the parameter estimation under the parameterization by the inverse range parameter β, while the JR prior loses such property under the parameterization by the other parameterizations, e.g. γ and ξ. Thus we propose the following marginal posterior mode estimation with the reference prior where π JR (·) is the JR prior in (13), with a > 0, b > 0 and C l > 0. Here we use the same default prior parameters b = 1 and C l = n −1/px |x max l − x min l | for the reasons discussed in Section 3.1. a = 1/5 is implemented in the RobustGaSP Package as a default choice for emulation (Gu et al. (2018a)).

Variable selection and sensitivity analysis
This section discusses the issue of selecting influential inputs of computer models. We first introduce a computationally feasible approach of identifying the inert inputs by the JR prior and then discuss the sensitivity analysis approach.

Identifying the inert inputs by the JR prior
Consider a GaSP model to emulate the computer model f M (x). W.l.o.g., we assume the input only appears in the covariance function in (1). Variable selection in this context was considered in some previous studies. In Schonlau and Welch (2006), the variable is selected one by one through a screening algorithm with the functional analysis of the variance, while the number of models to be computed is at the order of p 2 x . In Linkletter et al. (2006), the size of the transformed range parameters is used as an indicator to decide whether the input is influential and the full posteriors are sampled from a Metropolis Hasting algorithm. In Savitsky et al. (2011), a spike and slab prior is used for the transformed range parameters for variable selection. However, the difficulty with the model selection strategy comes from the computational burden, as the model space is 2 px , and each evaluation of the likelihood requires O(n 3 ) operations for n observations. Note that the variable selection is hard in this context, as no closed-form marginal likelihood is available. However, when using the product correlation function in (1), the hope is that, for an inert input l,R l = 1 n 1 T n , in which it will not affect the correlation matrix R. Using the marginal posterior mode estimation with the reference prior, this would happen ifγ l → ∞. However, as shown in the following lemma, marginal posterior mode estimation with robust parameterizations utilizing the reference prior cannot identify inert inputs. The proof follows directly from the tail rate computed in Lemma 4.1 and Lemma 4.2 in Gu et al. (2018b).
Lemma 3. The marginal posterior of range parameters γ (or the logarithm of the inverse range parameters ξ) goes to 0 if some, but not all, γ l → ∞ (or ξ l → 0), l = 1, · · · , p x , for the power exponential and Matérn correlation functions, when the reference prior in (6) is used.
According to Lemma 3, the marginal posterior mode with two parameterizations will never appear atR l = 1 n 1 T n for any l, as the posterior density is 0 ifR l = 1 n 1 T n for some but not all l, l = 1, . . . , p x . The identifiability of inert inputs with the posterior mode estimation, however, requires the posterior density is positive when R l = 1 n 1 T n for some l = 1, . . . , k. Other transformation of the reference prior is also less likely to both maintain the robustness parametrization and identify inert inputs. Such difficulties could lead to inferior results of predictions when some inert inputs are present in computer models.
Note when a product of marginal priors of the range parameters is used, the marginal posterior mode may be either not robust or unable to identify the inert inputs. This is because to identify an inert input by the posterior mode, the posterior density should be positive whenR l = 1 n 1 T n , l = 1, . . . , p x . However, when a product of marginal priors of the range parameters is used, it requires positive posterior density whenR l = 1 n 1 T n , for all l = 1, . . . , p x , as the inert inputs cannot be determined a priori. This often violates the robust estimation in Definition 1, as the posterior mode can appear at R = 1 n 1 T n . Luckily, the marginal posterior mode with the JR prior is positive whenR l = 1 n 1 T n for some but not all l, l = 1, . . . , p x , stated in the following lemma.

Lemma 4. The marginal posterior of the inverse range parameters β is positive if 1 n /
∈ C(H) and some, but not all, β l → 0, l = 1, · · · , p x , for the power exponential and Matérn correlation functions using the JR prior in (13) with a > 0, b > 0 and C l > 0.
The proof of Lemma 4 follows from the tail rate of the marginal likelihood of the GaSP model (Lemma 3.1 and Lemma 4.1 in Gu et al. (2018a)) and the tail rate of the JR prior in Remark 1. When β l = 0, the lth input is not in the covariance in GaSP model. In practice, the exact zero estimation is not likely to be obtained. Thus, we use the normalized inverse range parameters as an indicator of the importance of each input where (β 1 , . . . ,β px ) are estimated in Equation (14). The involvement of C l is to take the scale of the input into account. The part px l=1 C lβl in the denominator is the overall size of the estimation and the C lβl is the contribution by the lth input.
The size of the transformed range parameters has been used to infer which the input is inert, but with a different prior (Linkletter et al. (2006)). However, the jointly robust prior yields better results, as compared in Section 5.2.
One may use a certain threshold of the normalized inverse range parameters to predict whether the input is inert or not, i.e.
where p 0 may be chosen as a constant between 0 to 1, l = 1, . . . , p x . Such value could also depend on the number of observations, dimension of the inputs and expected number of inputs to be chosen. However, because typically all inputs affect the outputs in a computer model, the threshold might be less important and can be chosen based on the scientific goal. We do not try to present a method for the full model selection, as each computation of the likelihood can be expensive. The point, here, is that the computation of P l does not take any extra computation (as the posterior modes are typically needed for building a GaSP model), and can serve as an indicator of the inert inputs.

Sensitivity analysis
Sensitivity analysis of computer model studies how changes of inputs affect the outputs. The inputs, in the computer model, typically associate with a distribution π(x), reflecting the belief of the input values. One of the main goal related to the sensitivity analysis is to identify how much a set of inputs influence the variability of outputs, which is studied through the functional analysis of the variance (functional ANOVA).
It is possible to decompose a function f M (·) as follows (Hoeffding (1948)) where x i,j = (x i , x j ) and x = (x 1 , . . . , x px ). One can obtain these element functions by taking expectation on x: , and so on. Here z i (x i ) is often referred as the main effect and z i,j (x i,j ) is referred as the second order effect.
The variance of the function can be decomposed below (Efron and Stein (1981)) Two principal measures called the main effect index and the total effect index were defined as follows (Sobol' (1990)) . S i is referred as the main effect index of x i and S Ti is referred as the total effect index of x i .
As pointed out in Oakley and O'Hagan (2004), S i has very clear interpretation. If we were to know the real value of the ith input, denoted as x r i , the uncertainty left is thus Since we do not know x i , it is common to take the expectation. Consequently, the decrease of the variance is then Var[E[f M (x) | x i ]] = V i . It means if we were able to select one input to explore its true value, we will select x i that maximizes V i . However, if one were able to select two inputs to explore, the answer is not to select the largest main effect index, but to select the largest V i,j : Thus, many higher order indices are needed to compute if one is interested in exploring more than the first few influential inputs. Main effect indices may serve as an approximation, and they are frequently used due to the computational reason.
When f M (x) and π(x) have simple forms, the main effect indices and higher order indices may be computed explicitly. However, these indices generally do not have a closed-form expression, thus the numerical estimation of these indices becomes important. Monte Carlo methods are proposed to evaluate these indices (Sobol' (2001)).
The shortage of the Monte Carlo method is that lots of computer model runs are often needed for the numerical estimation, which is unrealistic when the computer model is slow. One approach that significantly reduces the number of evaluation of the functions is discussed in Oakley and O'Hagan (2004). The idea is to use a small number of runs to fit the GaSP emulator and the posterior predictive distribution is used to replace the computer model outputs. The estimation of the indices can be implemented based on the emulator built on only very small number of runs from the computer model. We compare with these methods in Section 5.

Emulation
We numerically compare the predictive performance of the GaSP emulator using the marginal posterior mode estimation with the JR prior and reference prior. For the reference prior, we choose the log inverse range parameterization, ξ l = log(1/γ l ), because it is both robust and has empirically better predictive performance than the γ l parameterization (Gu et al. (2018b)). Both methods are implemented in the RobustGaSP package (Gu et al. (2018a)). The Matérn correlation with α l = 5/2 and a constant mean function are assumed for all cases. Also included are the results from the DiceKriging Package (Roustant et al. (2012)) with the same correlation function and mean function.
In each experiment, we use n inputs to construct the GaSP emulator, where n is typically chosen to be around 10p x , and then record the out-of-sample normalized root of mean squared error (NRMSE) of n * = 10,000 held-out outputs. We repeat the experiments for N = 200 random designs, generated from the maximin LHD (Carnell (2018)), and report the average normalized root of mean square error (Avg-NRMSE): with x * ij being the ith held-out input in the jth experiment,ŷ(x * ij ) being its prediction andȳ j being the mean of the observed output in the jth experiment, j = 1, . . . , N.
We test the following functions (implemented in Surjanovic and Bingham (2017)).
iv. Y = 10 sin(πX 1 X 2 ) + 20(X 3 − 0.5) 2 + 10X 4 + 5X 5 , where X i ∈ [0, 1], for i = 1, 2, 3, 4, 5. The Avg-NRMSEs of three methods of the five testing functions in Example 1 are shown in Table 1. The Avg-NRMSE of the methods with the reference prior and JR prior is similar, whereas the computational time of the JR prior is smaller, as the closed-form derivatives of the JR prior are known. The DiceKriging is the fastest method among three but the predictions are not as good as the robust methods.
When the sample size increases, the difference of the computational time for searching the marginal posterior mode between the JR prior and reference prior typically becomes larger (shown in Figure 4.4 in Gu (2016)). This is because the numerical

Variable selection
We first study the following example reported in Linkletter et al. (2006).
In Example 2, the last six input variables and the last two input variables are noises in case i and ii in Example 2, respectively. We use n = 54 design points and the Gaussian correlation function, same as in Linkletter et al. (2006). N = 1,000 random designs are generated from the maximin LHD design (Carnell (2018)). Both functions in Example 2 are linear, however, in the GaSP model, we only use the constant mean function, pretending that we don't know the linear trend of the real function.
The normalized inverse range parameterP l of Example 2 is shown in Figure 3. In the left figure, it is clear that the first four inputs are much more important than the rest of input. Indeed these are 4 signals while the other are noises. The second figure shows that the normalized inverse range parameters can identify the largest 3 to 4 signals.
A detailed comparison of our method and the reference distribution variable selection (RDVS) method introduced in Linkletter et al. (2006) is given in Table 2. In RDVS, a noise input is added to the design in each simulation, and the posterior median of the transformed range parameter of the inert input is recored. After many simulations, the transformed range parameters of the other inputs are compared to the percentile of the posterior median of the transformed range parameter of the inert input to determine whether an input is inert or not. In both cases, usingP l with the JR prior has   Table 2: Proportion of times each input is identified as influential inputs in Example 2 by the JR prior with different p 0 in (15) and RDVS method in Linkletter et al. (2006) with different percentiles (PT).
smaller false positives and false negatives, compared with the RDVS method, shown in Table 2. Furthermore, our method only relies on the marginal posterior mode of the range parameters in the full model, which is also faster than the RDVS method.
The cutoff value p 0 in our method can be hard to define. However, typically all inputs influence the outputs of the computer models. The task is then not to identify the true signals, but to identify what set of inputs are more important than the others. This seems successful for both functions in Example 2, as the importance of the factors is correctly ordered, shown in Figure 3.
In Example 3, we test the four functions (implemented in Surjanovic and Bingham (2017)) to explore whether the method can identify the signals. For the JR prior, we use the normalized inverse range parameter in (14) of each input as the index of the signal. We evaluate the performance by the fraction of times when the smallest index of the signals is larger than the largest index of the noises over N experiments.  Table 3: Tested sample size and the fraction of times when the smallest index of the signals is larger than the largest index of the noises recorded in the bracket for Example 3 by the different methods over N = 200 experiments.
In Table 3, the Sobol method with the Monte Carlo samples and its variants needs many more computer model runs to identify the signals, while the Sobol GaSP method needs many fewer runs, consistent with the previous study in Oakley and O'Hagan (2004). The Sobol GaSP method is not as good as the normalized inverse range parameter with the JR prior. One possible reason is that the Sensitivity package (Pujol et al. (2007)) utilizes the DiceKriging package (Roustant et al. (2012)) for the GaSP emulator, which is not as accurate as the robust GaSP emulator in prediction, discussed in Section 5.1.

Calibration
In this section, we compare the GaSP and S-GaSP calibration using a pedagogic example studied in Bayarri et al. (2007).
Example 4. The sampling model is y F (x) = y R (x)+ with the target function y R (x) = 3.5 exp(−1.7x) + 1.5, the computer model f M (x, θ) = 5 exp(−θx) and an independent Gaussian noise ∼ N (0, 0.3 2 ). Thirty observations are recorded at 10 different x i ∈ [0, 3], each with three repeated experiments shown in Bayarri et al. (2007). The goal is to estimate θ and predict the outputs at [0, 5]. Because the uncertainty of the calibration parameters is important in calibration, sampling from the posterior is typically more preferred than the MLE or posterior mode estimation. The Markov Chain Monte Carlo Algorithm is implemented in RobustCalibration package (Gu (2018)). We compare the GaSP calibration model in (9) with the reference prior and JR prior, based on S = 100,000 posterior samples with S 0 = 20,000 burn-in samples. As the computer model does not explain the mean of the process, we add a mean discrepancy term to the computer model (i.e. h(x) = 1) for all the models we considered. The mean discrepancy is treated as a part of the computer model because of its interpretability.
The posterior samples of the parameters with the reference prior and JR prior are graphed in the Figure 4. Since the reference prior has a flatter tail when the log inverse range parameter ξ → 0, the posterior of ξ with the reference prior is much smaller than the one with the JR prior, meaning that the correlation is estimated to be much larger. The large correlation by the reference prior leads to the large values of the posterior samples of the variance parameter, shown in the left panel at the second row in Figure 4, and consequently, the posterior samples of the mean parameter θ m spread widely over [−2 × 10 4 , 2 × 10 4 ], shown in the last panel in Figure 4. In comparison, the posterior mean parameter with the JR prior is much more concentrated, because the tail of the density of JR prior is slightly steeper when ξ → 0, preventing the correlation from being estimated to be too large.
To see the predictive performance of calibration, we test on n * = 200 held-out outcomes at x * i equally spaced at [0, 5] based on the predictive NRMSE in (16) and the The colored solid curves are the predictive mean using the calibrated computer model and discrepancy function, while the colored dashed curves are the predictive mean using only the calibrated computer model. The shaded area is the 95% predictive credible interval for the target function.
In the middle panel, the dash curve and the solid curve almost overlap.
following additional two criteria where CI i (95%) is the 95% posterior credible interval of the reality and L CI (95%) is the average length of the 95% posterior credible interval. For the results by the reference prior and JR prior, NRMSE is calculated for two scenarios. In the first scenario, only the calibrated computer model is used for prediction. Both the calibrated computer model and discrepancy function are used in the second scenario. An efficient method should have relatively low predictive NRMSE for both scenarios, P CI (95%) close to the 95% nominal level and short average credible interval lengths.
We compare the prediction with the reference prior and JR prior in Figure 5. Also included is the prediction with the MLE, in which we first maximize over the mean parameter and variance parameter, and then maximize the profile likelihood to estimate the rest of the parameters numerically. First, the prediction by the calibrated computer model with the reference prior overestimates the mean, caused by the posterior samples of the large correlation and variance parameter shown in Figure 4. In comparison, the prediction by the calibrated computer model with the JR prior is more accurate.
The prediction combining the calibrated computer model and discrepancy function is graphed as the colored solid curves in Figure 5. The model with the JR prior has a lower NRMSE than the one with the reference prior shown in Table 4, and as importantly, produces 95% posterior credible interval covered around 95% of held-out points in the target function. In contrast, the model with the reference prior seems overconfident in their accuracy assessment, caused by the large variance of the posterior mean parameter.
The prediction of the GaSP calibration with the JR prior is also better than the one with the MLE in terms of NRMSE. For the MLE, the uncertainty of the estimated  parameters is typically hard to quantify, when the sample size is small. Besides, the likelihood of the calibration parameter normally has multiple local modes, indicating the MLE should be operated with caution.

Concluding remarks
We have introduced the JR prior for emulation, calibration and variable selection in UQ. This prior performs as well as the reference prior in emulation, because the marginal posterior mode estimation with the JR prior is robust. The JR prior is considerably faster as the closed-form derivative is easy to compute. Furthermore, the marginal posterior mode with the JR prior can identify the inert inputs with no extra computational cost, whereas the marginal posterior mode of the reference prior and other priors may not be both robust in posterior mode estimation and accurate in identifying the inert inputs. In calibration, the JR prior is helpful for parameter identification with the current choice of the prior parameters, which avoids the correlation from being estimated to be too large. A principle way of determining the prior parameters is needed for the tradeoff in predictive accuracy and identifiability of parameters in calibration.
Nonetheless, the reference prior is still a very good prior for the emulation problem, as shown in many previous studies (Paulo (2005); Lopes (2011); Gu and Berger (2016)). We do not expect the JR prior to completely replace the reference prior, but instead offer researchers an additional choice that provides computational efficiency, explicit mean and variance, capability to identify the inert inputs using only the marginal posterior mode of the full model, and tail rates that prevent the correlation from being estimated to be too large in calibration.