hetGP : Heteroskedastic Gaussian Process Modeling and Sequential Design in R

An increasing number of time-consuming simulators exhibit a complex noise structure that depends on the inputs. For conducting studies with limited budgets of evaluations, new surrogate methods are required in order to simultaneously model the mean and variance ﬁelds. To this end, we present the hetGP package, implementing many recent advances in Gaussian process modeling with input-dependent noise. First, we describe a simple, yet eﬃcient, joint modeling framework that relies on replication for both speed and accuracy. Then we tackle the issue of data acquisition leveraging replication and exploration in a sequential manner for various goals, such as for obtaining a globally accurate model, for optimization, or for contour ﬁnding. Reproducible illustrations are provided throughout.


Introduction
Historically, computer experiments have been associated with deterministic black-box functions; see, for example, Sacks, Welch, Mitchell, and Wynn (1989).Gaussian process (GP) interpolators are popular in this setting.Recently, determinism has become less common for more complex simulators, e.g., those relying on agents.Stochastic simulation has opened up many avenues for inquiry in the applied sciences.Data in geostatistics (Banerjee, Carlin, and Gelfand 2004) and machine learning (Rasmussen and Williams 2006), where high-fidelity modeling also involves GPs, not only is noise common but frequently regression tasks involve signal-to-noise ratios that may be low or changing over the input space.In this context, whether for stochastic simulation or for data from observational studies, more samples are needed in order to isolate the signal, and learning the variance is at least as important as the mean.Yet GPs buckle under the weight of even modestly big data.Moreover, few options for heteroskedastic modeling exist.
Replication, that is, repeating experimental runs at the same design with different realizations or measurements, provides a powerful tool for separating signal from noise, offering a pure look at (local) variability.Replication can also yields computational savings in inference, and thus holds the potential to capture input-dependent dynamics in variance in GP regression.For example, in the operations research community, stochastic kriging (SK; Ankenman, Nelson, and Staum 2010) offers approximate methods that exploit large degrees of replication, coupling GP regression modeling on the mean and the variance, through hybrid likelihood and method-of-moments (MoM) based inference, toward efficient heteroskedastic modeling.A downside to SK is a dependence on large degrees of replication in order to support the MoM method.Binois, Gramacy, and Ludkovski (2018) provide similar results but relaxing the necessity of having a large degree of replication, and place inference fully within a likelihood framework.That methodology is the primary focus of the implementation described here.Other approaches for achieving a degree of input-dependent variance include quantile kriging (Plumlee and Tuo 2014); the use of pseudoinputs (Snelson and Ghahramani 2005), also sometimes called a predictive process (Banerjee, Gelfand, Finley, and Sang 2008); and (non-GP-based) tree methods (Pratola, Chipman, George, and McCulloch 2017a).Although the hetGP package has many aspects in common, and in some cases is directly inspired by these approaches, none of these important methodological contributions are, to our knowledge, coupled with an open source R implementation.
In the computer experiments and machine learning communities, sequential design of experiments/active learning (e.g., Seo, Wallat, Graepel, and Obermayer 2000;Gramacy and Lee 2009) and Bayesian optimization (e.g., Snoek, Larochelle, and Adams 2012) are a primary goal of modeling.The hetGP package provides hooks for learning and design to reduce predictive variance, organically determining an input-dependent degree of replication required to efficiently separate signal from noise (Binois, Huang, Gramacy, and Ludkovski 2019); for Bayesian optimization; for contour finding (Lyu, Binois, and Ludkovski 2018), and for leptokurtic responses (Shah, Wilson, and Ghahramani 2014;Chung, Binois, Gramacy, Bardsley, Moquin, Smith, and Smith 2019).A description and detailed examples are provided herein.
Although many R packages are available on CRAN for GP spatial and computer surrogate modeling (e.g., Erickson, Ankenman, and Sanchez 2017, provide a nice empirical review and comparison), we are not aware of any others that provide an efficient approach to fully likelihood-based coupled GP mean and variance modeling for heteroskedastic processes and, moreover, that provide a comprehensive approach to sequential design in that context.The tgp (Gramacy 2007;Gramacy and Taddy 2010) and laGP (Gramacy 2016) packages offer a limited degree of nonstationary and heteroskedastic GP modeling features via partitioning, which means they cannot capture smooth dynamics in mean and variance, nor can they efficiently handle replication in the design.Methods in the hetGP squarely target filling that void.The DiceKriging (Roustant, Ginsbourger, and Deville 2012) package is often used for SK experiments; however, users must preprocess the data and calculate their own moment-based variance estimates.The mlegp package (Dancik and Dorman 2008) offers a more hands-off SK experience, facilitating replicate detection, but without coupled modeling and sequential design features.Other R packages include GPfit (MacDonald, Ranjan, and Chipman 2015), which focuses expressly on deterministic modeling, whereas those like kergp (Deville, Ginsbourger, and Roustant 2018) target flexible covariance kernels, including additive or qualitative versions, in the homoskedastic setting.
The remainder of this paper is organized as follows.Section 2 reviews hetGP's approach to ordinary (homoskedastic) GP regression, and linear algebra identities that enable substantial speedups when the design has a high degree of replication.Section 3 details hetGP's latent variance approach to joint GP likelihood-based inference to accommodate heteroskedastic processes.Section 4 covers sequential design for hetGP models targeting reduction in prediction error, Bayesian optimization, and contour finding.Emphasis is on considerations in implementation, and example of use in practice, throughout.We conclude in Section 5 with a brief summary and discussion.

Gaussian process modeling under replication
Here we address computer experiments involving a function f (x) : x ∈ R d → R requiring expensive simulation to approximate a real-world (physical) relationship.An emulator fN is a regression or response surface fit to input-output pairs (x 1 , y 1 ), . . ., (x N , y N ), where y i = f (x i ) + ǫ i , with a centered (Gaussian) random noise ǫ i , e.g., ǫ i ∼ N (0, r(x i )).Predictive equations from fN , notated as fN (x ′ ) for a new location x ′ , may serve as a cheap surrogate for f (x ′ ) at new inputs x ′ for visualization, sensitivity analysis, optimization, and so forth.The hetGP package targets thrifty GP surrogates for stochastic computer model simulations whose variance and mean are both believed to vary nonlinearly.
Although our emphasis and vocabulary are tilted toward computer surrogate modeling throughout, many of the techniques we describe are equally well suited to applications in machine learning and geostatistics or anywhere else GP regression may be applied.Many of the methods presented here are inspired by advances in those areas.We begin by reviewing hetGP's approach to conventional, homoskedastic GP modeling involving a large degree of replication relative to the number of unique sites where computer simulation responses are measured.

Gaussian process review
Gaussian process regression is a form of spatial modeling via multivariate normal (MVN) distributions, assuming that the data comes from a GP denoted by Y .In the computer surrogate modeling community, one commonly takes a mean-zero GP formulation, moving all the modeling action to the covariance structure, which is usually specified as a decreasing function of distance.The formulation below uses a scaled separable (or anisotropic) Gaussian kernel.
with K N = (K θ (x i , x j ) + τ 2 δ i=j ) 1≤i,j≤N , and Lengthscales θ = (θ 1 , . . ., θ p ) govern the rate of decay of correlation as a function of coordinatewise distance; scale ν adjusts the amplitude of Y N realizations, and nugget τ 2 implements an iid (i.e., nonspatial) variance component for noisy data.For deterministic computer simulations, a zero nugget (τ 2 = 0) is common, although authors have argued that sometimes that can be inefficient (Gramacy and Lee 2012).Other forms for the correlation kernel K θ (•, •) are prevalent, and we shall discuss several others and their "hyperparameters" in due course.Although we shall mostly assume a constant zero trend throughout for notational conciseness, extensions like polynomial trends are relatively straightforward.For example, the hetGP allows a constant mean to be estimated, which can be helpful when GPs are used to smooth latent variances, as discussed in more detail in Section 3.
GPs make popular surrogates because their predictions are accurate, have appropriate coverage, and can interpolate when the nugget is zero.The beauty of GP modeling is evident in the form of its predictive equations, which arise as a simple application of conditioning for MVN joint distributions.Let D N = (X N , y N ) denote the data.GP prediction where k N (x) = (K θ (x, x 1 ), . . ., K θ (x, x N )) ⊤ .These are the so-called kriging equations in spatial statistics.They describe the best (minimizing MSPE) linear unbiased predictor (BLUP).
If the covariance structure is hyperparameterized, as it is in Equation ( 1), the multivariate structure can emit a log likelihood that may be utilized for inference for unknown hyperparameters (ν, θ, τ 2 ): To maximize ℓ with respect to ν, say, one differentiates and solves: The quantity ν is like a mean residual sum of squares.Now, plugging ν into ℓ gives the so-called concentrated log-likelihood, Using the chain rule and that yields closed-form expressions for the partial derivatives with respect to "•", a place holder for the hyperparameter(s) of interest, e.g., (θ 1 , . . ., θ k ) and τ 2 .Unfortunately these cannot be set to zero and solved analytically, but it can be done via numerical methods.
To illustrate and expose the essence of the implementation automated in the hetGP package (but for more involved settings discussed momentarily), we provide the following hand-coded example.The code block below implements the negative log-likelihood for parameters par, whose first ncol(X) settings correspond to the lengthscales θ, followed by the nugget τ 2 .

R> library("hetGP"
The gradient is provided by the code below. Consider a d = 2-dimensional input space and responses observed with noise, coded to the unit square.The data-generating function coded below was introduced by (Gramacy and Lee 2009) to illustrate challenges in GP regression when the response surface is nonstationary.(Heteroskedasticity is a form of nonstationarity; and although here the nonstationarity is in the mean, Binois et al. (2018, see Appendix C) showed that nevertheless hetGP methods can offer an appropriate quantification of predictive uncertainty on these data.)In order to help separate signal from noise, degree 2 replication is used.Replication is an important focus of this manuscript, with further discussion in Section 2.2.Initial designs are Latin hypercube samples, generated by using the lhs package (Carnell 2018).

R> library("lhs"
We use the L-BFGS-B method (via optim() in R), which supports bound constraints, to estimate hyperparameters, plugging in our negative log-likelihood (for minimizing) and gradient.For all parameters, we choose a small (nearly zero) lower bound.For the lengthscales the upper bound is 10, which is much longer than the square of the longest distance in the unit square ( √ 2); and for the nugget τ 2 we chose the marginal variance.

R> xx <-seq
The code below extends that covariance structure to the test set, both between themselves and with the training set.
A characteristic feature of GP predictive or kriging surfaces is that the predictive variance is lower at the training data sites and higher elsewhere.This is exemplified by the darker/indigoer colors near the open circles in the right panel, and lighter/yellower colors elsewhere.In 1D applications the error bars derived from predictive mean and variance surfaces take on a sausage shape, being fat away from the training data sites and thin, or "pinched," at the training locations.
Many libraries, such as those cited in our introduction, offer automations of these procedures.In this manuscript we highlight features of the hetGP package, which offers similar calculations as a special case.The code below provides an illustration.
There are subtle differences between the objective and optimization being performed, which accounts for the slight differences in higher significant digits.The hetGP package offers an S3 predict method that can facilitate prediction exactly in the manner illustrated above; but rather than duplicate Figure 1 here, we shall delay an illustration until later.
So to conclude this warm-up, GPs are relatively simple to specify, and inference involves a few dozen lines of code to define an objective (log-likelihood) and its gradient, and a library routine for optimization.Where do the challenges lie?Increasingly, data in geostatistics, machine learning, and computer simulation experiments involves signal-to-noise ratios that may be low and/or possibly changing over the input space.Stochastic (computer) simulators, from physics, business, and epidemiology, may exhibit both of those features simultaneously, but let's start with the first one first.With noisy processes, more samples are needed to isolate the signal, hindering the practical use of GPs.Evident in the equations above is the need to decompose an N × N matrix in order to obtain K −1 N and |K N |, at O(N 3 ) cost.What can be done?

Speedup from replication
Replication can be a powerful device for separating signal from noise and can yield computational savings as well.If ȳ = (ȳ 1 , . . ., ȳn ) ⊤ collects averages of a i replicates y (1) i , . . ., y and similarly for the variances σ2 i = then the "unique-n" predictive equations are a BLUP: where ), and a i ≫ 1.This is the basis of the stochastic kriging predictor (Ankenman et al. 2010), which is implemented as an option in the DiceKriging and mleGP packages on CRAN.The simplicity of this setup is attractive.Basically, an independent moments-based estimate of variance is used in lieu of the more traditional, likelihoodbased (hyperparametric) alternative.This could be advantageous if the variance is changing throughout the input space, as we shall discuss further in Section 3. Computationally, the advantage is readily apparent.Only O(n 3 ) matrix decompositions are required, which could represent huge savings compared with O(N 3 ) if the degree of replication is high.
However, independent calculations also have their drawbacks, e.g., lacking the ability to specify a priori that variance evolves smoothly over the input space.More fundamentally, the numbers of replicates a i must be relatively large in order for the σ2 i values to be reliable.Ankenman et al. (2010) recommend a i ≥ 10 for all i, which can be prohibitive.Thinking more homoskedastically, so as not to get too far ahead of our Section 3 discussions, the problem with this setup is that it does not emit a likelihood for inference for the other hyperparameters, such as lengthscale θ and scale ν.This is because the S n , ȳi and a i values do not constitute a set of sufficient statistics, although they are close to being so.
The fix involves Woodbury linear algebra identities.Although these are not unfamiliar to the spatial modeling community (see, e.g., Opsomer, Ruppert, Wand, Holst, and Hossler 1999;Banerjee et al. 2008;Ng and Yin 2012), we believe they have not been used toward precisely this end until recently (Binois et al. 2018).Here, the goal is to make the SK idea simultaneously more general and more prescriptive, facilitating full likelihood-based inference.Specifically, the Woodbury identities give ) where 1 k,l is a k×l matrix of ones, D = τ 2 I N = τ 2 1 N,N (or later D = Λ N in the heteroskedastic setting with a diagonal matrix of variances Λ N ).
Not only is storage of K n and U, which may be represented sparsely (or even implicitly), more compact than K N , but the Woodbury formulas show how to calculate the requisite inverse and determinants by acting on O(n 2 ) quantities rather than O(N 2 ).Pushing that application of the Woodbury identities through to the predictive equations, via the mapping νC between SK and our more conventional notation, yields the equality of predictive identities: µ n (x) = µ N (x) and σ 2 n (x) = σ 2 N (x).In words, the typical "full-N " predictive quantities may be calculated identically via "unique-n" counterparts, with potentially dramatic savings in computational time and space.The unique-n predictor is unbiased and minimizes MSPE by virtue of the fact that those properties hold for the full-N one.No asymptotic or frequentist arguments are required.Crucially, no minimum data or numbers of replicates (e.g., a i ≥ 10 for SK) are required, although replication can still be helpful from a statistical efficiency perspective (see Section 4.1).
The same trick can be played with the concentrated log-likelihood.Recall that K n = C n + A −1 n Λ n ; where for now Λ n = τ 2 I n .Later we shall generalize Λ n for the heteroskedastic setting.Using these quantities, we have where νN = N −1 (y Notice the additional terms in νN compared with νn := n −1 ȳ⊤ K −1 n ȳ, comprising of the missing statistics for sufficiency.Since Λ n is diagonal, evaluation of ℓ requires just O(n 3 ) operations, which means hyperparameter inference can proceed far more computationally efficiently than in typical setups.The derivative is available in O(n 3 ) time, too, facilitating numerical schemes such as L-BFGS-B: The potential computational benefit of such a mapping, which can be several orders of magnitude depending on the proportion of replication, has been illustrated using hetGP library calls in (Appendix A.2., Binois et al. 2018), and thus we shall not duplicate these here.However, we will remark that the implementation, which is not detailed in that Appendix, is similar to that outlined in Section 2.1: define negative log likelihood (nLL) and gradient (gnLL) via Equations 6-7, plug those into optim and, conditional on hyperparameters thus inferred, apply predictive equations (4-5).See mleHomGP and predict.homGP,respectively.Under the hood of those user-focused functions, you'll find three important subroutines (not all exported to the namespace, but potentially worth inspecting): • find_reps: pre-processing to find replicates and organize sufficient statistics.
The latter two are careful to apply an optimized sequence of linear algebra subroutines, and deploy custom Rcpp subroutines when vectorization is not immediate.An example is the extraction of the diagonal elements of a matrix product, required by the derivative (7), implemented as hetGP:::fast_diag.
Besides the SK feature offered by the mleGP package, we are not aware of any other software package, for R or otherwise, that automatically preprocesses the data (e.g., via find_reps) into a form that can exploit these Woodbury identities to speed calculations in the face of heavy replication, or with comparable computational advantage.Replication is a common feature in the sampling of noisy processes, and there is a substantial computational benefit to handling this form of data efficiently.

Implementing heteroskedastic modeling in hetGP
The typical GP formulation, utilizing a covariance structure based on Euclidean distance between inputs, emits a stationary process where features in input-output relationships are identical throughout the input space.Many data-generating mechanisms are at odds with the assumption of stationarity.A process can diverge from stationarity (i.e., be nonstationary) in various ways, but few are well accommodated by computationally viable modeling methodology.Even fewer come with public software, such as tgp (Gramacy 2007;Gramacy and Taddy 2010) and laGP (Gramacy 2016).Both of those packages offer a divide-and-conquer approach to accommodate disparate covariance structures throughout the input space.Such a mechanism is computationally thrifty, but it is not without drawbacks such as lack of continuity of the resulting predictive surfaces.
Input-dependent variance, or heteroskedasticity, is a particular form of nonstationarity that often arises in practice and that is increasingly encountered in stochastic computer experiment settings.An example is the so-called motorcycle accident data, available in MASS (Venables and Ripley 2002).These data arise from a series of simulation experiments measuring the acceleration on the helmet of a motorcycle rider before and after an impact, triggering a whiplash-like effect.Actually, the motorcycle data contains replicates, only about two-thirds of the inputs are unique.Ordinary homoskedastic GPs are notoriously inadequate on this example.Consider the following fit via mleHomGP, in comparison to the heteroskedastic alternative mleHetGP described next, both with a Matérn covariance kernel.
R> 60,length = 301), ncol = 1) R> p <-predict(x = Xgrid, object = hom) R> p2 <-predict(x = Xgrid, object = het) Figure 2 shows the resulting predictive surface overlaid on a scatter plot of the data.The thick lines are the predictive mean, and the thin lines trace out a 90% predictive interval.The output from predict separates variance in terms of mean (p$sd2) and residual (nugget-based p$nugs) estimates, which we combine in the figure to show the full predictive uncertainty of the response Observe in the figure that the homoskedastic fit is less than desirable.Specifically, uncertainty is largely overestimated across the first third of times, from left to right.This corresponds to the pre-impact regime in the experiment.Apparently, the predictive surface is overwhelmed by the latter two-thirds of the data, comprising of the higher-variance whiplash regime.Since the model assumes stationarity, a compromise must be made between these two regimes, and the one with stronger support (more data, etc.) wins in this instance.
Foreshadowing somewhat, consider the heteroskedastic fit also shown in the figure.As a consequence of being able to better track the signal-to-noise relationship over the input space, the heteroskedastic estimate of the signal (particularly for the first third of times) is better.acceleration (g) : Homoskedastic (solid red) versus heteroskedastic (dashed blue) GP fits to the motorcycle data via mean (thick) and 95% error bars (thin).Open circles mark the actual data, dots are averaged observations ȳ with corresponding error bars from the empirical variance (when a i > 0).
Predictive uncertainty is highest for the middle third of the times, which makes sense because this is where the whiplash effect is most prominent.The vertical error-bars plotted through each replicated training data point indicate moment-based estimates of variance obtained from the limited number of replicates in this example.(There are nowhere near enough for an SK-like approach.) The methodology behind hetGP was designed to cope with exactly this sort of behavior, but in a more ambitious setting: larger experiments, in higher dimension.If replication is an important tool in separating signal from noise in homoskedastic settings, it is doubly so in the face of input-dependent noise.Learning how the variance is changing is key to learning about mean dynamics.Therefore, handling and prescribing degrees of replication feature prominently in the methodology, and as a means of remaining computationally thrifty in implementation and execution.

Joint Gaussian process modeling
SK, introduced in Section 2.2, accommodates a notion of in-sample heteroskedasticity "for free" via independently calculated moments S n = Diag(σ 2 1 , . . ., σ2 n ), but that is useless out of sample.By fitting each σ2 i separately there is no pooling effect for interpolation.Ankenman et al. (2010) suggest the quick fix of fitting a second GP to the variance observations with "data," (x 1 , σ2 1 ), (x 2 , σ2 2 ), . . ., (x n , σ2 n ), to obtain a smoothed variance for use out of sample.A more satisfying approach from the machine learning literature (Goldberg, Williams, and Bishop 1998) involves introducing latent (log variance) variables under a GP prior and per-forming joint inference of all unknowns, including hyperparameters for both "mean" and "noise" GPs and latent variances, via MCMC.The overall method, which is effectively on the order of O(T N 4 ) for T MCMC samples, is totally impractical but works well on small problems.Several authors have economized on aspects of this framework (Kersting, Plagemann, Pfaff, and Burgard 2007;Lazaro-Gredilla and Titsias 2011) with approximations and simplifications of various sorts, but none (to our knowledge) have resulted in public R software. 1he key ingredient in these works, of latent variance quantities smoothed by a GP, has merits and can be effective when handled gingerly.The methods implemented in hetGP are built around the following methodology.
Let δ 1 , δ 2 , . . ., δ n be latent variance variables (or equivalently latent nuggets), each corresponding to one of the n unique design sites xi under study.It is important to introduce latents only for the unique-n locations.A similar approach on the full-N setup, that is, without exploiting the Woodbury identities, is fraught with identifiability and numerical stability challenges.Store these latents diagonally in a matrix ∆ n , and place them under a GP prior with mean β 0 , for the purpose of spatial smoothing, with lengthscales θ (g) .Then smooth λ i values can be calculated by plugging the above ∆ n quantities into the mean predictive equations.
Smoothly varying Λ n generalize Λ n = τ 2 I n from our earlier homoskedastic setup, when describing our Woodbury shortcuts under replication in Section 2.2.They also generalize the MoM estimated S n from SK.A nugget parameter g controls the smoothness of λ i 's relative to the δ i 's.A nonzero mean is preferable for this second GP since the predictions tend to revert to this mean far from the observations.Instead of estimating this additional hyperparameter or asking the user for a value, we found it better to use the classical plugin estimator of the mean for a GP, that is, β0 = ∆ Variances must be positive, and the equations above give nonzero probability to negative δ i and λ i values.One solution is to threshold values at zero.Another is to model log(∆ n ) in this way instead, as originally described by Binois et al. (2018).Differences in the development are slight.Here we favor the simpler, more direct version, in part to offer a complementary presentation to the one in that paper.
Rather than Goldberg's MCMC, Binois et al. (2018) describe how to stay within a (Woodbury) MLE framework, by defining a joint log-likelihood over both mean and variance GPs: Maximization may be facilitated via closed-form derivatives with respect to all unknown quantities, all in O(n 3 ) time.For example, the derivative with respect to the latent ∆ n values follows where Just like in Section 2, implementation is fairly straightforward once this likelihood and derivative are coded.Wrapper mleHetGP feeds lower-level objective hetGP:::logLikHet and derivative hetGP:::dlogLikeHet into optim after finding replicates with find_reps.Binois et al. (2018) show that l is maximized when ∆ n = Λ n and g = 0.In other words, smoothing the latent noise variables ( 8) is unnecessary at the MLE.Optimizing the objective naturally smooths the latent ∆ n values, leading to GP predictions that interpolate those smoothed values.However, smoothing is useful as a device in three ways: (1) theoretically: connecting SK to Goldberg's latent representation; (2) numerically: annealing to avoid local minima; and (3) practically: yielding a smooth solution when the numerical optimizer is stopped prematurely, which may be essential in big data (large unique-n) contexts.

Student-t variants
Student-t processes generalize GPs, keeping most of their benefits at almost no extra cost, offering an improved robustness to outliers and larger tail noise.Several choices exist in the literature; see, for example, the work of Wang, Shi, and Lee (2017).For implementation in hetGP, pairing with Woodbury and input-dependent noise features Chung et al. (2019), we follow the methodology described by Shah et al. (2014).Briefly, assuming a multivariate-t prior, Y N ∼ T N (α, 0, K N )) with α ∈ (2, ∞] being the additional degrees of freedom parameter, the modified predictive distribution is multivariate-t, and so does its derivatives.These are implemented by mleHetTP with hetGP:::logLikHetTP and hetGP:::dlogLikHetTP under the hood, again with similar optim calls.One additional mathematical detail is worth noting, namely that we employ a different parameterization of K N = σ 2 C N + τ 2 I N in this case, because the plugin value of ν does not apply here.But this parameterization of the covariance enables enables an SK variant of this model (similar to that of Xie and Chen (2017)), by setting log = FALSE in settings and giving the empirical variance estimates in Delta with known (or init for initialization) in mleHetTP.
We note that while TPs are more flexible that GPs, as N goes to infinity they become equivalent.In addition, the estimation of the parameter α can become unreliable (see, e.g., Wang et al. 2017), such that pre-specifying it at a low value may be beneficial.

Package structure and implementation details
Although several implementation details have been provided already, the content here is intended to give further insight into how hetGP works, including pointers to additional features such as fast updates as new data arrive.

Package structure
The main functions in hetGP are either related to GP fitting or their use in sequential design tasks, which are the subject of Section 4. Inferential interface functions mleHomGP, mleHetGP, mleHomTP and mleHetTP return homoskedastic/heteroskedastic GP/TP models in the form of S3 objects.They are completed with S3 methods: • plot, print, predict; • update to add one or several new observations, processing them to identify replicates.Warm-started re-estimation of the hyperparameters can be invoked with argument maxit > 0.
• strip to save a bare representation of the model by removing all pre-computed quantities that enable fast prediction like inverse covariance matrices.Conversely, the rebuild method recomputes all missing quantities.Note that Cholesky decomposition is used in the hyperparameter optimization for speed, while the generalized inverse via MASS's ginv can optionally be used when rebuilding.The extra stability offered by ginv can be important when covariance matrix condition numbers are small; however at the cost of a more expensive decomposition computationally.
Section 4 sequential design tasks can be carried out based on any of the model types with crit_IMSPE, targeting for global predictive accuracy, or crit_optim for Bayesian optimization and contour finding.Based on an initial surrogate model fit, both functions provide the next point to be evaluated according to the selected criterion.We discuss their options as well as additional details and functions related to those tasks in due course.
Data sets are also shipped with the package: the assemble to order ato and Bayes factor bfs data, with examples coming soon in Section 3.4.

Focus on hyperparameter tuning
Three kernel families are available in hetGP, parameterized as follows (univariate form): • Matérn with smoothness parameter 3/2: • Matérn with smoothness parameter 5/2: In hetGP they are referred to respectively as Gaussian, Matern3_2, and Matern5_2 and may be specified through the covtype argument.Multivariate kernels are defined simply as the product of univariate ones, with one lengthscale per dimension.Isotropic versions, i.e., with the same lengthscale parameter in each dimension, can be obtained by providing scalar values of the bound arguments upper and lower.
Selecting an appropriate range for lengthscale hyperparameter bounds (with lower and upper) is difficult: specifications that are too small lead to basically no learning, while overly large values result in oscillations and matrix conditioning issues.Rules of thumb may be conditioned on empirical inter-training point design distance distributions or other a priori considerations.To automate the process of choosing these bounds, hetGP offers the following as a pre-processing step for the coded [0, 1] d input domain unless defaults are manually overridden.Lengthscale lower bounds are chosen such that the correlation for a distance between any two points equal to the 5% quantile on distances is at least 1%.Likewise, an upper bound is defined such that the correlation between two points at the 95% quantile distance does not exceed 50%.These values are further rescaled if the domain is not [0, 1] d .Unless provided, the initial value for g in a homoskedastic model is based on the variance at replicates if there are any; otherwise, it is set to 0.1.
By default, GP (and not TP) models estimate an unknown mean (unless provided by the user with beta0 via the known argument).In the computer experiments literature this setup is known as ordinary kriging, as opposed to simple kriging with a zero mean.The plugin value of the mean, via MLE calculation, may be derived as β0 = . Predictive equations may by modified as follows In particular, utilizing β0 results in an extra term in the predictive variance.
The heteroskedastic setup entertained in hetGP relies on several key points: careful initialization of the latent variables, linking of lengthscale hyperparameters between latent GP and main GP, and safeguards on the variance GP component of the likelihood, namely, the variance smoothness regularization/penalty term.A default initialization procedure, provided in Algorithm 1 and invoked via (settings$initStrategy = "residual"), utilizes known or initial values of hyperparameters that can be passed as a list to known and init, with elements theta for the main GP lengthscale, theta_g for the latent GP ones, g_H the nugget parameter for an homoskedastic GP (i.e., τ 2 ), g the smoothing parameter (that ultimately goes to zero), and the latent variables Delta.Bounds for the hyperparameters of the noise process can be passed with noiseControl as a list.
To reduce the number of hyperparameters and thus ease maximization of the joint loglikelihood, by default the lengthscale hyperparameters of the latent GP theta_g are linked to the ones of the main GP theta by a scalar k_theta_g in [1,100].If this assumption that the noise variance is varying more smoothly than the mean is too limiting, as is the case in one example below, linking between theta and theta_g can be removed with settings$linkThetas = "none".Another option is to use settings$linkThetas = "constr", to use θ estimated in step 11 of Algorithm 1 as a lower bound for theta_g.
1: if Initial θ not provided then if any a i > 5 then 6: g Hom ← 0.05 end if 10: end if 11: Apply mleHomGP on D N with θ and g Hom , obtain or update θ, g Hom and νHom .12: if ∆ not provided then 13: Compute residuals from homoskedastic fit: Fit homoskedastic GP on (x i , δ i ) 1≤i≤n with θ, obtain θ g , g. 17: end if 18: Fit heteroskedastic GP on D N with initial hyperparameters θ Hom , θ (g) , g and ∆.
Algorithm 1: Pseudo code for the default hyperparameter initialization procedure in mleHetGP.
We have found that the initialization described above is sufficient for the majority of cases.But there are some pitfalls related to the joint likelihood objective in Equation ( 9).In fact, this setup may be seen as a bi-objective optimization problem, giving a set of compromise solutions between the log-likelihoods of the main and latent GPs, with a set of Pareto optimal solutions.However, that perspective would require selecting a solution afterwards, putting a human in the loop.If equal weights between objectives are used, as we do here, a solution on nonconvex parts of the Pareto front may not exist or may be impossible to find.Consequently, the solution returned corresponds to high likelihood for the latent GP and low likelihood for the mean GP.We usually observe this behavior when there is not much heteroskedasticity in the data.
To circumvent this undesirable behavior, we observe that our goal here is primarily to improve upon a homoskedastic fit of the data.This is enforced in two ways.First, if the likelihood of the main GP (without penalty) is lower than that of its homoskedastic counterpart, the penalty may safely be dropped from the objective.From the bi-objective point of view, this amounts to putting a constraint on the mean GP likelihood.Second, at the end of the joint likelihood optimization, if the likelihood of the mean GP with heteroskedastic noise is not better than that of the homoskedastic one, then the homoskedastic model can be returned.This latter check can be deactivated with settings$checkHom = FALSE.We also provide the compareGP function to compare two models on the same data based on their likelihood.
Updating an existing model.In sequential design tasks, data is coming point by point, or in batches.Instead of re-estimating all hyperparameters every time this happens, the update function is engineered to re-use previous values as much as possible.Specifically, initializing ∆ n+1 condition upon previous ∆ n values; it is also possible to make this more robust by fusing λ n+1 's prediction with the empirical variance estimation, as proposed by Binois et al. (2019), with method = "mixed".Since g may already be very small, we find that increasing it slightly with ginit when re-optimizing (maxit > 0) can be beneficial as a means of perturbing solver initialization and thus potentially avoiding repeated termination at a sub-optimal local minima.The overall procedure is summarized in Algorithm 2.

Illustrations
Here we consider several examples in turn, introducing three challenging real-world computer model simulation examples.
Susceptible, infected, recovered.Binois et al. (2018) consider a 2D problem arising from a simulation of the spread of an epidemic in a susceptible, infected, recovered (SIR) setting, as described by Hu and Ludkovski (2017).A function generating the data for standardized inputs in the unit square (corresponding to S and I) is provided by sirEval in the hetGP package.Consider a space-filling design of size n = 200 unique runs, with a random number of replicates a i ∈ {1, . . ., 100}, for i = 1, . . ., n.The x 1 coordinate represents the initial number of infecteds I 0 , and the x 2 coordinate the initial number of susceptibles S 0 .
The result is a full data set with N = 9862 runs.The code below gathers our response, which is the expected number of infecteds at the end of the simulation.

R> Y <-apply(X, 1, sirEval)
The code below fits a hetGP model.By default, lengthscales for the variance GP are linked to those from the mean GP, requiring that the former be a scalar multiple k > 1 of the latter.That can be undone, however, as we do below primarily for illustrative purposes.Bayes factor surfaces.Bayes factor (BF) calculation for model selection is known to be sensitive to hyperparameter settings, which is further complicated (and obscured) by Monte Carlo evaluation that interjects a substantial source of noise.To study BF surfaces in such settings, Franck and Gramacy (2018) propose treating expensive BF calculations, via MCMC say, as a (stochastic) computer simulation experiment.The idea is that BF calculations at a space-filling design in the (input) hyperparameter space can be used to map out the space and better understand the sensitivity of model selection to those settings.As a simple warmup example, consider an experiment described by Gramacy and Pantaleo (2010, Section 3.3-3.4)involving BF calculations to determine whether data is leptokurtic (Student-t errors) or not (simply Gaussian) as a function of the prior parameterization on the Student-t degrees of freedom parameter, which they took to be ν ∼ Exp(θ = 0.1).Their intention was to be diffuse, but ultimately they lacked an appropriate framework for studying sensitivity to this choice.Franck and Gramacy (2018) created a grid of hyperparameter values in θ, evenly spaced in log 10 space from 10 −3 to 10 6 spanning "solidly Student-t" (even Cauchy) to "essentially Gaussian" in terms of the mean of the prior over ν.For each θ i on the grid they ran the RJ-MCMC to approximate BF StN by feeding in sample likelihood evaluations provided by monomvn's (Gramacy 2017) blasso to compute a BF, following a postprocessing scheme described by Jacquier, Polson, and Rossi (2004).In order to understand the Monte Carlo variability in those calculations, ten replicates of the BFs under each hyperparameter setting were collected.Each BF StN evaluation, utilizing T = 100000 MCMC samples, takes about 36 minutes to obtain on a 4.2 GHz Intel Core i7 processor, leading to a total runtime of about 120 hours to collect all 200 values saved.The data is provided with the hetGP package.

R> data("bfs") R> thetas <-matrix(bfs.exp$theta, ncol = 1) R> bfs <-as.matrix(t(bfs.exp[, -1]))
For reasons that will become clear momentarily, Franck and Gramacy (2018) fit a heteroskedastic Student-t process, described briefly in Section 3.2.Even though they fit in log-log space, the process is still heteroskedastic and has heavy tails.We take this occasion to remark that data can be provided directly in the structure internally used in hetGP, with unique designs, averages and degree of replication.(Dubrule 1983), here adapted for TPs.Keeping the hyperparameters estimated on the entire data, it provides the prediction at existing designs as if it was unevaluated.This can provide useful information on the quality of the fit, especially on real data sets where the Gaussian/Student process hypothesis may not be valid.

R> bfs1 <-mleHetTP(X
Clearly the BF StN surface is heteroskedastic, even after log transform, and has heavy tails.The take-home message from these plots is that the BF surface is extremely sensitive to hyperparameterization.When θ is small, the Student-t (BF below 1) is essentially a foregone conclusion, whereas if θ is large, the Gaussian (BF above 1) is.A seemingly innocuous hyperparameter setting is essentially determining the outcome of a model selection enterprise.

Observed values
Predicted values Right: output given by the plot method.
Although the computational burden involved in this experiment, 120 hours, is tolerable, extending the idea to higher dimensions is problematic.Suppose one wished to entertain ν ∼ Gamma(α, β), where the α = 1 case reduces to ν ∼ Exp(β ≡ θ) above.Over a similarly dense hyperparameter grid, the runtime would balloon to 100 days, which is clearly unreasonable.Instead it makes sense to build a surrogate model from a more limited spacefilling design and use the resulting posterior predictive surface to understand variability in BFs in the hyperparameter space.Responses are gathered on a space-filling design in α × β-space, via a Latin hypercube sample of size 80, using a recently updated version of the monomvn library to accommodate the Gamma prior, with replicates obtained at each input setting, for a total of 400 runs.The data is also provided by the bfs data object in hetGP.
The story here is much the same as before in terms of β, which maps to θ in the earlier experiment.Near α = 1 (i.e., log 10 α = 0) the equivalence is exact.The left panel shows that along that slice one can get just about whatever conclusion one wants.Smaller α values   tell a somewhat more nuanced story, however.A rather large range of smaller α values leads to somewhat less sensitivity in the outcome because of the β hyperparameter setting, except when β is quite large.Apparently, having a small α setting is essential if the data is going to have any influence on model selection via BF.The right panel shows that the variance surface is indeed changing over the input space, justifying the heteroskedastic surrogate.
Another example with hetTP is provided by Chung et al. (2019), with data augmentation based on a latent variable scheme to account for missing data and to enforce a monotonicity property for solving a challenging class of inverse problems motivated by an influenza example.
Assemble to order.The assemble-to-order (ATO) problem (Hong and Nelson 2006) involves a queuing simulation targeting inventory management scenarios.The setup is as follows.A company manufactures m products.Products are built from base parts called items, some of which are "key" in that the product cannot be built without them.If a random request comes in for a product that is missing a key item, a replenishment order is executed, which is filled after a random delay.Holding items in inventory is expensive, so there is a balance between inventory costs and revenue.Hong and Nelson built a MATLAB simulator for this setup, which was reimplemented by Xie, Frazier, and Chick (2012).Binois et al. (2018) describe an out-of-sample experiment based on this latter implementation in its default setting (originally from Hong and Nelson), specifying item cost structure, product makeup (their items) and revenue, and distribution of demand and replenishment time, under target stock vector inputs b ∈ {0, 1, . . ., 20} 8 for eight items.
Here we provide code that can be used to replicate results from that experiment, which involved a uniform design of size n tot = 2000 in 8D space, with ten replicates for a total design size of N tot = 20000.The R code below loads in that data, which is stored in the data object ato, provided with hetGP.

R> data("ato")
In order to create an out-of-sample test setting, random training-test partitions were constructed by randomly selecting n = 1000 unique locations and a uniform number of replicates a i ∈ {1, . . ., 10} thereupon.The ato data object also contains one such random partition, which is subsequently coded into the unit cube [0, 1] 8 .Further detail is provided in the package documentation for the ato object.Actually the object provides two test sets.One is a genuine out-of-sample test set, where the test sites do not intersect with any of the training sets.The other is replicate based, involving replicates 10 − a i not selected for training.The training set is large, which makes MLE calculations slow, so the ato object provides a fitted model for comparison.In the examples section of the ato documentation, code is provided to reproduce that fit or to create a new one based on a new random training-test partition.

R> c(n = nrow(Xtrain), N = length(unlist(Ztrain)), time = out$time)
n N time.elapsed1000.0005594.0008583.767 The main reason for storing these objects is to enable a fast illustration of prediction and outof-sample comparison, in particular to have something to compare with a more thoughtful sequential design scheme outlined in Section 4. The code below performs predictions at the held-out test locations and then calculates a proper score (Gneiting and Raftery 2007, Equation ( 27)) against the ten replicates observed at each of those locations.Higher scores are better.A function scores in hetGP computes scores and root mean squared error (RMSE).
R> sc <-scores (out, Xtest, matrix(unlist(Ztest), byrow = TRUE, ncol = 10)) A similar calculation for the held-out training replicates is shown below.These are technically out of sample, but accuracy is higher since training data was provided at these locations.
R> sc.out <-scores(model = out, Xtest = Xtrain.out,Ztest = Ztrain.out) Combined, the two test sets provide a single score calculated on the entire corpus of held-out data.As expected, the result is a compromise between the two score statistics calculated earlier.

R> c(test = mean(sc), train = mean(sc.out), combined = mean(c(sc, sc.out)))
test train combined 3.394957 5.087207 4.241082 Binois et al. (2018) repeat that training-test partition 100 times to produce score boxplots that are then compared with simpler (e.g., homoskedastic) GP-based approaches.We refer the reader to Figure 2 in that paper for more details.To make a long story short, fits accommodating heteroskedasticity in the proper way-via coupled GPs and fully likelihoodbased inference-are superior to all other (computationally tractable) ways entertained.

Sequential design
We have been using uniform or space-filling designs in our examples above, and with replication.The thinking is that coverage in the input space is sensible and that replication will help separate signal from noise (and yield computational advantages).Model-based designs, such as maximum entropy and related criteria, are almost never appropriate in this setting.Such designs, since they condition on the model, are hyperparameter sensitive; and before data is collected, we have little to go on to choose good settings for those values.This situation is exacerbated in the heteroskedastic case, requiring the setting of latent inputs and additional hyperparameters.A far better approach is to take things one step at a time: start with a small space-filling design (small N ), perhaps with some replication, and choose the next point, x N +1 , by exploring its impact on the model fit, say through the predictive equations.An interesting question, in such settings, is how much such criteria would prefer to replicate (repeat existing inputs) versus explore (try new locations).
In the classical GP setup for computer simulations, with low or no noise and an assumption of stationarity (i.e., constant stochasticity), one can argue that replication is of little or no value.When signal-to-noise ratios are low and/or changing over the input space, however, intuition suggests that some replication will be valuable.Until recently, however, it was not known how such choices would manifest in typical sequential design or data acquisition decisions.Of course, the details depend on the goal of modeling and prediction.Below we consider two settings: (1) obtaining accurate predictions globally (Section 4.1) and ( 2) optimizing or targeting particular aspects of the predictive distribution such as level sets or contours (Section 4.2).Binois et al. (2019) studied the exploration vs. replication question in the context of improving predictive accuracy by means of sequential design.A common criterion in that setting is the integrated mean-square prediction error (IMSPE), which is just predictive variance averaged over the input space, specifically,

IMSPE
This formula is written in terms of unique-n inputs for reasons that we shall clarify shortly.It could, however, instead be phrased in terms of full-N equations.The idea would be to choose the next design location, x N +1 , which could be a new unique location (x n+1 ) or a repeat of an existing one (i.e., one of x1 , . . ., xn ), by minimizing I n+1 .The presentation in Binois et al. is more careful, but also more cumbersome, with the notation in this regard.
In general, the integral in Equation ( 11) requires numerical evaluation (Seo et al. 2000;Gramacy and Lee 2009;Gauthier and Pronzato 2014;Gorodetsky and Marzouk 2016;Pratola, Harari, Bingham, and Flowers 2017b).For example, the tgp package (option Ds2x=TRUE) sums over a reference set, as well as averaging over the posterior distribution of (treed) Gaussian process parameterization.However, conditional on GP hyperparameters, and when the study region D is an easily integrable domain such as a hyperrectangle, the requisite integration may be calculated in closed form.Although examples exist in the literature (e.g., Ankenman et al. 2010;Anagnostopoulos and Gramacy 2013;Burnaev and Panov 2015;Leatherman, Santner, and Dean 2017) for particular cases (and approximations), we are not aware of examples providing the level of specificity and generality in derivation, or implementation in code as provided in hetGP.
The essence is as follows. where Closed forms for the W ij exist with D being a hyperrectangle, say.Binois et al. (2019) provide forms for several popular covariance functions, including the Matérn, that are coded in hetGP.In the case of the Gaussian, we quote as follows: Having a closed form is handy for evaluation.Even better is that a closed form exists for the gradient, which facilitates numerical optimization for sequential design.We leave the details of that calculation to Binois et al..
To investigate how replication can be favored by IMSPE in choosing x n+1 , consider the following setup.Let ] denote a belief about the (otherwise zero mean and iid) noise process.The form of r(x) can be arbitrary.Below we set up two univariate examples that follow splines that agree at five "knots."In this illustration, the x-locations of those knots could represent design locations x i where responses have been observed.
R> XX <-matrix(seq(0, 1, by = 0.005)) Below we shall refer to these surfaces as "green" and "blue," respectively, referencing the colors from Figure 6.The "knots" are shown as red open circles.
In the figure the blue variance function hypothesis is minimized at a novel x n+1 location, not coinciding with any of the previous design sites.However, the green hypothesis is minimized at x 2 , reading from left to right.The IMSPE calculated on this particular variance function r(x) prefers replication.That is not a coincidence or fabrication.Binois et al. showed that the next point x N +1 will be a replicate, that is, be one of the existing unique locations x1 , . . ., xn rather than a new xn+1 when where k * = argmin k∈{1,...,n} IMSPE(x k ) and B k = .
Basically, this relationship says that IMSPE will prefer replication when the variance is "large enough."Rather than read tea leaves more deeply than that, let us see it in action in our toy example.The code below utilizes some of hetGP's internals to enable evaluation of the right-hand side of the inequality above, in other words, treating it as an equality.
That green hypothesis is, of course, just one example of a variance function that is above the requisite threshold for replication.Also, the hypotheses we used were not derived from GP predictive equations.But the example is designed to illustrate potential.Before turning to a more prescriptive search for new sites, be they replicates or new unique locations, let us summarize what we know.Replication (1) is good for the GP calculations (n 3 ≪ N 3 ); ( 2) is preferred by IMSPE under certain (changing) variance regimes; and (3) helps separate signal from noise.But how often is IMSPE going to "ask" for replications?The short answer is, not often enough, at least from an empirical standpoint.
One challenge is numerical precision in optimization when mixing discrete and continuous search.Given a continuum of potential new locations, up to floating-point precision, a particular setting corresponding to a finite number of replicate sites is not likely to be preferred over all other settings, such as ones infinitesimally nearby (no matter what the theory prefers).
Another issue, which is more fundamental, is that IMSPE is myopic.The value the current selection, be it a replicate or new unique location, is not assessed vis à vis its impact on the future decision landscape.In general, entertaining such decision spaces is all but impossible, although that does not stop people from trying.See Section 4.2 for a related discussion in an optimization context.
Replication-biased lookahead Binois et al. (2019) found that a "replication-biased" lookahead is manageable.Specifically, consider a horizon h into the future.Looking ahead over those choices, one can select a new unique xn+1 and h replicates, each at one of the n + 1 unique locations.Or alternatively, one can take a replicate first and a unique design element later (and there are h ways to do that).A longer horizon means a greater chance of replication but also more calculation: h + 1 continuous searches for new locations and (h + 1)(h + 2)/2 − 1 discrete ones in search of potential replicates.The horizon h can thus be thought of as a tuning parameter.Before considering choices for how to set this value, let us look at an example for fixed horizon, h = 5.
Consider an initial uniform design of size N = n = 10 (i.e., without replicates) and an initial hetGP fit based on a Gaussian kernel.

Return homoskedastic model
Next, let us search via IMSPE with horizon h = 5 lookahead over replication.The call to IMSPE_optim below utilizes the closed form IMSPE (12) and its derivatives within an optim call using method="L-BFGS-B".

R> opt <-IMSPE_optim(mod, h = 5) R> c(X, opt$par)
[1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 [7] 0.6666667 0.7777778 0.8888889 1.0000000 0.2222222 Whether or not the chosen location, in position eleven above (0.222), is a replicate depends on the random seed used to compile this document, so it is difficult to provide precise commentary in-line here.As mentioned earlier in Section 3.3, hetGP provides an efficient updating method, update, utilizing O(n 2 ) or O(n) updating calculations for new data points, depending on whether that point is unique or a replicate, respectively.Details of those updates are provided by Binois et al. (2019), extending existing ones in the literature (e.g., Gramacy and Polson 2011;Chevalier, Ginsbourger, and Emery 2014a;Gramacy and Apley 2015) to the heteroskedastic case.

R> X <-c(X, opt$par) R> Ynew <-fr(opt$par) R> Y <-c(Y, Ynew)
R> mod <-update(mod, Xnew = opt$par, Znew = Ynew, ginit = mod$g * 1.01) That is the basic idea.Let us continue and gather a total of 500 samples in this way, in order to explore the aggregate nature of the sequential design so constructed.Periodically (every 25 iterations in the code below), it can be beneficial to restart the MLE calculations to help "unstick" any solutions found in local modes of the likelihood surface.Gathering 500 points is somewhat of an overkill for this simple 1D problem, but it helps create a nice visualization.

R> for(i in 1
Of the total of N = 500, the final design contained n = 57 unique locations.To help visualize and assess the quality of the final surface with that design, the code below gathers predictive quantities on a dense grid in the input space.

R> xgrid <-seq(0, 1, length = 1000) R> p <-predict(mod, matrix(xgrid, ncol = 1)) R> pvar <-p$sd2 + p$nugs
Figure 7 shows the resulting predictive surface in red, with additional calculations to show the true surface, via f (x) and error-bars from r(x), in black.Gray vertical bars help visualize the degree of replication at each input location.
Observe that the degree of replication, as well as the density of unique design locations, is higher in the high-noise region than it is in the low-noise region.In a batch design setting and in the unique situation where relative noise levels were known, a rule of thumb of more samples or replicates in the higher noise regime is sensible.The trouble is that such regimes are rarely known in advance, and neither are the optimal density differentials and degrees of replication.Proceeding sequentially allows us to learn and adapt the design as we go.

Tuning the horizon
Above we used horizon h = 5, but that was rather arbitrary.Although it is difficult to speculate on details regarding the quality of the surface obtained in Figure 7, improvements are likely possible.Chances are that the uncertainty is overestimated in some regions and x y underestimated in others.A solution lies in adapting the lookahead horizon online.Binois et al. (2019) proposed two simple on-line adjustments that tune the horizon.The first adjusts the horizon of the next IMSPE search in order to target a desired ratio ρ = n/N and thus manage the surrogate modeling cost: The second attempts to adapt to minimize IMSPE regardless of computational cost: and comes from a criterion in the SK literature (Ankenman et al. 2010).
The code below duplicates the example above with the adapt heuristic.Alternatively, a horizon call can be provided target and previous_ratio arguments to implement the target heuristic instead.First, reinitialize the design.R> X <-seq(0, 1, length = 10) R> Y <-fr(X) R> mod.a <-mleHetGP(X = X, Z = Y, lower = 0.0001, upper = 1) R> h <-rep (NA,500) Next, loop to obtain N = 500 observations under the adaptive horizon scheme.The left panel of Figure 8 shows the adaptively selected horizon over the iterations of sequential design.The right panel shows the final design and predictions, versus the truth, matching Figure 7 for the fixed horizon (h = 5) case.Adaptive Horizon Design The left panel of the figure reveals that a horizon of h = 5 is indeed not totally uncommon, but it is also higher than generally preferred by the adaptive scheme, which had n = 140 unique sites-more than the h = 5 case, but in the same ballpark compared with the full size of N = 500.

R> for(i in
The code below offers an out-of-sample comparison via RMSE (lower is better) to the truth and score (higher is better) against a noisy analog.Although speculating on the precise outcome of this comparison is difficult because of the noisy nature of sampling and horizon updates, the typical outcome is that the two comparators are similar on RMSE, which is quite small, but that score prefers the adaptive scheme.Being able to adjust the horizon enables the adaptive scheme to better balance exploration versus replication in order to efficiently learn the signal-to-noise ratio in the data-generating mechanism.The sequential design leads to more accurate predictors than the batch design does, despite having being trained on 40% as many runs.To illustrate how that design was built, we first need to "rebuild" the out.aobject.For compact storage, the covariance matrices, inverses, and so forth have been deleted via the strip method.An alternative for compactness is to use return.matrices= FALSE via settings in the mleHetGP command.

R> out.a <-rebuild(out.a)
The calculation sequence for each step of the search for this design involves first determining the horizon and then searching with that horizon via IMSPE.In code, that amounts to the following.
R> Wijs <-Wij(out.a$X0,theta = out.a$theta,type = out.a$covtype)R> h <-horizon(out.a,Wijs = Wijs) R> control h,Wijs = Wijs,control = control) Precalculating the Wijs saves a little time since these are needed for both horizon and IMSPE_optim.Modifying the default setup may be done with control, a list with elements multistart = 20 for the number of optim calls to be run starting from a maximin LHS design of this size, up to maxit iterations, and tol_dist (resp.tol_diff) defining the minimal distance (resp.criterion difference) with respect to existing designs under which replication is preferred.This task may be performed in parallel with the ncores argument.The 8D location above would then be fed into the computer model simulator and the input output pair subsequently added into the design.An indication of whether or not the new location is unique (i.e., actually new) or a replicate is provided with path along with the IMSPE_optim output, par, and value.The path list contains the best sequence of points found by the lookahead procedure, with elements par, value, and new of the next design plus h further ones selected successively.

R> opt$path[[1]]$new
[1] FALSE Since ATO inputs are actually on a grid, we would have to first snap this continuous solution to that grid (after undoing the coding of the inputs).In so doing we could create a replicate as part of that discretization process.

Contour finding and optimization
So far we have focused on constructing a globally accurate surrogate model, but it is also common to target specific regions of interest.Examples include global minima or level sets (Bogunovic, Scarlett, Krause, and Cevher 2016).
Bayesian optimization.Starting with optimization, namely, finding x * ∈ argmin x f (x), we follow the canonical efficient global optimization framework (Jones, Schonlau, and Welch 1998) with the expected improvement (EI) criterion (Mockus, Tiesis, and Zilinskas 1978).For a review of many variants of this and other so-called Bayesian optimization (BO) methods, see, for example, Shahriari, Swersky, Wang, Adams, and de Freitas (2016) and Frazier (2018).Picheny, Wagner, and Ginsbourger (2012) provide benchmarks specifically for the noisy objective case.In that setting, where more evaluations are needed to separate signal from noise, EI is appealing since it does not need extra tuning parameters to balance exploitation and exploration, does not require numerical approximation, and has a closed-form derivative.
For a deterministic function f , the improvement I : x ∈ R d → R is defined as a random variable.EI is the expectation of the improvement conditioned on the observations, which has a closed-form expression: where y * = min i∈{1,...,n} y i and φ and Φ are the pdf and cdf of the standard normal distribution, respectively.In the noisy case, y * defined in this way is no longer a viable option, but it can be replaced for instance with min i∈{1,...,n} µ n (x i ) as recommended by Vazquez, Villemonteix, Sidorkiewicz, and Walter (2008) or min x µ n (x) as described by Gramacy and Lee (2011).Going back to the 1D example, the code below implements this replication-biased lookahead scheme in a maximization context (more appealing on this test function).Notice that we initialize with a small amount of replication.This is to guard against initial (and incorrect) noiseless surrogate fits that cause numerical issues.Initial designs for Bayesian optimization of noisy functions is still an open area of research.Also, we fix a lookahead horizon of h = 5; developing an analog target and adapt scheme is left for future research as well.
First, we reinitialize the design and fit, but this time with a small degree of replication to stabilize the behavior of this example across knitr builds.
Parallel evaluation is possible via mclapply from parallel R Core Team (2019).

R> library("parallel") R> ncores <-1 # or: detectCores() R> for(i in 1
Next, we obtain predictions on the grid.Observe that this lookahead-biased EI approach leads to substantial replication in the areas of interest, judiciously balancing exploration and exploitation in order to pin down the global minima of the mean surface.The actual number of unique locations is extracted as follows, which is a small fraction of the total budget of N = 500.

R> nrow(mod$X0)
[1] 111 Most of the replication is focused in the areas of primary interest from an optimization perspective, around x ≈ 0.75, where we observe an inferior local maximum with low noise and, more intensively, around the true global optima located at the other end of the input space.EI is also available for TPs, and the corresponding modifications are provided in Appendix A.

Contour finding.
A related problem is contour finding, also known as level-set estimation.
The objective is to identify a region of inputs of interest: Γ f = x ∈ R d : Y (x) > T with T a given threshold, often zero without loss of generality.We describe briefly here the criteria defined by Lyu et al. (2018) for both GPs and TPs that are implemented in hetGP.Several others can be found in the literature (Chevalier, Ginsbourger, Bect, and Molchanov 2013;Bogunovic et al. 2016;Azzimonti, Ginsbourger, Chevalier, Bect, and Richet 2016), with a selection of implementations provided by the KrigInv package (Chevalier, Picheny, and Ginsbourger 2014b;Chevalier, Picheny, Ginsbourger, and Azzimonti 2018).
The simplest criterion is maximum contour uncertainty (MCU), implemented by crit_MCU in the hetGP package.MCU is based on the local probability of misclassification, namely, that f is incorrectly predicted to be below or above the threshold (see also, e.g., Bichon, Eldred, Swiler, Mahadevan, and McFarland 2008;Ranjan, Bingham, and Michailidis 2008).A second criterion, contour stepwise uncertainty reduction (cSUR), implemented by crit_cSUR in hetGP, amounts to calculating a one-step-ahead reduction of MCU.A more computationally intensive, but more global, alternative involves integrating cSUR over the domain in a manner similar to IMSPE for variance reduction or so-called integrated expected conditional improvement (IECI, Gramacy and Lee 2011) for Bayesian optimization.In practice, the integral is approximated by a finite sum, which is the approach taken by crit_ICU in hetGP.The final criterion available, targeted mean square error (tMSE) (Picheny, Ginsbourger, Roustant, Haftka, and Kim 2010), is implemented in crit_tMSE and is designed to reduce the variance close to the limiting contour via a weight function.
For illustration, let us consider finding the zero contour in our simple 1D example.The code below illustrates the crit_cSUR criteria, the hetGP package's generic crit_optim interface.
Otherwise the setup and for loop are identical to those for our IMSPE and EI-based examples.

R> nrow(mod$X0)
[1] 156 In Figure 10, the repartition of unique designs is nontrivial around locations of crossing of the threshold, with larger spread (and less replication) where the crossing is in noisy areas.As with EI, this preliminary setup is likely to be improved upon after further research.

Summary and discussion
We have introduced the R hetGP for heteroskedastic Gaussian process regression, sequential experimental design, and Bayesian optimization.Although the package is designed for dealing with noise and changing signal-to-noise ratios in the setting of Gaussian process regression, it offers a full-featured GP regression approach.Ordinary homoskedastic (and noise-free) GP modeling is supported.When the data is observed with noise, the package implements a Woodbury trick to make inference more efficient, decomposing matrices sized x y by the number of unique input sites, rather than ones sized by the full data set.Moreover, key functions of hetGP are implemented in C++ with Rcpp (Eddelbuettel and François 2011;Eddelbuettel 2013;Eddelbuettel and Balamuta 2018).This leads to dramatically faster inference compared with other GP packages on CRAN when the level of replication is high.
Working only with unique inputs has other advantages, particularly it comes to coupled-GP inference for nonlinearly changing (latent) variance variables along with the usual meanbased analysis.By creating a unifying likelihood-based inferential framework, complete with closed-form derivative expressions, library-based optimization methods (e.g., optim in R) can be deployed for efficient heteroskedastic modeling.Whereas the earlier MCMC-based approaches could at best handle dozens of observations, hetGP can handle thousands in a reasonable amount of computing time.
Although relevant to machine learning and spatial data applications, the methods in the hetGP package target stochastic computer modeling applications, which often exhibit heterogeneous noise effects.Agent-based models are a good example.In that setting, the design of the experiment is at least as important as modeling.Here we introduced several sequential design schemes that work with hetGP model objects to organically grow the design toward accurate (low-variance) prediction, optimization, and the search for level sets.In all three scenarios, the scheme is able to balance exploration, exploitation, and replication as a means of obtaining efficient predictions for those targets.Extensions are provided to accommodate new sequential design acquisition strategies toward novel surrogate modeling and prediction applications.

Figure 3 Figure 3 :
Figure 3 shows the resulting predictive surface on a dense 2D grid.The left panel in the figure shows the predictive mean surface.The right panel shows the predicted standard deviation.Text overlaid on the panels indicates the location of the training data inputs and the number of replicates observed thereon.
= list(X0 = log10(thetas), Z0 = colMeans(log(bfs)), + mult = rep(nrow(bfs), ncol(bfs))), Z = log(as.numeric(bfs)),+ lower = 10^(-4), upper = 5, covtype = "Matern5_2") Predictive evaluations were then extracted on a grid in the input space.The results are shown in Figure 4.In the figure, each open circle is a BF StN evaluation, plotted in log 10 − log e space.In the left panel of the figure figure, in addition to showing full posterior uncertainty (on Y (x)|D N ), we highlight a key advantage of the heteroskedastic framework inferential framework implemented hetGP: the ability to provide error bars on both the prediction of the (latent random) mean (field) f .The right panel in the figure shows output provided by the plot method, utilizing leave-one-out formulas for GPs

Figure 4 :
Figure 4: Left: heteroskedastic TP fit to the Bayes factor data under exponential hyperprior.Right: output given by the plot method.

Figure 5 :
Figure 5: Heteroskedastic TP fit to the Bayes factor data under Gamma hyperprior.

Figure 6 :
Figure 6: Two hypothetical variance functions (left) and the resulting IMSPE surfaces (right) with corresponding minima marked by diamonds.The dashed grey line represents the replication threshold (left).

Figure 7 :
Figure 7: Sequential design with horizon h = 5.The truth is in black and the predictive distribution in red.

Figure 8 :
Figure 8: Left: Horizons chosen per iteration; right: final design and predictions versus the truth, similar to Figure 7.
ytrue <-f1d2(xgrid) R> yy <-fr(xgrid) R> rbind(rmse = c(h5 = mean((ytrue -p$mean)^2), + ha = mean((ytrue -p.a$mean)^2)), + score = c(h5 = -mean((yy -p$mean)^2 / pvar + log(pvar)), + ha = -mean((yy -p.a$mean)^2 / pvar.a+ log(pvar.a)))) For a larger example, let us revisit the ATO application from Section 3.1.Binois et al. (2019) described a sequential design scheme utilizing fixed, adaptive, and target horizon schemes.The ato data object loaded earlier contained a "hetGP"-class model called out.a that was trained with an adaptive horizon IMSPE-based sequential design scheme.The size of that design and the time it took to train are quoted by the output of the R code below.R> c(n = nrow(out.a$X0),N = length(out.a$Z),time = out.a$time)earlier experiment involved n = 1000 unique sites with an average of five replicates at each, for a total of about N = 5000 training data points.The training set here is much smaller, having N = 2000 at n = 1194 unique locations.Thus, the adaptive design has more unique locations but still a nontrivial degree of replication, resulting in many fewer overall runs of the ATO simulator.Utilizing the same out-of-sample test set from the previous score-based comparison, the code below calculates predictions and scores with this new sequential design.R> sc.a <-scores(out.a,Xtest = Xtest, Ztest = Ztest) R> c(batch = sc, adaptive = sc.a)batch adaptive 3.394957 3.616028

Figure 9
Figure 9 visualization similar to our IMSPE-based ones in Section 4.1.

Figure 9 :
Figure 9: Sequential optimization with horizon h = 5.The truth is in black and the predictive distribution in red .

Figure 10 :
Figure 10: Sequential contour finding with horizon h = 5.The truth is in black and the predictive distribution in red.