Using the knowledge gradient acquisition function in Bayesian optimization when searching for robust solutions

This article considers the use of Bayesian optimization to identify robust solutions, where robust means having a high expected performance given disturbances over the decision variables and independent noise in the output. A variant of the well-known knowledge gradient acquisition function is proposed specifically to search for robust solutions, with analytic expressions for uniformly and normally distributed disturbances. An empirical evaluation on a number of test problems demonstrates that the new acquisition function outperforms alternative approaches.


Introduction
The problem of expensive black-box optimization is considered where the goal is to identify a robust solution, robust meaning here having a high expected performance despite disturbances over the decision variables and noise in the output.
Many engineering optimization problems can only be formulated as black-box functions, e.g. if the problem involves a simulation to evaluate solutions.Such problems are challenging to optimize, in particular if the evaluation of a solution is costly or computationally expensive, as is typical in engineering.In such cases, the budget for function evaluations is very limited.Bayesian optimization (Frazier 2018) has been shown to be a powerful technique of black-box optimization, especially for problems with a limited budget of function evaluations.It is an optimization method at the interface of machine learning that iteratively builds or updates a surrogate model and uses an acquisition function to determine the solution that would give the biggest additional information value when being added to the current data set.
When optimizing complicated systems, it is not always practicable to guarantee that the implemented solution follows exactly the design specification, and sometimes the environment where the solution is deployed is unpredictable.For instance, in engineering, manufacturing tolerances mean that the produced solution may deviate from the designed solution.In such situations, a solution is usually sought that is not merely good but is also robust.
This article is aimed at finding solutions having a good expected performance even with the presence of disturbances of the decision variables and output noise.Essentially, this means that not only the solution should be good, but its neighbourhood should also have high average quality (Branke 1998).
One of the simple ways to evaluate the expected performance of a solution is to estimate it directly by sampling from the distribution of disturbances and averaging.However this is of course very computationally expensive, particularly in higher dimensional search spaces and when the output is noisy, since the method would need many samples for an accurate estimation.
In Le and Branke (2020), the robust Knowledge Gradient (rKG) acquisition function has been suggested for Bayesian optimization, an adaptation of a method for simulation optimization with input uncertainty (Pearce and Branke 2017).This article summarizes the previous work and makes the following original contributions: • The rKG is extended to normally distributed disturbances and an analytic expression for this case is derived.• The consistency of the algorithm is proved.
• The proposed algorithm is empirically compared with a recently published benchmark, NES_EP, on a number of additional test functions and it is demonstrated that it outperforms alternative state-of-the-art algorithms.
The article is organized as follows.It begin with an overview of related literature on robust optimization in Section 2. The problem is formally defined in Section 3 and the method for simulation optimization for robust solutions is explained in Section 4, followed by empirical tests and comparison with some benchmarks in Section 5. Finally, Section 6 consists of conclusions and suggestions for future work.

Literature review
Bayesian optimization has recently become a popular method for solving many kinds of problem involving costly-to-evaluate functions.In every iteration, a response surface (often called a surrogate model, emulator or metamodel) is fitted to the data collected so far, with Gaussian Processes (GPs) being the most commonly used metamodel.Based on the information given by the metamodel, the next sample is chosen, maximizing a so-called infill criterion or acquisition function.The most popular acquisition function is Expected Improvement (EI), first proposed by Mockus, Tiesis, and Zilinskas (1978) and popularized in Efficient Global Optimization (EGO) (Jones, Schonlau, and Welch 1998).Other examples are Upper Confidence Bound (Srinivas et al. 2010), Knowledge Gradient (Scott, Frazier, and Powell 2011) and Entropy Search (Hennig and Schuler 2012).A variant of EI with an adaptive exploration component has recently been proposed by Xu, Guo, and Saleh (2021).
There are a large number of publications regarding robust global optimization, but mostly based on evolutionary algorithms (Beyer and Sendhoff 2007).So far, Bayesian optimization for solving that problem has not been applied widely.Articles can be split into articles looking for worst-case robustness, where disturbance of the decision variable is defined as a compact set, and those that look for good expected performance given a probability distribution of disturbances, as in this article.There is also the related area of reliability based design optimization where the goal is to identify the best solution that remains feasible despite disturbances to the decision variables, see e.g.Yang et al. (2016).
Among the articles searching for solutions robust with respect to the worst case, Marzat, Walter, and Piet-Lahanier (2013) propose a combination of Bayesian optimization and an iterative relaxation procedure.The basic idea is to maintain a discrete set of disturbances, and alternate between seeking the robust optimal solution given the disturbance set, and seeking a new worst-case disturbance to be added to the disturbance set.In each of the two alternating optimization steps, expected improvement-based algorithms are used.Note that, in this article, it is not the decision variables that are disturbed, but the simulation model's input parameters (Pearce and Branke 2017).
ur Rehman, Langelaar, and Keulen (2014) and ur Rehman and Langelaar (2017) modify the EI acquisition function to the case of searching for a robust solution in the worst-case context.First, the nominal optimum is computed on the constructed metamodel.It is the solution with the best worst-case prediction, not the best found solution.Then the next sampling location is determined by a modified EI.For any position x in the design variable space, according to the metamodel, the worst-case x w in the disturbance region is identified, and then at that worst-case location x w the predicted performance distribution is used to evaluate the acquisition function.The solution is the robust optimum of the metamodel in the last iteration.The proposed approach appeared to be much more sample-efficient than the method from Marzat, Walter, and Piet-Lahanier (2013) in an empirical comparison on some benchmark problems.ur Rehman and Langelaar (2017) deviate from ur Rehman, Langelaar, and Keulen (2014) by considering constrained problems instead of unconstrained ones.Sanders et al. (2019) also identify the current best robust solution with respect to the worst case within the disturbance region, using the posterior mean of the constructed GP.A number of realizations (functions) are drawn from the fitted GP to determine the next disturbance region to sample in.The actual improvement over the current best solution for each realization is computed and the average of the improvements over all function realizations will be the overall expected improvement of a solution.The solution with the largest expected improvement is sought using an evolutionary algorithm.The next observation of the target function is made within the disturbance region of this solution, at the location with the largest variance as predicted by the Gaussian process.It is shown empirically on a number of benchmark problems that the method performs significantly better in comparison with the algorithm suggested by ur Rehman, Langelaar, and Keulen (2014).While the article focuses on worst-case performance, the authors mention that their algorithm, with a minor modification, could also be used for optimizing expected performance over a disturbance region.Nogueira et al. (2016) propose to approximate the expectation of the EI acquisition function given the distribution of disturbances.This so-called unscented expected improvement is computed using the unscented transformation (Julier and Uhlmann 2004).
Two very recent articles that optimize expected performance over the input disturbances have been published by Fröhlich et al. (2020) andIwazaki, Inatsu, andTakeuchi (2020).The first article considers the same problem as in this article, but adapts the Entropy Search acquisition function (Hennig and Schuler 2012).It is shown to outperform the unscented expected improvement by Nogueira et al. (2016).The latter adapts the Upper Confidence Bound acquisition function (Srinivas et al. 2010) to search for a robust solution in a multi-objective context considering both mean and variance.Thus it is similar to this work when focusing on only the mean.Both articles assume normally distributed disturbances, whereas the rKG algorithm is based on the KG idea and tested with both uniformly and normally distributed disturbances.
The method is related to the method proposed for simulation optimization with input uncertainty (Pearce and Branke 2017) or optimizing expensive integrands (Toscano-Palmerin and Frazier 2018) but adapted to the case of searching for robust solutions.

Problem definition
Let black-box function f be defined over a compact input domain X ⊂ R D .There are two kinds of uncertainty, called noise and disturbance.Observing f at a solution x is noisy, i.e. y = f (x) + with ∼ N (0, σ 2 ).The white noise is assumed to have constant variance across the input domain.Additionally, the disturbance δ is distributed around each solution x ∈ X with a probability distribution P [δ].While in this article the approach is tested with uniformly and normally distributed disturbances, it also applies to other distributions of the disturbance, although this may require reverting to Monte Carlo estimation rather than analytical derivation of the acquisition function.The goal is to maximize the expected performance given noise and disturbances, i.e. to maximize the following robustness function: ( 1 ) The Opportunity Cost (OC) is used as the measurement for the quality of the method.OC is the difference between the value of the robustness function at the true optimal robust solution and the solution that the algorithm recommends.Let x r denote the final solution returned by the algorithm, the opportunity cost is then ( 2 ) It is assumed that the latent function f can be approximated by a Gaussian process reasonably well.
There is a limited budget of N noisy observations of function f (N samples).The algorithm can choose the solution x n at each iteration n to be sampled, dependent on the information collected so far.The goal is to sample solutions in such a way that minimizes the OC of the solution x r that is returned at the end of optimization.The algorithm therefore needs to define where to sample next (i.e.what acquisition function to use) and what solution to return at the end.

Methodology
The method proposes to use Bayesian optimization, and in particular a variant of the KG acquisition function (Scott, Frazier, and Powell 2011), as an efficient way to search for robust solutions.The method uses a surrogate model and a novel acquisition function to define the next sample iteratively.
The surrogate model and the acquisition function are described in the following subsections.

Gaussian process
A surrogate model is also called an emulator, metamodel or response surface.There are several choices, such as a Gaussian process or Tree Parzen estimator.In this approach, a Gaussian process is chosen following the majority of the literature.A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams 2006).
It is characterized by its mean function μ 0 (x) and kernel (covariance function) 0 (x, x ), where The widely used constant mean function and squared-exponential kernel are chosen, i.e.
The choice of other mean functions and kernels is discussed by Rasmussen and Williams (2006).
Given the vector of observations f (x T at the vector of points x 1:n = (x 1 , . . ., x n ) T , the posterior mean μ n and posterior covariance n can be computed as follows: where I n is the identity matrix of size n and 0 (x, is also called the covariance matrix and is positive semidefinite.Maximum likelihood is usually used for tuning the hyperparameters of the model, which are signal variance α 0 , lengthscale l x and noise variance σ .

Standard knowledge gradient
In Bayesian optimization, at each iteration, the point maximizing a so-called acquisition function is chosen as the next sampling point.One of the most widely used acquisition functions is the Knowledge Gradient (KG).Frazier, Powell, and Dayanik (2009) introduced KG for optimizing over a discrete and finite set which maximizes the expected improvement of the maximal value of the posterior mean conditioned on sampling once more at a specific point.
Let F n denote the sigma-algebra generated by the data collected so far {(x 1 , f (x 1 )), . . ., (x n , f (x n ))}.Denoting by μ n (x) the posterior mean after n already taken samples {x 1 , x 2 , . . ., x n } and with the assumption that the next sample x n+1 will be at x, KG can be written as Frazier, and Powell (2011) present an extension of KG for correlated beliefs, KG for continuous parameters, which can be approximated by maximization over a finite subset of the input space.For instance, by discretizing X over a subset X n X = {x 1 , . . ., x n X } ⊂ X, KG n can be approximated as follows: (5) where Z ∼ N (0, 1) and

The Direct Robustness Approximation (DRA)
Perhaps the simplest way to apply Bayesian optimization to find the robust solution solving (1) is to estimate F(x) by sampling over δ and to apply standard acquisition functions such as the standard KG.To reduce the impact of disturbances and noise, every observation can be averaged over k independent replications.This method is called Direct Robustness Approximation (DRA(k)) (Le and Branke 2020).The idea is to approximate the robustness function F(x) directly by the GP and each observation with random disturbance and output noise is taken as a sample of F.
The GP model for DRA(k) has always to allow for observation noise even if the underlying function f is deterministic, as observations are still stochastic owing to the random input disturbance.The method returns the solution with the best posterior mean of the approximated robustness function (Le and Branke 2020).
Because each observation is an average over multiple samples, the method is computationally expensive.The estimate of the solution quality can be improved if the k samples are drawn by Latin Hypercube Sampling (LHS) (McKay, Beckman, and Conover 1979) rather than random sampling (Le and Branke 2021).

Robust knowledge gradient
The robust Knowledge Gradient (rKG) is proposed as an acquisition function to identify robust solutions.The key insight is that disturbances can usually be controlled during optimization (e.g. when running a simulation model), and only the final solution is subject to disturbances.rKG thus uses a GP to approximate the latent f, not the robustness function F, and estimates values for F from evaluating f at locations without disturbance.
From the estimation of the underlying function f with μ n (x), an approximation of the robustness function F by the robust mean is brought forward: The rKG is then simply the expected value of the increase in maximal value of the posterior robust mean conditioned on sampling once more at a specific location.So the main difference between the standard KG and the rKG is that, instead of the posterior mean, the change in the posterior robustness mean from sampling a solution is looked at.Hence rKG after n samples for continuous parameters can be written as follows: Similarly to the standard KG, the conditional mean can be rewritten as where The robust KG, rKG(x), still has a formula similar to (5) but with the adjusted components With squared-exponential kernel and constant prior mean μ 0 , the analytical formulae for each element M i and ˜ i can be derived for two of the most frequently observed distributions of disturbances: uniform and normal.Each element in M i and ˜ i can be derived analytically using Equations ( 3) and (4).
For each i = 1, . . ., n X , from (3) and (6) and from ( 4) and (7) For a D-dimensional input, the k th element of the vector in ( 9) can be computed as follows: (a) for a uniformly distributed disturbance U(− , ), .
Details of the derivation are described in Appendix A of the online supplemental data for this article, which can be accessed at https://doi.org/10.1080/0305215X.2022.2145604.The expectation in (8) can be computed using Algorithm 1 in Scott, Frazier, and Powell (2011).
At the end of optimization, a GP model over all sampled points is fitted and the point that maximizes the robustness performance of that model is the solution to return, i.e.
x N r = arg max x M N (x).
Algorithm 1 summarizes the details of rKG method of determining sampling points.
Algorithm 1: Pseudo-code for robustness optimization Place a Gaussian process prior on f Update distribution on f at n 0 initially sampled points.
Set the number of sampled points n to n 0 .while n ≤ N do Fit a GP on n samples.Identify the point with the largest value of rKG.Add it to the set of sampled points and increase n by 1. Result:

Consistency of the algorithm
It is worth noting that the algorithm is myopically optimal by construction.It can be proved that the rKG algorithm finds the true robust optimum x OPT on a compact domain X, given an infinite budget.
Theorem 4.1: If the input space is compact, the noise variance of the output is strictly positive across the input domain, f can be modelled by a GP, and with the assumption that the solution returned by the rKG algorithm converges to the true robust optimum, given an infinite budget of evaluations.
The theorem is based on the fact that rKG at any location converges to zero given an infinite budget of evaluations.This fact is proved combining the techniques of proving the convergence in Pearce, Poloczek, and Branke (2019) and for the Knowledge Gradient for Continuous Parameters (KGCP) in Scott, Frazier, and Powell (2011).Details of the proof are given in Appendix B of the online supplemental data.

Experiments
Robust KG is tested on several one-and two-dimensional benchmark problems with both uniformly and normally distributed disturbances, and compared with several alternative approaches.

Experimental setup
Initial points are sampled by LHS, the number of points is five times the number of dimensions, except for the rocket simulation, which is difficult to model by a GP and thus requires a larger number of initial points.For uniformly distributed cases, the total budget is 75, for normally distributed cases it is 50 samples for one-and two-dimensional functions and 100 for three-and four-dimensional functions.Unless stated otherwise, at each step, the HyperParameters (HP) of the GP are tuned by maximizing the marginal likelihood.Functions from the TensorFlow (TF) library (Abadi et al. 2016) are used to compute the log marginal likelihood, including the Cholesky decomposition and solving triangular systems.The maximization of the marginal likelihood, as well as of the acquisition function, starts by evaluating some random points, followed by gradient ascent from the best points using L-BFGS-B (Zhu et al. 1997) to find the maximum.
All experimental results are averaged over 100 independent runs for one-dimensional problems and 25 independent runs for two-dimensional problems.OC figures show the mean plus and minus one standard error.

Benchmark functions
The following nine test functions are used, summarized in Table 1 and visualized in Figures 1 and 2.
(5) The one-dimensional Reproducing Kernel Hilbert Space (RKHS) function from Nogueira et al. (2016), which is to be minimized.( 6 (7) A two-dimensional gravity assisted rocket trajectory simulation model from Fröhlich et al. (2020), where the goal is to identify an angle α 0 and launch speed ν 0 that minimize a cost term proportional to the launch speed and the distance of the resulting trajectory from the target planet.The two terms are combined in a single objective function J(α 0 , ν 0 ) = log 10 (d target + βν 0 ), with β being a parameter that trades-off the two terms.The disturbance for ν 0 and α 0 is ν = 0.05 and α = 3 o , respectively.(9) The piston design example, adapted from Arendt, Apley, and Chen ( 2013), with more details in Section 5.7.
All functions have multiple local optima with the optimum of the undisturbed function being different from the robust optimum given disturbances.
In case the disturbance distribution is uniform, it is capped at the boundary of the search space, i.e.
where lb and ub are the lower and upper bounds of the search space, respectively.Unless stated otherwise, the output noise is set to have standard deviation 0.1, i.e. ∼ N (0, 0.1 2 ), although some experiments are also run without output noise.In the case of normally distributed disturbances, unless stated otherwise, the standard deviation of the output noise is set, following Fröhlich et al. (2020), to 0.001, i.e. ∼ N (0, 0.001 2 ).

Benchmark methods
The rKG algorithm is compared with the following three alternative approaches.
• The Direct Robustness Approximation (DRA) at each sample location as described in Section 4.2.2.• Uniform allocation (UA), which allocates the samples by LHS (McKay, Beckman, and Conover 1979) rather than sequentially by an acquisition function.Apart from the acquisition function, the algorithm is identical to the algorithm used with rKG, i.e. it builds a GP from the data collected on the undisturbed function f and returns the solution with best robust posterior mean x N r = arg max x M N (x).• Noisy Entropy Search-Expectation Propagation (NES-EP): an approach proposed by Fröhlich et al. (2020) that uses ES (Hennig and Schuler 2012) as acquisition function.As NES-EP has only been derived for normally distributed disturbances, it is only used as a benchmark on such test problems.
Note that NES-EP as proposed by Fröhlich et al. (2020) uses a different HP tuning algorithm based on GPy (2012).Because of primary interest in the relative performances of the acquisition functions, the original HP tuning in NES-EP has been replaced by the same method used for rKG, i.e. acquisition functions compared using the same HP tuning method.However, to explore the impact of the HP tuning method, some experiments where rKG and NES-EP use the HP tuning of the original NES-EP algorithm based on GPy are also run.These algorithms are then denoted as rKG_GPy and NES-EP_GPy.

One-dimensional test function f 1
For test function f 1 (x), the robust optimum is approximately at x = 1.22.However, optimization of f using standard KG would return a solution with lower robustness performance near x = 1.873.
The experiments are started by examining parameter k of DRA(k), the number of replications to average over for a single observation.The bigger k, the smaller observation noise would be, but the more computational time per iteration.The result is shown in Figure 3(a), which plots OC over the total number of function observations so far.
With five initial solutions, DRA with k = 1 only needs five evaluations to initialize, whereas for example the run with k = 7 requires 5 × 7 = 35 evaluations to initialize.Thus, larger k means a delayed use of the acquisition function, but towards the end, the runs with k ∈ {3, 5, 7} find a slightly better solution compared to k = 1.It can be concluded that, to evaluate a single solution, it is better to use small values of k and the noise handling capability of the GP rather than many replications (bigger k), at least for this test problem.While this would have been expected for problems with just output noise, this was not so clear for problems with disturbance of decision variables, as the disturbance leads to heteroscedastic observation noise.
Figure 3(b) compares the results of rKG, UA, DRA(1) and DRA(5).Visibly, the rKG method performs better than all other methods.UA is not far behind on this simple function whereas throughout the run DRA is significantly worse.It is interesting to notice that, after the initial five samples, DRA(1) already has a worse opportunity cost, using the same information as rKG.It seems that making the relationship between f and F explicit by learning f and deriving F through integration leads to a better model than approximating F directly.
Finally, the algorithms are tested on a deterministic version of f 1 .Similarity is observed in the results of this case and the case of noisy observation, thus an additional figure was not included.It seems the implicit averaging of the kernel function makes all approaches very robust to additional output noise.

More complicated one-dimensional test function f 2
This is a minimization problem taken from Paenke, Branke, and Jin (2006).The test function has several local minima, concentrated on the left side of the input domain.Results are similar to those of test function f 1 , although UA converges more slowly compared to rKG.

Simple two-dimensional test function f 3
For this two-dimensional benchmark problem, 10 initial sample points are used, which means DRA(5) would need 50 evaluations just to initialize, which renders this method unsuitable.Therefore only rKG, UA and DRA(1) are compared.
The opportunity cost over the number of evaluations is shown in Figure 4(b).While the rKG acquisition function once more quickly converges to the correct robust optimum (opportunity cost of zero), UA and DRA(1) are significantly slower.

Noisy simple one-dimensional test function f 4
For this test problem, the output noise standard deviation is set to 0.1, i.e. ∼ N (0, 0.1 2 ).As shown in Figure 5, rKG is slightly better than NES-EP and both converge to a good solution, whereas DRA(1) does not even converge to good solutions and the opportunity cost is almost a flat line.It seems DRA(1) doesn't work well in the case of normally distributed disturbance, thus this approach is not tested with other test problems.UA is a bit better than DRA(1) but still does not return a good solution.

RKHS function
The RKHS function is tested with small output noise (σ = 0.001) and large output noise (σ = 0.1).As shown in Figure 6, the rKG approach is much better than the benchmarks for both levels of output noise.Similarly to the test with function f 4 , UA cannot return a good solution.Since UA allocates the samples uniformly, it is not able to focus on the region of the robust optimum.For higher dimensional  cases, UA requires many more samples (exponential in the number of dimensions) to have a good GP fit.Therefore this method is not tested any further.

Two-dimensional test function f 3 for normally distributed disturbance
In order to be able to compare with NES-EP, normally distributed noise is also used with test function f 3 .As can be seen in Figure 7(a), rKG works better than NES-EP for this problem.

Gaussian mixture model
As depicted in Figure 7(b), for this problem, rKG is comparable in the first 30 evaluations but then better than NES-EP towards the end.

Gravity assist manoeuvre
The proposed method is applied to the problem of the gravity assist manoeuvre as described in Section 4.3 of Fröhlich et al. (2020).As can be seen in Figure 8(a), rKG does not outperform NES-EP although both solutions are relatively poor and the error bars are pretty big.This may be because this problem has some very sharp peaks that are difficult to model with a squared exponential kernel.This issue and the dependence on HP tuning is explored more deeply in Section 5.6.

Hartmann 3D
As depicted in Figure 9(a), rKG is similar to NES-EP in the first 35 evaluations but then NES-EP converges prematurely to an inferior solution.

Results with GPy HP tuning
So far, the HP tuning method with the TF library (Abadi et al. 2016) has been used.As NES-EP uses GPy HP tuning as default, the impact of the HP tuning procedure is also tested on the relative comparison.Since GPy requires prior values, the same set of prior values given by Fröhlich et al. (2020) is used.

RKHS function with small noise and HP tuning using GPy
As shown in Figure 10(a), rKG and NES-EP perform similarly, but NES-EP loses at the end owing to few outlier runs with very poor results.This is still consistent with the results reported in Fröhlich et al. (2020) if the OC is plotted with median and percentile as shown in Figure 10(b).

GMM function with HP tuning using GPy
As can be seen in Figure 11(a), once again rKG and NES-EP are similar, but NES-EP loses at the end owing to few outlier runs with very poor results.Again, results are consistent with the article by Fröhlich et al. (2020) if the OC is plotted with median and percentile as shown in Figure 11(b).

Hartmann 3D
As depicted in Figure 9(b), rKG is slower midway but catches up and converges to a similarly good solution as NES-EP.

Gravity assist
Gravity assist was the only test problem where NES-EP performed better than rKG.However, as noted above, this problem has some very sharp peaks in an otherwise smooth area, and thus is not suitable to be modelled by a GP with squared exponential kernel.A poor HP choice will substantially disrupt GP-based algorithms.This may be somewhat alleviated by choosing a good prior and optimizing HPs using GPy.Thus the pairing of rKG_GPy and NES-EP_GPy is also compared.
As can be seen in Figure 8(b), performance of rKG_GPy and NES-EP_GPy is similar for the first 10 new evaluations.After that, rKG_GPy performs better, albeit the opportunity cost went back up a little bit towards the end.This supports the conjecture that the relatively poor performance of rKG on the gravity assist problem observed in Section 5.5.5 is due to a model mismatch.

Function f 3 for dependency on the choice of hyperparamenter prior
As the results of NES-EP_GPy are noticed to be quite sensitive to the choice of prior values of HPs, the simple two-dimensional function f 3 is tested with varying values of prior lengthscale and fixed signal variance or varying prior signal variance and fixed true lengthscale to see how the rKG method and the benchmark method depend on the choice of HPs.The fixed values of lengthscale and signal variance have been learned by maximum likelihood with a meshgrid of 6400 points (lengthscale = 0.316,639,81, signal variance = 342.074,631,937,6998).The mean of opportunity cost after 25 independent runs is tested, using the acquisition function rKG and NES-EP_GPy with the same method of HP tuning using GPy. Figure 12 shows that the rKG method is notably less sensitive to the setting of the prior.

Real-world application
Normally distributed disturbances of solution variables have an infinite support (they are not bounded).That means the integration limits in the calculation of the "expected performance" robustness are the whole input space, irrespective of the constraints on the solution space.However, in practice it is necessary to consider the reliability as it does not make sense if the disturbed solution (solution with disturbance) lies outside the region of feasibility.One way to deal with this is to assign  all solutions outside the region of feasibility a value of zero.The area where disturbed samples take non-zero values depends on the location of the solution in the feasibility region.Therefore, the limits are adjusted accordingly and the robustness with the adjusted limits is called 'Robustness with Reliability'.Note that, with the adjustment, the robust optimum solution is less likely to be at the edge of the feasibility region.An example of robustness with and without reliability is depicted in Figure 13(a).
The application of the rKG and 'Robustness with Reliability' are demonstrated by the piston design example that was analysed in Arendt, Apley, and Chen (2013).Owing to the expensive computational cost of the simulation used for the automotive piston design, whose true response surface is unknown, a GP built over the data set of 200 equally spaced points is used as a replacement (Arendt, Apley, and Chen 2013).The posterior mean of this GP is a function of four-dimensional design variable d and two-dimensional noise variable w.The data used in this experiment are normalized.More details can be found in Arendt, Apley, and Chen (2013).In this work, the problem is adapted, replacing the additional noise variables by the disturbance of the decision variables.It aims to search for the robust maximum of the function of four-dimensional design variables d, which is the posterior mean when the noise variable is fixed to the mean value (w = [0.5, 0.5]).The disturbance is distributed normally with zero mean and variance 2 = 0.05 2 around each design variable.Robustness without reliability would yield a robust optimum solution on the edge at d = [0.807,302,15,1.0, 0.0, 0.0], whereas robustness with reliability returns a solution within the region of feasibility d * = [0.791,864,73, 0.856,594,09, 0.162,340,84, 0.150,678,62].
Since NES-EP seeks for a robust solution without reliability, it cannot be used in this case.Therefore rKG is compared with UA.It starts with 20 initial samples, chosen by LHS, and finishes after 80 iterations and averages the results over 8 runs.Figure 13(b) shows that rKG returns a solution close to the true optimum much faster and more consistently (fewer fluctuations) compared to UA.

Conclusion
In many real-world optimization problems, disturbances to the decision variables, for instance manufacturing tolerances, may affect the solutions.It becomes crucial to search for a robust solution in order to reduce the sensitivity to the disturbances and achieve a high expected performance.
This article introduced an algorithm based on the knowledge gradient idea for efficiently searching for the robust optimum of expensive-to-evaluate functions, adapting the technique used in Pearce and Branke (2017) and Toscano-Palmerin and Frazier (2018) for Bayesian optimization with input uncertainty.The article suggests a novel acquisition function, called the robust Knowledge Gradient (rKG), used for iteratively identifying the next point to sample.It has been demonstrated that, in the case of uniformly distributed disturbance, with a much lower number of required function evaluations, rKG can obtain the same solution quality as alternative approaches much more efficiently.And for the case of normally distributed disturbance, the rKG method has been shown to be at least comparable to benchmark methods.Moreover, it can deal with output noise better and is less sensitive to the hyperparameter prior.
Future work entails several avenues.The approach should be tested with more problems, especially higher dimensional and real-world problems.Also, it should be tested with other measures of robustness where a quantile or the worst case is of interest rather than the expected performance.

Figure 3 .
Figure 3. OC over number of evaluations for f 1 .

Figure 5 .
Figure 5. OC over number evaluations of f 4 .

Figure 6 .
Figure 6.OC over number of evaluations of the RKHS function in (a) small noise case and (b) large noise case.

Figure 7 .
Figure 7. OC over number of evaluations for (a) f 3 with normally distributed disturbance, (b) GMM.

Figure 8 .
Figure8.OC for the gravity assist problem when (a) using the TF approach for HP tuning, (b) using GPy for HP tuning.

Figure 9 .
Figure9.OC for Hartmann 3D when (a) using the TF approach for HP tuning, (b) using GPy for HP tuning.

Figure 10 .
Figure 10.OC over number of evaluations of RKHS function with small noise and GPy used for HP tuning.(a) Mean and error bars and (b) Median and percentiles.

Figure 11 .
Figure 11.Plot of OC over number of evaluations of the GMM function.(a) Mean and error bars and (b) Median and percentiles.

Figure 13 .
Figure 13.(a) Robustness with and without reliability compared on test function f 4 ; (b) opportunity cost over number of evaluations of piston design.

Table 1 .
Benchmark functions for experiments.