Achieving Robustness to Aleatoric Uncertainty with Heteroscedastic Bayesian Optimisation

Bayesian optimisation is a sample-efficient search methodology that holds great promise for accelerating drug and materials discovery programs. A frequently-overlooked modelling consideration in Bayesian optimisation strategies, however, is the representation of heteroscedastic aleatoric uncertainty. In many practical applications it is desirable to identify inputs with low aleatoric noise, an example of which might be a material composition which consistently displays robust properties in response to a noisy fabrication process. In this paper, we propose a heteroscedastic Bayesian optimisation scheme capable of representing and minimising aleatoric noise across the input space. Our scheme employs a heteroscedastic Gaussian process (GP) surrogate model in conjunction with two straightforward adaptations of existing acquisition functions. First, we extend the augmented expected improvement (AEI) heuristic to the heteroscedastic setting and second, we introduce the aleatoric noise-penalised expected improvement (ANPEI) heuristic. Both methodologies are capable of penalising aleatoric noise in the suggestions and yield improved performance relative to homoscedastic Bayesian optimisation and random sampling on toy problems as well as on two real-world scientific datasets. Code is available at: \url{https://github.com/Ryan-Rhys/Heteroscedastic-BO}


Introduction
Bayesian optimisation is proving to be a highly effective search methodology in areas such as drug discovery [1,2,3], materials discovery [4,5,6], chemical reaction optimisation [7,8,9], robotics [10], sensor placement [11], tissue engineering [12] and genetics [13]. Heteroscedastic aleatoric noise, however, is rarely accounted for in these settings despite being an important consideration for real-world applications. Aleatoric uncertainty refers to uncertainty inherent in the observations (measurement noise) [14]. In contrast, epistemic uncertainty corresponds to model uncertainty and may be explained away given sufficient data. Heteroscedastic aleatoric noise refers to aleatoric noise which varies across the input domain and is a prevalent feature of many scientific datasets; perhaps surprisingly, not only experimental datasets, but also datasets where properties are predicted computationally. One such source of heteroscedasticity in the computational case might be situations in which the accuracy of first-principles calculations deteriorates as a function of the chemical complexity of the molecule being studied [15]. In Figure 1 we illustrate real-world sources of heteroscedasticity using the FreeSolv dataset of [16]. The consequences of misrepresenting heteroscedastic noise as being homoscedastic, i.e. constant across the input domain, are illustrated in Figure 2, which compares homoscedastic and heteroscedastic GP fits to the soil phosphorus fraction dataset of [17]. The homoscedastic model can underestimate noise in certain regions of the input space, which in turn could induce a Bayesian optimisation scheme to suggest values possessing large aleatoric noise. In an application such as high-throughput virtual screening [18] the cost of misrepresenting noise during the screening process could lead to a substantial loss of time in material fabrication [19]. In this paper we present a heteroscedastic Bayesian optimisation algorithm capable of both representing and minimising aleatoric noise in its suggestions. Our contributions are: (1) The introduction of a novel combination of surrogate model and acquisition function designed to minimise heteroscedastic aleatoric uncertainty.
(2) A demonstration of our scheme's ability to outperform naive schemes based on homoscedastic Bayesian optimisation and random sampling on toy problems as well as two real-world scientific datasets.
(3) The provision of an open-source implementation.
The paper is structured as follows: section 2 introduces related work on heteroscedastic Bayesian optimisation. Section 3 provides background on Bayesian optimisation and homoscedastic GP surrogate models. Section 4 provides background on the heteroscedastic GP surrogate model used in this work and introduces the novel HAEI and ANPEI acquisition functions. Section 5 considers experiments on synthetic and scientific datasets possessing heteroscedastic noise where the goal is to be robust to, i.e. minimise, aleatoric noise in the suggestions. Section 6 presents an ablation study on noiseless tasks as well as tasks with homoscedastic and heteroscedastic noise in order to determine whether there is a detrimental effect to using a heteroscedastic surrogate when the noise properties of the problem are a priori unknown. Section 7 concludes with some limitations of the approach presented as well as fruitful sources for future work.

Related Work
The most similar work to our own is that of [20], where experiments are reported on a heteroscedastic Branin-Hoo toy function using the variational heteroscedastic Gaussian process (GP) approach of [21]. This work defines and optimises a robustness index, making a compelling case for penalisation of aleatoric noise in real-world Bayesian optimisation problems. A modification to expected improvement (EI), expected risk improvement, is introduced in [22] and is applied to problems in robotics where robustness to aleatoric noise is desirable. In this framework, however, the relative weights of performance and robustness cannot be tuned [20]. [23,24] implement heteroscedastic Bayesian optimisation but do not introduce an acquisition function that penalises aleatoric noise. [25,26] consider the related problem of safe Bayesian optimisation through implementing constraints in parameter space. In this instance, the goal of the algorithm is to enforce a performance threshold for each evaluation of the black-box function. Recently, the winners of the 2020 NeurIPS Black-Box Optimisation Competition applied non-linear output transformations in their solution to tackle heteroscedasticity. The authors, however, are not interested in explicitly penalising aleatoric noise in this case. In terms of acquisition functions, [27,28] propose principled approaches to handling aleatoric noise in the homoscedastic setting that could be extended to the heteroscedastic setting. Our primary focus in this work, however, is to highlight that heteroscedasticity in the surrogate model is beneficial and so an examination of a subset of acquisition functions is sufficient for this purpose. We take the opportunity here to note earlier unpublished workshop versions of this paper which consider the same problem [29,30].

Bayesian Optimisation
Bayesian optimisation [31,32,33] solves the global optimisation problem defined as

x* = arg min_{x ∈ X} f(x),

where x* is the global optimiser of a black-box function f : X → Y. X is the design space and is typically a compact subset of R^d. What makes this optimisation problem practically relevant in applications are the following properties: (i) Black-Box Objective: We do not have the analytic form of f. We can however evaluate f pointwise anywhere in the design space X.
(ii) Expensive Evaluations: Choosing an input location x and evaluating f (x) takes a very long time.
(iii) Noise: The evaluation of a given x is a noisy process. In addition, this noise may vary across X , making the underlying process heteroscedastic.
We have a dataset D = {(x_i, t_i)}_{i=1}^n consisting of observations of the black-box function f and fit a probabilistic surrogate model to these datapoints. We then leverage the predictive mean as well as the uncertainty estimates of the surrogate model to guide the acquisition of the next data point x_{n+1} according to a heuristic known as an acquisition function. In Bayesian optimisation, exact GPs are the most popular choice of surrogate model because of their ability to represent posterior uncertainty without resorting to approximate Bayesian inference.
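The data-collection loop just described can be sketched in a few lines. This is a generic illustration, not the paper's implementation: `surrogate_fit` and `acquisition` are hypothetical placeholders for any surrogate model and heuristic, and a random candidate set stands in for a proper acquisition optimiser.

```python
import numpy as np

def bayes_opt_loop(f, surrogate_fit, acquisition, bounds, n_init=5, n_iter=10, seed=0):
    """Generic Bayesian optimisation loop (sketch).

    f             -- noisy black-box objective
    surrogate_fit -- fits data so far, returns a (mu, var) prediction function
    acquisition   -- scores candidate points given (mu, var) and the incumbent
    """
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(n_init, 1))       # initial design
    t = np.array([f(x) for x in X]).reshape(-1, 1)
    for _ in range(n_iter):
        predict = surrogate_fit(X, t)               # refit surrogate to all data
        cand = rng.uniform(lo, hi, size=(256, 1))   # random candidate set
        mu, var = predict(cand)
        scores = acquisition(mu, var, incumbent=t.min())
        x_next = cand[np.argmax(scores)][None, :]   # maximise the acquisition
        X = np.vstack([X, x_next])
        t = np.vstack([t, [[f(x_next.ravel())]]])
    return X, t
```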

Gaussian Processes
In the terminology of stochastic processes we may formally define a GP as follows: a Gaussian process [34] is a collection of random variables, any finite number of which have a joint Gaussian distribution.
GPs can be used to set a prior over functions in Bayesian modelling applications. In this setting, the random variables consist of function values f(x) at different locations x within the design space. The GP is characterised by a mean function

m(x) = E[f(x)]

and a covariance function

k(x, x') = E[(f(x) − m(x))(f(x') − m(x'))].

The process is written

f(x) ∼ GP(m(x), k(x, x')).

In our experiments, the prior mean function will be set to the empirical mean of the data. The covariance function or kernel computes the pairwise covariance between two random variables (function values). The covariance between a pair of output values f(x) and f(x') is a function of an input pair x and x'. As such, the kernel encodes smoothness assumptions about the latent function being modelled. The most widely-utilised kernel is the squared exponential (SE) kernel

k(x, x') = σ_f^2 exp(−||x − x'||^2 / (2ℓ^2)),   (1)

where σ_f^2 is the signal amplitude hyperparameter (vertical lengthscale) and ℓ is the (horizontal) lengthscale hyperparameter. Although Equation 1 is written with a single lengthscale shared across dimensions, for multidimensional input spaces we optimise a lengthscale per dimension. For consistency, we use the squared exponential kernel in all experiments reported in the main paper. In Appendix C we compare the performance of different kernels on a set of synthetic optimisation functions. For a more detailed introduction to GPs the reader is referred to [34].
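As an illustration, a minimal NumPy implementation of the SE kernel and the resulting GP posterior might look as follows. These are the textbook equations rather than the paper's own code, and the jitter/noise handling is a simplification.

```python
import numpy as np

def sq_exp_kernel(X1, X2, sigma_f=1.0, lengthscale=1.0):
    """Squared exponential kernel k(x, x') = sigma_f^2 exp(-||x - x'||^2 / (2 l^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale**2)

def gp_posterior(X, t, X_star, noise=1e-2, **kern):
    """Homoscedastic GP posterior mean and variance at test points X_star."""
    K = sq_exp_kernel(X, X, **kern) + noise * np.eye(len(X))
    K_s = sq_exp_kernel(X, X_star, **kern)
    K_ss = sq_exp_kernel(X_star, X_star, **kern)
    L = np.linalg.cholesky(K)                          # stable solve via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))
    v = np.linalg.solve(L, K_s)
    mu = K_s.T @ alpha
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mu.ravel(), var
```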

Heteroscedastic Bayesian Optimisation
We wish to perform Bayesian optimisation whilst minimising input-dependent aleatoric noise. In order to represent input-dependent aleatoric noise, a heteroscedastic surrogate model is required.

The Most Likely Heteroscedastic Gaussian Process
We adopt the most likely heteroscedastic Gaussian process (MLHGP) approach of [35], and for consistency, we use the same notation as the source work in our presentation. We have a dataset D = {(x_i, t_i)}_{i=1}^n in which the target values t_i have been generated according to t_i = f(x_i) + ε_i. We assume independent Gaussian noise terms ε_i ∼ N(0, σ_i^2) with variances given by σ_i^2 = r(x_i). In the heteroscedastic setting r is typically a non-constant function over the input domain x. In order to perform Bayesian optimisation, we wish to model the predictive distribution P(t* | x*_1, ..., x*_q) at the query points x*_1, ..., x*_q. Placing a GP prior on f and taking r(x) as the assumed noise function, the predictive distribution is multivariate Gaussian N(µ*, Σ*) with mean

µ* = K*(K + R)^{-1} t

and covariance

Σ* = K** + R* − K*(K + R)^{-1} K*^T,

where K is the n × n training covariance matrix, K* is the q × n matrix of covariances between the query points and the training points, K** is the q × q covariance matrix of the query points, t = (t_1, ..., t_n)^T, R = diag(r) with r = (r(x_1), r(x_2), ..., r(x_n))^T, and R* = diag(r*) with r* = (r(x*_1), r(x*_2), ..., r(x*_q))^T.
The MLHGP algorithm [35] executes the following steps: (i) Estimate a homoscedastic GP, G_1, on the dataset D. (ii) Given G_1, estimate the empirical noise levels z_i for the training data using samples from the predictive distribution induced by G_1 at each x_i, giving rise to an auxiliary dataset D' = {(x_i, z_i)}_{i=1}^n. (iii) Estimate a second GP, G_2, on D'.
(iv) Estimate a combined GP, G_3, on D using G_2 to predict the logarithmic noise levels r_i.
(v) If not converged, set G_1 to G_3 and repeat.
In essence, the defining characteristic of the MLHGP approach is that G 1 learns the latent function and G 2 learns the noise function.
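The EM-like loop can be sketched in NumPy as below. This is a simplified reading of the MLHGP steps rather than the authors' implementation: kernel hyperparameters are fixed instead of optimised, and the empirical noise estimator is a basic sampled mean squared deviation; the helper names are our own.

```python
import numpy as np

def _kern(A, B, l=1.0, sf=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sf**2 * np.exp(-0.5 * np.maximum(d2, 0.0) / l**2)

def _gp_predict(X, t, Xs, R):
    """GP predictive mean/variance with per-point noise covariance R."""
    K = _kern(X, X) + R
    Ks, Kss = _kern(X, Xs), _kern(Xs, Xs)
    mu = Ks.T @ np.linalg.solve(K, t)
    var = np.diag(Kss) - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu.ravel(), np.maximum(var, 1e-12)

def mlhgp_fit(X, t, n_iter=5, n_samples=100, base_noise=1e-2, seed=0):
    """EM-like MLHGP loop (sketch): one GP models the latent function,
    a second GP models the log noise levels."""
    rng = np.random.default_rng(seed)
    n = len(X)
    r = np.full(n, base_noise)                  # current noise-level estimates
    for _ in range(n_iter):
        # steps (i)/(iv): fit the (combined) GP with the current noise levels
        mu, var = _gp_predict(X, t, X, np.diag(r))
        # step (ii): empirical noise via samples from the predictive distribution
        samp = rng.normal(mu[:, None], np.sqrt(var)[:, None], size=(n, n_samples))
        z = np.log(np.mean((t.ravel()[:, None] - samp)**2, axis=1) + 1e-12)
        # step (iii): fit a GP to the centred log noise levels and predict r
        z_mu, _ = _gp_predict(X, z[:, None] - z.mean(), X, base_noise * np.eye(n))
        r = np.exp(z_mu + z.mean())
    predict_f = lambda Xs: _gp_predict(X, t, Xs, np.diag(r))
    predict_r = lambda Xs: np.exp(_gp_predict(X, z[:, None] - z.mean(), Xs,
                                              base_noise * np.eye(n))[0] + z.mean())
    return predict_f, predict_r
```

The returned `predict_r` is what the heteroscedastic acquisition functions in the next section consume as r(x).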

Bayesian Optimisation with Aleatoric Noise Penalisation
Our heteroscedastic Bayesian optimisation problem may be framed as

x* = arg min_{x ∈ X} h(x),

where the black-box objective h, to be minimised, has the form

h(x) = α f(x) + (1 − α) g(x),

where f(x) is the black-box function of the principal objective, i.e. the objective corresponding to classical Bayesian optimisation where noise is not optimised, and g(x) is the latent heteroscedastic noise function which governs the magnitude of the noise at a given input location x. α is a parameter chosen, for the purposes of evaluation, by a domain expert that trades off the weight of the principal objective relative to the noise objective. It is worth noting that α is a parameter that affects only the evaluation of an algorithm and not the execution. The evaluation criteria, however, will dictate the optimal hyperparameters of the acquisition function.

Heteroscedastic Acquisition Functions
We investigate extensions of the expected improvement [33] acquisition criterion, which may be written in terms of the targets t and the incumbent best objective function value, η, found so far as

EI(x) = ∫ I(x) p(t | x) dt,

where p(t | x) is the posterior predictive marginal density of the objective function evaluated at x, and I(x) = max(η − t, 0) is the improvement over the incumbent best objective function value η. Evaluations of the objective are noisy in all of the problems we consider and so we use expected improvement with plug-in [36], the plug-in value being the GP predictive mean [37].
We propose two extensions to the expected improvement criterion. The first is an extension of the augmented expected improvement criterion of [38],

AEI(x) = EI(x) (1 − σ_n / √(var[t] + σ_n^2)),

where σ_n is the fixed aleatoric noise level. AEI was introduced as a heuristic for the optimisation of noisy functions. EI is recovered in the case that σ_n^2 = 0 and in the case that σ_n^2 > 0 AEI operates as a rescaling of the EI acquisition function, penalising test locations where the GP predictive variance is small relative to the fixed noise level σ_n^2.
We extend AEI to the heteroscedastic setting by exchanging the fixed aleatoric noise level for the input-dependent one:

HAEI(x) = EI(x) (1 − γ√r(x) / √(var[t] + γ^2 r(x))),   (2)

where r(x) is the predicted aleatoric uncertainty at input x under the MLHGP and var[t] is the predictive variance of the MLHGP at input x. γ in this instance is defined to be a positive penalty parameter for regions with high aleatoric noise.
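A vectorised sketch of HAEI on top of plug-in EI (for minimisation) is given below; the scipy dependency and the exact function signatures are our implementation choices, not prescribed by the paper.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, eta):
    """EI with plug-in incumbent eta, for minimisation."""
    sigma = np.maximum(sigma, 1e-12)           # guard against zero variance
    z = (eta - mu) / sigma
    return (eta - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def haei(mu, var, r, eta, gamma=1.0):
    """Heteroscedastic AEI: EI rescaled by 1 - gamma*sqrt(r)/sqrt(var + gamma^2 r)."""
    scale = 1.0 - gamma * np.sqrt(r) / np.sqrt(var + gamma**2 * r)
    return expected_improvement(mu, np.sqrt(var), eta) * scale
```

The two limiting behaviours stated in Propositions 1 and 2 can be checked directly: with negligible aleatoric noise HAEI matches EI, and with negligible epistemic uncertainty it collapses to zero.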
Proposition 1 (Limit of Large Epistemic Uncertainty). The HAEI acquisition function reduces to EI when the ratio of epistemic uncertainty to aleatoric uncertainty is much greater than γ^2.
Proof. Let k = var[t]/r(x) denote the ratio of epistemic to aleatoric uncertainty at an arbitrary input location x. Dividing the numerator and the denominator of the second term in the second factor of Equation 2 by √r(x) yields

HAEI(x) = EI(x) (1 − γ / √(k + γ^2)).   (3)

Taking the limit analytically as k tends to infinity and assuming finite γ recovers the expected improvement acquisition.
Proposition 2 (Limit of Large Aleatoric Uncertainty). The HAEI acquisition function goes to zero as the ratio of epistemic uncertainty to aleatoric uncertainty goes to zero.
Proof. Taking the limit as k tends to zero in Equation 3 yields

lim_{k→0} HAEI(x) = EI(x) (1 − γ/√(γ^2)) = 0.

Remark. In the limit of large aleatoric uncertainty there is an approximation that is linear in k for the HAEI scaling factor.
Letting S(k) = 1 − γ/√(k + γ^2), such that HAEI(x) = EI(x) S(k), consider the MacLaurin expansion of S(k):

S(k) = k/(2γ^2) − 3k^2/(8γ^4) + O(k^3).

Figure 3: The HAEI scaling factor S(k), written as a function of k for different values of γ. When k, the ratio of epistemic to aleatoric uncertainty, is small, the scaling factor goes to zero to reflect the penalty for regions of high aleatoric uncertainty; γ controls the decay rate of this penalty. Also shown is the linear approximation to the scaling factor for γ = 10.
Dropping terms of O(k^2) and higher we obtain

S(k) ≈ k/(2γ^2).

This approximation may be used when k is small relative to γ and could provide guidance in setting the γ parameter if prior knowledge about k and the desired trade-off between the principal and noise objectives is available. In Figure 3 we provide insight into the effect that different values of γ have on the scaling factor S(k).
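The scaling factor and its first-order approximation are simple to verify numerically:

```python
import numpy as np

def scaling_factor(k, gamma):
    """HAEI scaling factor S(k) = 1 - gamma / sqrt(k + gamma^2)."""
    return 1.0 - gamma / np.sqrt(k + gamma**2)

def scaling_factor_linear(k, gamma):
    """First-order MacLaurin approximation S(k) ~ k / (2 gamma^2), valid for small k."""
    return k / (2.0 * gamma**2)
```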
In addition to HAEI, we propose a simple modification to EI that explicitly penalises regions of the input space with large aleatoric noise. We call this acquisition function aleatoric noise-penalised expected improvement (ANPEI) and denote it

ANPEI(x) = β EI(x) − (1 − β)√r(x),   (4)

where β is a scalarisation constant. In the multiobjective optimisation setting a particular value of β will correspond to a point on the Pareto frontier. We showcase the advantages of both HAEI and ANPEI acquisition functions in conjunction with the MLHGP surrogate model in section 5.
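ANPEI is a one-line scalarisation of plug-in EI and the predicted noise standard deviation; a sketch (for minimisation, with scipy used for the Gaussian density):

```python
import numpy as np
from scipy.stats import norm

def anpei(mu, var, r, eta, beta=0.5):
    """Aleatoric noise-penalised EI: beta * EI(x) - (1 - beta) * sqrt(r(x))."""
    sigma = np.maximum(np.sqrt(var), 1e-12)
    z = (eta - mu) / sigma
    ei = (eta - mu) * norm.cdf(z) + sigma * norm.pdf(z)   # plug-in EI
    return beta * ei - (1.0 - beta) * np.sqrt(r)          # noise penalty
```

Given two candidates with identical predictive mean and variance, the one with lower predicted aleatoric noise scores higher, which is exactly the intended behaviour.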

Implementation
Experiments were run using a custom NumPy [39] implementation of GP regression and MLHGP regression. All code to reproduce the experiments is available at https://github.com/Ryan-Rhys/Heteroscedastic-BO. The squared exponential kernel was chosen as the covariance function for both the homoscedastic GP as well as G_1 and G_2 of the MLHGP. Across all datasets, the lengthscales, ℓ, of the homoscedastic GP were initialised to 1.0 for each input dimension. The signal amplitude σ_f^2 was initialised to a value of 1.0. The lengthscale, ℓ, of G_2 of the MLHGP [35] was initialised to 1.0, and the initial noise level of G_2 was set to 1.0. The EM-like procedure required to train the MLHGP was run for 10 iterations and the sample size required to construct the variance estimator producing the auxiliary dataset was 100. All standard error confidence bands are computed using 50 independent random seed initialisations. Hyperparameter values, including the noise level of the homoscedastic GP, were obtained by optimising the marginal likelihood using the scipy implementation of the L-BFGS-B optimiser [40], taking the best of 20 random restarts. The objective function is

h(x) = f(x) − g(x)

for the one-dimensional sin wave experiment, which is a maximisation problem and as such has a subtractive penalty for regions of large noise. For the remaining experiments, which are minimisation problems, the objective is

h(x) = f(x) + g(x).

The sin wave and Branin-Hoo tasks are initialised with 25 and 100 data points respectively, drawn uniformly at random within the bounds of the design space. The soil and FreeSolv experiments are initialised with 36 and 129 data points respectively, drawn uniformly at random from the datasets. α is set to 0.5 for all experiments while β is set to 0.5, 1/11, 0.5 and 0.5 for the sin, Branin-Hoo, soil and FreeSolv experiments respectively. γ is set to 1, 500, 1 and 1 for the sin, Branin-Hoo, soil and FreeSolv experiments respectively.
We run 5 acquisition functions in all experiments: random sampling, homoscedastic EI, AEI, HAEI and ANPEI. Homoscedastic EI is included as a baseline to demonstrate the difference that consideration of aleatoric noise makes to the optimisation of the objective. AEI is included to demonstrate the difference that consideration of heteroscedastic aleatoric noise makes, and random sampling is included as a baseline as it is known to be highly competitive with Bayesian optimisation in noisy settings.

Heteroscedastic Sin Wave Function
The objective function has the form h(x) = f(x) − g(x), where f(x) = sin(x) + 0.2x + 3 and g(x) = 0.5x. In this instance α from subsection 5.1 has a setting of 0.5 but we omit it explicitly as the objectives have equal weight. Over the course of the experiment, noisy samples of the form t = f(x) + g(x)ε, with ε ∼ N(0, 1), are observed. The problem setup is depicted in Figure 4 and Figure 5. The Bayesian optimisation problem is constructed such that the first maximum in Figure 4(a) is to be preferred as samples from this region of the input space will have low aleatoric noise. The black-box objective in Figure 4(c) illustrates this trade-off. In Figure 6 we compare the performance of all surrogate model/acquisition function combinations. We observe the low aleatoric noise-seeking behaviour of HAEI and ANPEI on g(x) as well as their ability to optimise the composite objective h(x).
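Under this description the toy problem can be written down directly; the sampling form t = f(x) + g(x)ε is our reading of the setup described above.

```python
import numpy as np

def f(x):
    """Principal objective: sin(x) + 0.2x + 3."""
    return np.sin(x) + 0.2 * x + 3.0

def g(x):
    """Heteroscedastic noise level: 0.5x (grows linearly with x)."""
    return 0.5 * x

def h(x):
    """Composite objective for the maximisation task: subtractive noise penalty."""
    return f(x) - g(x)

def sample(x, rng):
    """Noisy observation t = f(x) + g(x) * eps, eps ~ N(0, 1)."""
    return f(x) + g(x) * rng.normal(size=np.shape(x))
```

At x = 0 the noise level vanishes, so samples there are exactly f(0) = 3; this is the low-noise region the heteroscedastic acquisitions should favour.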

Heteroscedastic Branin-Hoo Function
In the second experiment we consider the objective h(x) = f(x) + g(x), with an additive penalty because the task is a minimisation problem, and an α setting of 0.5 for equal-weight objectives. Here f(x) is the standardised Branin-Hoo function introduced in [36] (Equation 6) and g(x) is the heteroscedastic noise function (Equation 7). Samples are again generated according to t = f(x) + g(x)ε with ε ∼ N(0, 1). The problem setup is shown in Figure 7 and the performance of all surrogate model/acquisition function pairs is depicted in Figure 8. The gulf in performance between the heteroscedastic and homoscedastic surrogate models is more pronounced in this case because the noise function is more severe relative to the sin wave problem.

Soil Phosphorus Fraction Optimisation
In this experiment we consider the optimisation of the phosphorus fraction of soil. Soil phosphorus is an essential nutrient for plant growth and is widely used as a fertiliser in agriculture. While the amount of arable land worldwide is declining, global population is expanding concomitantly with food demand. As such, understanding the availability of plant nutrients that increase crop yield is a topic worthy of attention. To this end, [17] have curated a dataset on soil phosphorus, relating phosphorus content to variables such as soil particle size, total nitrogen, organic carbon and bulk density. We choose to study the relationship between bulk soil density and the phosphorus fraction, the goal being to minimise the phosphorus content of soil subject to heteroscedastic noise. In lieu of performing a formal test for heteroscedasticity, we provide evidence that there is heteroscedasticity in the dataset by comparing the fits of a homoscedastic GP and the MLHGP in Figure 2 and provide a predictive performance comparison based on negative log predictive density values in Appendix A.
In this problem, we do not have access to a continuous-valued black-box function or a ground truth noise function. As such, the surrogate models were initialised with a subset of the data and the query locations selected by Bayesian optimisation were mapped to the closest datapoints in the heldout data. The following kernel smoothing procedure was used to generate pseudo ground-truth noise values: (1) Fit a homoscedastic GP to the full dataset.
(2) At each point x_i, compute the corresponding squared error s_i^2 = (y_i − µ(x_i))^2. (3) Estimate variances by computing a moving average of the squared errors, where the relative weight of each s_i^2 was assigned with a Gaussian kernel. The performances of heteroscedastic and homoscedastic Bayesian optimisation are compared in Figure 9. Given that regions of low phosphorus fraction coincide with regions of small aleatoric noise, we apply an α value of 1/6 to the composite objective h(x) to admit a finer granularity for distinguishing between degrees of low aleatoric noise in the solutions.
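Step (3) amounts to kernel smoothing (Nadaraya-Watson weighting) of the squared residuals; a sketch in one dimension, where the `bandwidth` parameter is our own hypothetical choice rather than a value reported in the paper:

```python
import numpy as np

def smoothed_noise_estimates(x, sq_errors, bandwidth=0.1):
    """Gaussian-kernel moving average of squared errors s_i^2.

    x          -- (n,) 1-D input locations
    sq_errors  -- (n,) squared residuals s_i^2 = (y_i - mu(x_i))^2
    bandwidth  -- smoothing scale (assumed value, not from the paper)
    """
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    w /= w.sum(axis=1, keepdims=True)       # normalise weights per point
    return w @ sq_errors                    # variance estimate at each x_i
```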

Molecular Hydration Free Energy Optimisation
We perform a retrospective virtual screening experiment with the aim of identifying molecules with favourable hydration free energy, a property important in determining the binding affinity of a drug candidate. Experiments were performed with an initialisation of 129 out of the 642 molecules in the FreeSolv dataset [41,16] over 10 iterations of data collection. Unlike the soil phosphorus fraction dataset, ground truth measurement error (aleatoric noise g(x)) values are available for the FreeSolv dataset. The remaining 513 molecules were reserved as a heldout set where at each iteration of data collection one of the heldout molecules was selected. Chemical fragments computed using RDKit [42] were used as the molecular representation based on the fact that these global features, unlike local Morgan fingerprints, act as good predictors of the hydration free energy. The fragment features were projected down to 14 components using principal component analysis, retaining more than 90% of the variance on average across random trials. The results are shown in Figure 10. Compared to previous experiments, the noise is smaller in this instance relative to the magnitude of the hydration free energy (signal-to-noise ratio of approximately 10) and as such the heteroscedastic modelling problem is more difficult, leading to only very marginal gains in obtaining low noise solutions. While ANPEI obtains the lowest objective function value over the Bayesian optimisation trace, the results are unlikely to be statistically significant according to the standard error bands.
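The projection step can be reproduced with a plain SVD-based PCA. The fragment features themselves require RDKit, so the sketch below uses synthetic data in their place and reports the retained variance fraction that the paper checks against its 90% threshold.

```python
import numpy as np

def pca_project(X, n_components=14):
    """Project features onto the top n_components principal components via SVD,
    returning the projection and the fraction of variance retained."""
    Xc = X - X.mean(axis=0)                       # centre the features
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[:n_components].T               # scores in component space
    explained = (S[:n_components] ** 2).sum() / (S ** 2).sum()
    return proj, explained
```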

Heteroscedastic Acquisition Function Hyperparameters
The β hyperparameter of ANPEI in Equation 4 and the γ hyperparameter of HAEI in Equation 3 are designed to modulate the avoidance of aleatoric noise in the acquisitions. In Figure 11 we offer some intuition as to the effect of various settings of β and γ by examining the heteroscedastic Branin-Hoo function introduced in subsection 5.3. The results demonstrate that the performance of the algorithms is strongly dependent on the setting of the β hyperparameter for ANPEI, whereas γ is less influential on the performance of HAEI. It is worth noting from Figure 11 that if the noise objective is more important than the principal objective by a factor of 10, then the value of β should be 1/11.

Robustness Experiments Summary
The experiments of this section provide strong evidence that modelling heteroscedasticity in Bayesian optimisation is a useful approach for problems in which there is a strong degree of aleatoric noise present. The ANPEI acquisition tends to outperform HAEI on the majority of the tasks where there is a small degree of aleatoric noise whilst the acquisitions are more evenly matched when the extent of the aleatoric noise is high. The outstanding question for these methods, however, is how well they perform on tasks where heteroscedastic noise is not present. Such a situation may easily arise for real-world problems where the noise properties of the task are a priori unknown and as such, it is important to ascertain whether there is a deleterious effect on performance in noiseless and homoscedastic noise settings.

Ablation Study on Noiseless, Homoscedastic Noise and Heteroscedastic Noise Tasks
In this section we perform an ablation study where components of the ablation constitute different noise properties. We examine the noiseless case as a base task before adding first a homoscedastic noise component and second, a heteroscedastic noise component. Additionally, we examine the effect of the size of the initialisation grid on performance in the heteroscedastic noise tasks.

Ablation
The ablation study makes use of three synthetic optimisation functions: the Branin-Hoo function, the Hosaki function and the Goldstein-Price function. The form of the Branin-Hoo function is the same standardised Branin-Hoo function introduced in Equation 6 with the heteroscedastic noise function given in Equation 7. The Hosaki function, defined on the domain x_1, x_2 ∈ [0, 5], is

f(x_1, x_2) = (1 − 8x_1 + 7x_1^2 − (7/3)x_1^3 + (1/4)x_1^4) x_2^2 exp(−x_2).

To facilitate the GP fit, the Hosaki function is subsequently standardised by its mean (0.817) and standard deviation (0.573). The accompanying noise function is given in Equation 8. The logarithmic Goldstein-Price function [36] is likewise paired with its own noise function. For clarity, only the results of the Hosaki function are presented in the main paper with the Branin-Hoo and Goldstein-Price results presented in Appendix B. The Hosaki function is visualised in Figure 12. The value of β for ANPEI is set to 0.5 and the value of γ is set to 500 for all Hosaki function experiments.

Noiseless Case
In this case, the synthetic functions do not possess any observation noise and the optimisation function corresponds to the situation in Figure 12(a). 9 points sampled uniformly at random are used for initialisation and the results are displayed in Figure 13. As expected, all Bayesian optimisation methods outperform random search in the noiseless case. In this example it is unclear as to whether heteroscedastic Bayesian optimisation methods are detrimental as HAEI performs best whereas ANPEI performs worst.

Homoscedastic Noise Case
In this case the functions are subject to homoscedastic noise of the form ε/25, where ε is noise sampled from a standard Gaussian N(0, 1). The GP surrogates are again initialised with 9 points. The results are displayed in Figure 14. The Bayesian optimisation methods perform worse in the homoscedastic noise case relative to the noiseless case although the rank order of the methods mirrors that of the noiseless case.

Heteroscedastic Noise
In the heteroscedastic noise case the Hosaki function is subject to the noise function given in Equation 8 and visualised in Figure 12. 144 points were used to initialise the GP surrogates. The results are shown in Figure 15. In this instance, given that the extent of heteroscedastic noise is very strong (relative to the homoscedastic noise case), random search is highly competitive with the Bayesian optimisation methods. ANPEI however, is the best-performing algorithm. The large number of initialisation points chosen for this experiment reflects one limitation of the heteroscedastic surrogate approach; for the MLHGP to effectively learn a decomposition of the function into signal and noise components it needs access to more samples. As such, this merits an investigation into the effect of the number of samples on the performance of the heteroscedastic acquisitions.

Effect of Initialisation Set Size
The effect of the size of the initialisation set on the heteroscedastic Branin-Hoo task is investigated in Figure 16. The value of β used for ANPEI is 1/11 and the value of γ used for HAEI is 500. The performance of the heteroscedastic acquisitions ANPEI and HAEI is observed to improve as the size of the initialisation set increases. In contrast, the homoscedastic methods EI and AEI do not improve on obtaining access to more samples as they are unable to model the heteroscedastic noise component of the task.

Conclusions from Ablation Experiments
Synthesising the results from the additional ablation experiments in Appendix B some trends may be observed: (i) All Bayesian optimisation methods outperform random search in the noiseless case and homoscedastic noise cases on aggregate across the three synthetic functions.
(ii) On aggregate there is no significant difference between Bayesian optimisation methods in the noiseless or homoscedastic noise cases (HAEI marginally outperforms ANPEI on 2/3 noiseless tasks and 2/3 homoscedastic noise tasks).
(iii) The heteroscedastic acquisitions ANPEI and HAEI perform competitively on the noiseless and homoscedastic noise tasks most likely because the MLHGP is capable of effecting nonstationary behaviour by "fantasising" heteroscedastic noise. As such, the MLHGP surrogate may be achieving enhanced flexibility relative to the homoscedastic GP in this setting.
(iv) The heteroscedastic acquisitions tend to outperform other Bayesian optimisation approaches on the heteroscedastic noise tasks although crucially this depends on the size of the initialisation set. In order to detect heteroscedastic noise the MLHGP surrogate needs access to more samples relative to the noiseless and homoscedastic cases.
In summary, the experiments would appear to show that there is no significant downside to employing a heteroscedastic surrogate and acquisition function on noiseless tasks or tasks with homoscedastic noise save for the increased training time for the model.

Conclusions
We have presented an approach for performing Bayesian optimisation with the explicit goal of minimising aleatoric noise in the suggestions. We posit that such an approach can prove useful for the natural sciences in the search for molecules and materials that are robust to experimental measurement noise. The synthetic function ablation study highlights no particular downside to the use of the MLHGP in conjunction with ANPEI or HAEI in cases where the noise structure of the problem is a priori unknown, i.e. the black-box optimisation problem is either noiseless or homoscedastic. Nonetheless, we anticipate that this type of approach may be particularly relevant for the experimental natural sciences where noiseless objectives or those with homoscedastic noise are highly uncommon. In terms of concrete recommendations on when to apply the algorithm, we foresee the best performance in situations where the user has access to a moderately-sized initialisation set in order to provide the MLHGP with enough samples to distinguish heteroscedastic noise from intrinsic function variability. There are a number of possible extensions to the current approach which may facilitate its application to high-dimensional datasets and act as fruitful sources for future work: (1) Surrogate Model: One disadvantage of the MLHGP model is the lack of convergence guarantees for the EM-like procedure required for fitting. Various other forms of heteroscedastic GP exist [43,44,45,46,47,48,49] and have demonstrated success in modelling applications [50,51,52,53]. Of particular interest for real-world problems are scalable heteroscedastic GPs [54,55] which could circumvent the computationally-intensive bottleneck of fitting multiple exact GPs as a subroutine of the MLHGP Bayesian optimisation procedure.
(2) Advances in Surrogate Model Machinery: Advances in areas such as efficient sampling of GPs [56] are liable to yield improvements to sampled-based acquisition functions such as Thompson sampling [57] while fully Bayesian approaches to hyperparameter estimation for sparse GPs [58] are liable to yield improvements in model fitting procedures.
(4) Acquisition Function Optimisation: Recent developments in acquisition function optimisation including Monte Carlo reformulations [64,65], compositional optimisers [66,65] and tight relaxations [67] of common acquisition functions have the potential to yield gains in empirical performance.
(6) Approaches for Molecular Bayesian Optimisation: In relation to molecules, the use of tailored GP kernels such as Tanimoto kernels [71,72] and more expressive dimensionality reduction techniques [73] could lead to performance gains and enhanced scalability respectively.
(7) Exploration in the Noise Objective: Incorporating exploration in the noise objective in the multi-objective setting as in [22].
Lastly, a further use-case of the machinery developed in this paper is obtained by turning the noise minimisation problem into a noise maximisation problem. As an example, in materials discovery, we may wish to be antifragile [74] towards high aleatoric noise, i.e. to derive benefit from it. In an application such as the search for performant perovskite solar cells, we are faced with an extremely large compositional space, with millions of potential candidates possessing high aleatoric noise for identical reproductions [75]. In this instance we may want to guide search towards a candidate possessing a high photoluminescence quantum efficiency with high aleatoric noise. If the cost of repeating material syntheses is small relative to the cost of the search, the large aleatoric noise will present opportunities to synthesise materials possessing efficiencies far in excess of their mean values.

Acknowledgements
The authors would like to thank James T. Wilson for discussion about the experimental setup and applications domains of heteroscedastic Bayesian optimisation, Henry Moss for useful discussions regarding Bayesian optimisation in the noisy setting and Luke Corcoran for discussions relating to the limiting behaviour of the HAEI acquisition. We would additionally like to thank the anonymous reviewers who were instrumental in improving the quality of the empirical analysis as well as the clarity of the manuscript. VL is funded by The Alan Turing Institute Doctoral Studentship under the EPSRC grant EP/N510129/1.

Appendix A. Heteroscedasticity of the Soil Phosphorus Fraction Dataset
Table A1 is used to demonstrate the efficacy of modelling the soil phosphorus fraction dataset using a heteroscedastic GP. The heteroscedastic GP outperforms the homoscedastic GP on prediction based on the metric of negative log predictive density (NLPD), which penalises both over- and under-confident predictions.

Appendix C. Kernel Comparison
Results for different kernels are shown in Figure C1, Figure C2 and Figure C3 for the Branin-Hoo, Goldstein-Price and Hosaki functions respectively. There is no significant difference in performance using each kernel save for the Branin-Hoo function, where ANPEI underperforms using the somewhat rougher exponential kernel.