Wasserstein enabled Bayesian optimization of composite functions

Bayesian optimization (BO) based on the Gaussian process model (GP-BO) has become the most used approach for the global optimization of black-box functions and computationally expensive optimization problems. BO has proved its sample efficiency and its versatility in a wide range of engineering and machine learning problems. A limiting factor in its applications is the difficulty of scaling over 15–20 dimensions. In order to mitigate this drawback, it has been remarked that optimization problems can have a lower intrinsic dimensionality. Several optimization strategies, built on this observation, map the original problem into a lower dimension manifold. In this paper we take a novel approach mapping the original problem into a space of discrete probability distributions endowed with a Wasserstein metric. The Wasserstein space is a non-linear manifold whose elements are discrete probability distributions. The input of the Gaussian process is given by discrete probability distributions and the acquisition function becomes a functional in the Wasserstein space. The minimizer of the acquisition functional in the Wasserstein space is then mapped back to the original space using a neural network. Computational results for three test functions with dimensionality ranging from 5 to 100, show that the exploration in the Wasserstein space is significantly more effective than that performed by plain Bayesian optimization in the Euclidean space and its advantage grows with the dimensions of the search space.


Motivation
The challenge of Bayesian Optimization (BO) in high dimensional problems has been addressed mapping it into low-dimensional problems defined on subsets of variables. Kandasamy et al. (2015), Moriconi et al. (2020) or exploiting a lower intrinsic dimensionality. To tackle the issue of high dimensionality a different approach is proposed in the present paper mapping the original problem into a space of discrete probability distributions.
We consider the optimization of a black-box, expensive, multi-extremal function f (x): where X is the search space and neither gradient nor convexity information are available.
Consider the following composite function for i = 1, … , n where h 1 (x), … , h n (x) is the univariate point cloud associated to x.
In the specific case of a linear scalarization of a multi-objective problem, f (x) = ∑ n i=1 i h i (x) and each vector h 1 (x), … , h n (x) is the point cloud associated to x . Another class of problems which yield naturally a distributional representation of a candidate solution are simulation-optimization problems. This is the case in which the objective function is the average performance of a system where f (x, w) is the value of f (x) under the environmental condition w and p(w) represents the "relevance" of condition w (probability of its occurrence or the fraction of time that condition w occurs). Another setting is the hyperparameter optimization of a machine learning algorithm via k-fold cross validation, with f (x, w) a loss function (e.g., predictive accuracy, fairness, explainability, etc.) on (2) f (x) = C h 1 (x), … , h n (x) 1 3 fold w using hyperparameter configuration x . The point clouds lay onto a metric space-i.e., the space of discrete probability distributions-in which a metric defines the distance between two points in that space, with the properties of positiveness, symmetry, and triangle inequality. Due to the nature of elements belonging to this space, the most appropriate distance between them is a distance between probability distributions. In this paper we focus on the Wasserstein (WST) distance and embed the original optimization problem in the metric space whose elements are discrete probability distributions, which we call Wasserstein space W . Wasserstein distance, also known as the Optimal Transport (OT) distance, is a mathematically principled method to align probability distributions. Originated by a paper of Monge (1781), it received its linear programming formulation in Kantorovich (1942). A complete mathematical formulation is in Villani (2009) while Peyré and Cuturi (2019) offer a complete review of recent theoretical and computational advances. The Wasserstein distance has been widely applied in machine learning from shape analysis (Gangbo and McCann 2000) to image interpolation, domain adaptation (Redko et al. 2019), parameter estimation in simulation models (Öcal et al. 2019), structured data on graphs (Vayer et al. 2018), active learning (Frogner et al. 2019), and adversarial networks (Arjovsky et al. 2017). The Wasserstein distance has many important properties: its representational capability has been shown by embedding in W a variety of complex objects like images, networks, and words. An explanation of the interest in the Wasserstein distance is that Euclidean embeddings of data are flawed as they account for the correspondence of each feature independently of the other features. Bayesian Optimization (BO) algorithms have so far largely focused on problems where inputs are represented as numerical and categorical variables in Euclidean spaces. A significant advance is provided in  which extends BO to Riemannian manifolds.
In this paper we extend the distributional approach to BO by encoding the geometry of the data generated in the sequential optimization process and performing the search in W . The key advantage of BO is its well-known sample efficiency. The main question considered in our study is whether its sample efficiency can be further improved by embedding the optimization process in W . An important result is the development of a multi-layer perceptron (MLP) to map the results obtained by BO in W back to original search space X . The resulting algorithm BOWS (Bayesian Optimization in the Wasserstein Space), at least for the test functions considered, outperforms "Euclidean" BO already in 10 dimensions and its competitive edge increases substantially as the dimension of the search space increases. We have only considered the case in which the probability measures are univariate discrete probability distributions (aka point clouds).

Related works
The use of Gaussian Processes with probabilistic inputs has been proposed in Candelieri et al. (2022a), but the use of the WST distance in optimization problems is still a sparsely explored field. The issue of placing optimization in the space of probability distributions has been analyzed in Zhang et al. (2018) where policy optimization in reinforcement learning is modelled using Wasserstein gradient flows and Zhang et al. (2019) where the problem of approximating the posterior distribution in Thompson sampling is solved via Wasserstein gradient flow providing also a theoretical guarantee of convergence. Since Thompson sampling (TS) is used both for sampling a Gaussian process as also as acquisition function in the Bayesian optimization framework, an optimal transport based efficient computational strategy for performing TS is directly relevant for optimization. TS is a sequential optimization process based on the following steps: updating a posterior depending on the set of observations, drawing a sample from posterior as an approximation to the function to be minimized, minimizing this sample function to identify the next candidate point and evaluating the objective function at that point. However, calculating exact posterior distributions is intractable for all but the simplest models. Therefore, the development of computationally efficient approximate methods for the posterior distributions is a crucial problem for scalable TS.
In Gong et al. (2019) and Liu and Wang (2016) it is shown how batch-BO enables to transform the optimization of the acquisition function into finding the optimal distribution in the space of all distributions. The resulting quantile variational optimization is then solved using Stein variational gradient descent. The use of gradient flows in the Wasserstein space has been proposed in Salim et al. (2020) for the identification of OT maps. Wasserstein gradient flows have been also suggested in Rout et al. (2021) and Liutkus et al. (2019) for solving the optimization problems arising in generative modelling. Another problem which has been formulated as optimization over data-generating joint probability distributions is the dataset transformation from unlabelled to labelled (Alvarez-Melis and Fusi 2021) using a particle-based method. The same approach has been proposed in Alvarez-Melis and Fusi (2020) for transfer learning via OT.
The theoretical framework of the previous papers has been reconsidered and focused on BO (Crovini et al. 2022) where the authors propose a batch sequential algorithm based on the Expected Improvement (EI) acquisition function, which is transformed into an acquisition functional defined over a space of probability measures. The key result is that this functional is concave, according to the strong factorization assumption that the probability measure of the batch points takes the form of a product measure. However, the concavity result is derived only for the batch-EI. The optimization of the acquisition functional is then based on its gradient flow over W . Two formulations of the gradient flow on the space of probability measures are considered: the Stein gradient flow and the particle-based Wasserstein gradient flow. The estimation of batch-EI and the computation of its gradient flow are quite complex involving the solution of the non-linear Fokker-Plank equation.
Other results are related to BO over Riemannian manifolds (Jaquier and Rozo 2020) focused on Robot Learning and Jacquier et al. (2020) focused on high dimensional BO, proposing an approach that builds, on the theory of Riemannian manifolds, a representation of the objective function in a low dimensional latent space.
The issue of Distributionally Robust Optimization (DRO) is analysed in Lau and Liu (2022) who propose a Wasserstein barycentric ambiguity set and ). Closer to the focus of our paper is Kandasamy et al. (2018) which use a kernel induced by the WST distance in a BO framework to search for the best neural network architecture.
Another possible approach for learning from distributions is to consider Reproducing Kernel Hilbert Spaces (RKHS). The kernels associated to probability distributions, in particular the Hilbertian kernel on probability measures have been first proposed in Hein and Bousquet (2005). A solution to the problem in the setting of Hilbert spaces has been provided in Peyré and Cuturi (2019). It must be remarked that in the case of multivariate distributions the construction of positive definite kernels on sets of probability measures is not straightforward.

Our contributions
A key contribution of this paper is to show that mapping candidate solutions from the search space into univariate discrete probability measures, specifically point clouds, associated to the components of the objective function, can be applied to obtain a BO algorithm where the Gaussian process and the acquisition function are defined over W . The mapping back from W into the original search space X is accomplished by a neural network. An indication of the convergence is obtained from a measure of concentration around the global optimum in W as the ambiguity set built upon the WST distance between point clouds. Preliminary computational results on additive benchmark functions show that the relative performance of the BOWS algorithm improves over plain BO, both in terms of function evaluations and wall-clock time, as the dimension of the search space X increases.

Organization of the paper
The contents of the paper are organized as follows. Section 2 provides background knowledge about Wasserstein distance and the optimal transport formulations. Section 3 establishes the BOWS algorithm and proposes a neural network which maps the probability distributions from W into X . Section 4 describes the experimental set-up including the algorithms considered and the parameters values for benchmarking BOWS and the computational results over the test functions. Section 5 provides conclusions and perspectives.

The Wasserstein distance between point clouds
Consider two univariate point clouds, respectively denoted with = h (1) , … , h (n) and = g (1) , … , g (m) . Since the WST distance is originally defined between two probability measures, the two point clouds are mapped into discrete probability distributions. Given a point cloud , the associated probability measure is given by: Given two point clouds, their WST distance is W 2 ( , ) = min P∈U(n,m) √ ⟨P, C 2 ⟩ , with C 2 ∈ ℜ n×m is the cost matrix between the points of the two clouds and P is the coupling matrix where P i,j denotes the weight of the assignment of h (i) tog (j) , and U is the set of all the possible assignm e n t s , t h a t i s U(n, m) = P ∈ ℜ n×m + ∶ P1 m = 1 n 1 n , P T 1 n = 1 m 1 m .
The optimal coupling P * = argmin P∈U(n,m) √ ⟨P, C 2 ⟩ can be obtained through the simplex algorithm. Since our case study considers univariate point clouds with the same cardinality (i.e., the number n of objective function's components), the computation of W 2 ( , ) can be simplified as: where h (i) and g (i) denote the sorted samples of the two point clouds. (

The "vanilla" Bayesian optimization
A Gaussian Process (GP) is a probability distribution over the mean function of the GP and k x, x ′ is the covariance function (aka kernel). Therefore, GP is as a collection of random variables, any finite number of which have a joint Gaussian distribution and f (x) can be considered as a sample from a multivariate normal distribution (Archetti and Candelieri 2019;Frazier 2018).
.,N the associated function values, possibly noisy with a zero-mean Gaussian noise ∼ N 0, 2 . Then the posterior predictive mean ( ) and standard deviation 2 ( ) , conditioned on X 1∶N and y 1∶N , are given by the following equations: The acquisition function manages the balance between exploration and exploitation, it is the key driver of the sample efficiency of BO and is an important concept also outside machine learning (Candelieri et al. 2021). It drives the search of the new evaluation points towards regions of the search space with potential better values of the objective function either because value of ( ) is better or the uncertainty represented by 2 ( ) is high (or both). A widely used acquisition function is the Confidence Bound (Lower and Upper, respectively for minimization and maximization problems): where ξ is the parameter to manage the exploration/exploitation trade-off.

Preliminaries
First, we recall here, and also introduce, some useful notations: • y ∈ ℝ is the co-domain of the objective function f ( ). • W ⊂ ℝ n is the (unknown) co-domain of the objective function's observable components, that are h 1 ( ), … , h n ( )-or compactly the point cloud ∈ W. • ( ) and 2 ( ) are the predictive mean and variance of a GP defined over the space of point clouds, W , and computed according to (5) and (6) where ∈ X is replaced by ∈ W.
• ∶ W → X is a mapping from the space univariate probability distributions back to original search space.

The BOWS's GP
For the GP model, we have decided to adopt the (Euclidean) Squared Exponential (SE) kernel, operating on the space W , that is the n-dimensional space whom the point clouds belong to. Specifically, the (Euclidean) SE kernel is: with the so-called length-scale hyperparameter which is tuned via MLE. If ∈ the kernel is said isotropic, while it is anisotropic if ∈ n .
Although using a Euclidean-based kernel on W can seem a contradiction (Candelieri et al. 2022b) prove that using a Euclidean SE kernel between univariate probability measures is equivalent to using a non-stationary anisotropic Wasserstein-based SE kernel, that is: with W 2 2 H, H ′ computed as in (4).

The BOWS's acquisition function
As the test problems considered in the paper are minimization problems, we will use LCB as acquisition function. The main difference with respect to the vanilla BO is that here LCB is defined-as well as the GP-over W instead of X . Thus, minimizing LCB leads to the next point cloud ̂ (N+1) giving the best exploration-exploitation trade-off, that is: It is important to remark that, contrary to the search space X that is defined by the user and usually box-bounded, there not exists any preliminary information about Ω ⊂ W . The unknown search space problem is intractable to solve in practice. Therefore, we decided to dynamically set up the search space Ω , according to the point clouds observed so far. This kind of procedure-which is mandatory in our case-it has been anyway proposed in vanilla BO, quite recently and named "weakly specified" search space (Nguyen et al. 2017).

Mapping from W back to X
We need to map ̂ (N+1) back to X to obtain the associated value (N+1) and, consequently y (N+1) and the actual (N+1) . Indeed, it is important to remark that any possible mapping is anyway affected by some reconstruction error. The mapping ∶ W → X is performed by a MLP trained using the sets H 1∶N = (i) i=1∶N and X 1∶N = (i) i=1∶N as input and output, respectively. The number of layers of the MLP has been set to three and in each layer the number of neurons is max(n, d) . The MLP is retrained at each iteration or after a given number of iterations, according to the user's preferences and available computational budget. On the contrary, the GP model is always trained at each iteration (as usual in BO). The MLP model provides the new point (N+1) which yields y (N+1) = f (N+1) and concurrently the actual (N+1) .
The additional computational complexity of BOWS with respect to vanilla BO is given by the training of the MLP, mapping from W back to X . A rough indication of the computational complexity is O(m 1 × m 2 × m 3 × m 4 ) . Where m 1 is the number of epochs; m 2 is the number of training examples; m 3 is the number of objective function's components; m 4 is the number of neurons. The computational overhead due to working in W and the ensuing need to map back to X is substantial and explains why the wall-clock time of BOWS is poorer than vanilla BO. It is a reasonable cost to pay for the improvement in sampling efficiency as it will be shown in the computational results in the following section.

Experimental setting
The algorithms have been implemented using BoTorch (Balandat et al. 2020) a Python library for Bayesian Optimization part of the PyTorch framework. BoTorch provides an easy-to-use interface for defining, managing, and running sequential experiments and a modular interface for composing Bayesian optimization primitives as probabilistic models, acquisition functions, and optimizers. The computational results reported in this section have been obtained using UCB (with = 4 ) and a gradient based optimizer. Three test functions have been considered (Table 1) with dimensionalityd = 5, 10, 15, 20 . For each experiment 10 independent runs have been executed with 20d iterations and d initial points.

Experimental results
In this section, the computational results on the three test functions reported in Table 1 are presented, considering dimensionality d = 5, 10, 15, 20.
As shown in Table 2, considering Alpine01 and Vincent, BOWS has a better overall performance respect to vanilla BO; the advantage increases in higher dimensionality. In Fig. 1 is highlighted that BOWS converges faster to an optimal solution, in terms of iterations, particularly considering higher d . In the case of Michalewicz, BOWS generally performs worse than standard BO. The gap in performances decreases increasing the dimensionality.
To explain the different behaviour of BOWS with the Michalewicz test function we have to look at the MLP error. The error is defined in the Wasserstein space as In the case of Alpine01 and Vincent the error appears to slightly decrease with the increasing of the iterations (Fig. 2). This is coherent to the fact that increasing the iterations means a higher number of training points for the neural network. In the case of Michalewicz, the error shows a completely opposite behaviour, meaning that the MLP cannot properly map the function's components from W back to the search space X . This behaviour is particularly marked for higher dimensionality of the search space.
The main difference between Michalewicz and the other two test function is that the Michalewicz's components depend on the number of dimensions d , and in particular they get more complex as d increases (Fig. 3). Specifically, the number of local minima is d! . In the case of Alpine01 and Vincent the complexity of the components does not depend on the dimensionality d.
The difference in complexity of the functions' components can also be seen by looking at the correlation between (i) and (i) for i = 1, … , N . As shown in Table 3, in the case of Michalewicz, the Pearson correlation is much lower than the other two test functions, meaning a higher complexity in finding a mapping function.
Since mapping the Michalewicz's components back to the search space is more complex a possible solution is to increase the number of hidden layers of the MLP. Figure 4 and Fig. 5 show that with 5 hidden layers the performance of BOWS for Michalewicz improves and the MLP error decreases.

Conclusions and future works
The main conclusion is that a distributional representation of points in the search space as point clouds can be effectively applied to Bayesian optimization. The Wasserstein distance has been chosen because it's a metric, captures complex relationships between inputs, neighbourhood sizes and connectivity and provides geometrically meaningful distances. Computational experiments show, both in terms of function evaluations and wall clock time, how the new method in two out of three benchmark functions outperforms vanilla Bayesian optimization and its advantage increases with the dimension of the search space. Future works should address the following main issues: • Methodological advances to improve the optimization of the acquisition function considering also, from a theoretical standpoint, both the differentiability of the WST distance and the relation between the gradient flows of the objective function and the transport map. • A full analysis of the optimization problems which fit into the BOWS framework. The distributional approach is natural for simulation-optimization problems over discrete structures, sensor placement in physical and informational networks and stochastic vehicle routing. Also, the issue of high dimensionality and the underlying additive structure should be further analyzed.
Additional experiments are required for a more extensive numerical validation of the proposed approach.
Funding Open access funding provided by Università degli Studi di Milano -Bicocca within the CRUI-CARE Agreement.
Data availability All code and data used in this paper are available at https:// github. com/ andre apont i5/ bows.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will  need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.