Good practices for Bayesian Optimization of high dimensional structured spaces

The increasing availability of structured but high dimensional data has opened new opportunities for optimization. One emerging and promising avenue is the exploration of unsupervised methods for projecting structured high dimensional data into low dimensional continuous representations, simplifying the optimization problem and enabling the application of traditional optimization methods. However, this line of research has been purely methodological with little connection to the needs of practitioners so far. In this paper, we study the effect of different search space design choices for performing Bayesian Optimization in high dimensional structured datasets. In particular, we analyse the influence of the dimensionality of the latent space, the role of the acquisition function and evaluate new methods to automatically define the optimization bounds in the latent space. Finally, based on experimental results using synthetic and real datasets, we provide recommendations for the practitioners.


Introduction
Let X be an input space and f : X → R be a continuous black-box function. We are interested in solving the global optimization problem of finding the unknown minimum of f : (1) We make three assumptions: 1. We can query f using noisy queries y = f (x) + , where ∼ N(0, σ).
These kinds of problems are ubiquitous, as many problems in bioinformatics, chemical engineering and computer science involve optimizing structured objects, such as graphs or images. One important example is designing new molecules, which can be laborious since testing the chemical properties requires wet room experiments, that require manual work by an expert and expensive special equipment.
In traditional applications with a low number of continuous parameters, Bayesian Optimization (BO) is the de facto solution for these gradient-free black-box optimization problems. The core idea of BO is to build a surrogate probabilistic model that efficiently guides the sequential acquisition of new data. However, building and optimizing these probabilistic surrogates in structure high dimensional spaces is challenging and often leads to poor performance.
The recent increase in the availability of high volume datasets has made it possible to use semi-supervised deep generative models to efficiently embed structured high dimensional objects into a lower-dimensional Euclidean space. These models have enabled the use of gradient-free optimization methods for optimizing the structure in the low dimensional manifold (Griffiths and Hernández-Lobato [2017], Kusner et al. [2017]). However, this research has so far been merely methodological and with no emphasis on how to design the optimization task itself. There is no research on a) how to decide on the dimensionality of the low dimensional embedding and how this decision affects the optimization task; b) how to optimize the acquisition function in the low dimensional manifold; and c) how to balance between exploration and exploitation when selecting new points. We systematically study the effects of these design choices applied to a variety of high dimen-sional structured optimization tasks. We hope that our findings will help practitioners to better design their high dimensional structured optimization problems.
The rest of the paper has the following structure. Section 2 presents the related work and section 3 introduces the theoretical background. Section 4 describes the framework to perform BO by exploiting deep generative models and the main design choices analysed in the paper. Section 5, describes the experimental set-up and presents the main results. Finally, in section 6, the paper is concluded with discussion and recommendations for the practitioners.

Related work
Bayesian Optimization (BO) is considered as the method of choice for sample-efficient gradient-free optimization of low dimensional Euclidean spaces Brochu et al. [2010], Shahriari et al. [2015]. The biggest limitation of BO has been its scalability with respect to the dimensionality of the search space.
The earliest solutions to scale BO to higher-dimensional spaces are based on projecting the input space to low dimensional spaces using linear transformations. Wang et al. [2013] use random linear projections where the BO is performed. Garnett et al. [2013] optimize the linear projection during the optimization to further improve the performance. The main disadvantages of these methods are that they are only able to find linear manifolds Another strategy for high dimensional optimization is to better understand the structure of the search space and impose additional assumptions that simplify the problem. Kandasamy et al. [2015] assume that the search space is composed of disjoint low dimensional subspaces that can be optimized separately. Mutny and Krause [2018] extend the approach by allowing overlapping subspaces. However, these methods rely on hand-crafted assumptions that may be violated in most real-world applications.
The most recent approaches take advantage of deep generative models to both reduce dimensionality and take advantage of the structure of the latent space. These approaches assume access to a reservoir of unlabelled data that can be used in learning a nonlinear, low dimensional manifold from the data in an unsupervised way. Griffiths and Hernández-Lobato [2017] use a twostep approach by first training an Auto Encoder (AE) with all unlabelled data to find a low dimensional representation of the problem and then performing BO in the low-dimensional latent space. Kusner et al. [2017] extend the approach to better handle the uncertainty by resorting to deep generative models. In particular, rather than using an AE in the first stage, they resort to a Variational Auto Encoder (VAE) to model the uncertainty to the latent space. In the second stage, they use a Gaussian Process Latent Variable Model (GPLVM) to propagate the uncertainty of the latent space improving the overall performance.
The biggest problem with these two-step approaches is that the latent space learned while only using the unlabelled data might not be optimal for the optimization task. Eissman et al. [2018] address this issue by jointly learning the latent space with the labelled data and iteratively modifying the latent representation as new data is being collected. Tripp et al. [2020] further exploit the retraining of the latent space by weighing the samples used in training the VAE based on their observed values so that good observations have more importance in training the latent space.
The approaches combining deep generative models with Gaussian Processes seem the most prominent approach to the problem at the moment. The strength of the approach is in its ability to use the unlabelled data together with the labelled data. The use of the unlabelled data allows the use of non-linear manifolds for optimization and is thus suitable for complex real world problems. This is the reason why in this paper we concentrate on these methods. Specifically, we use VAEs as deep generative models due to them having become the de facto method for the problem. The popularity of VAEs has led to them being applied on different types of structured data, including but not limited to images [Hou et al., 2017], text [Yang et al., 2017], sound [Roberts et al., 2017] and structured physical objects like molecules [Gómez-Bombarelli et al., 2018]. This allows practitioners from many fields of work to benefit from our work. In the remaining of the paper, we analyze the effect of different design choices in the optimization approaches based on VAEs. There is a need for this kind of work as these approaches appear very promising to the practitioners, but the existing research does not yet study the effect of different design choices when applying these methods in practice.

Theory and derivations
In this section we introduce all necessary parts to define the general framework we will use for the analysis of Bayesian Optimization using deep generative models. The section first introduces Variational Auto Encoders (VAEs), then Gaussian Latent Variable Models (GPLVMs) and finally how to combine VAEs and GPLVMs.

Variational Auto Encoders
In Variational Auto Encoders (VAEs) the aim is to find a latent representation z ∈ R d for data x ∈ X . Let θ parametrize a probabilistic decoder p θ (x | z) with prior distribution p(z). The posterior distribution p θ (z | x) ∝ p θ (x | z)p(z) can be interpreted as a probabilistic encoder, which in most cases is intractable. In order to address this, the posterior needs to be approximated with a tractable distribution q φ (z | x), where φ parametrizes the encoder. The parameters θ and φ can jointly be learnt by maximizing the evidence lower bound (ELBO) which can be done with gradient descent as long as decoder and encoder approximation are differentiable with respect to θ and φ and are computable pointwise. A normal choice for the encoder distribution is multivariate normal, where mean and (usually) diagonal covariance are outputs of a neural network. Decoder distributions vary more, depending on th type of the data. Bernoulli distributions can be used to decode binary data, continuous Bernoulli for bounded data (Loaiza-Ganem and Cunningham [2019]) and (log) normal distribution for (half-bounded) continuous data.

Gaussian Process Latent Variable Models
Traditional BO approaches assume that the true latent black-box function f is a realization of random variables sampled from a Gaussian process, p(f ) = GP fully specified by a prior mean and some covariance function K [Rasmussen, 2003]. The prior mean, which is often zero, defines the prior mean of the latent function and the covariance function specifies the covariance of the latent function between any two points. The problem with full GPs is their scalability. Assuming N observations from the latent function, computing the posterior of a full GP requires inverting a N × N matrix.
GPs can be approximated, and made more scalable, by using inducing inputs. In this method we use inducing latent values u (at locations z u ) in the latent space instead of latent values f with observations y (at input locations z). Using this notation, the posterior of the data becomes The posterior of the inducing latent values, is intractable for general likelihood p(y | f ) and for efficient prediction needs to be approximated. Approximating it with normal q(u) = N u | m, L L , with general mean and general lower triangular matrix provides computationally tractable properties.
The GP log posterior likelihood can be approximated with a lower bound where p(u) = N u | µ prior , Σ prior , with µ prior and Σ prior , which are the gp prior mean and prior covariance. Furthermore, with p(y | f ) = N y | f, σ 2 , the likelihood inside the expectation in Equation (4) reduces as where A = K f u K −1 uu . This model is referred to as Stochastic Variational Gaussian process (svgp, Snelson and Ghahramani [2005], Titsias [2009]).
Since the latent space of a VAEs is uncertain, we can also add uncertainty to the locations z by assuming that the projections in the latent space follow an unknown distribution q φ (z | x) = N(z | µ φ (x), Σ φ (x)), parametrized by the encoder. This uncertainty can be added to the lower bound as This model, GP with uncertainty in the inputs, is called Gaussian Process Latent Variable Model (GPLVM) (Lawrence [2004]).

Gaussian Processes in the latent space of Variational Auto Encoder
To the best of our knowledge, there are two widely used alternatives for combining GPLVM with VAE. For general notation, let {X o , Y o } be the observed data and observations and X u be the unobserved data.

Training VAE and GPLVM disjointly
The original and simplest way of combining GPLVM with VAE was introduced by Kusner et al. [2017]. Their approach is to train the VAE with X u prior to any observations using equation (2). After training the VAE, its parameters are fixed and the GPLVM is trained by maximizing equation (6)

Training VAE and GPLVM jointly
An alternative approach is to train the parameters of the VAE and GPLVM jointly like Eissman et al. [2018]. Their approach uses separate cost functions for the labelled and unlabelled data. For labelled data costs of equations (2) and (6) are combined as, For unlabelled data, the cost defined in equation (2) is used.

Bayesian optimization with variational auto encoders
Bayesian Optimization (BO) is a gradient-free blackbox optimization method. The iterative steps of any BO algorithm are: 1) Train the probabilistic surrogate model, usually a Gaussian process or in our case GPLVM, using the available data; 2) Evaluate the black-box function at the maximum of the acquisition function; 3) Update the existing data with new evaluation; 4) Repeat steps 1, 2 and 3 until a certain stopping criterion is met. When dealing with high dimensional structured spaces, the first step also includes (re)-training the latent space either jointly or disjointly with the surrogate model. Also, the optimization bounds of the acquisition function need to be re-learned on every iteration as the mapping to the latent space constantly changes (see algorithm 1). In this section we discuss different design choices for applying BO to tasks in high dimensional structured spaces: 1) the dimensionality of the latent space, 2) the choice of the acquisition function and 3) the choice of the optimization bounds in the latent space.

The dimensionality of the latent space
Choosing the right dimensionality of the latent space is complex and task dependent. A too low dimensional latent space affects the quality of the samples in X produced by the decoder. On the other hand, a too high dimensional latent space makes fitting the GPLVM in the latent space harder and not as sample efficient. In addition, a too high dimensional space leads to overfitting and poor generalization. The aim of iteratively learning the latent space with the collected observations is to make the optimization task easier in the latent space, but it is yet an open research question how the methods perform when the latent space dimensionality is varied.
Algorithm 1 Bayesian Optimization with variational auto encoders. (Re-)Learn the encoder, decoder and GP parameters θ, φ using X u , X o and Y o 4: (Re-)Learn the space Z ∈ R d where the acquisition function is optimized using X u , θ and φ

5:
Find the next location z * in the latent space by maximizing the acquisition function z * = arg max z∈Z A(z)

6:
Project z * to the original data space as x * using the learned decoder parameters θ 7: Find label y * using f (x * ) 8: : until Evaluation budget is over or acquisition is lower than threshold

The choice of acquisition function
It has been deeply studied how different acquisition functions perform in low dimensional Euclidean spaces that are prevalent in traditional BO applications. All acquisition functions balance between exploration and exploitation; the tendency of sampling from regions with lots of uncertainty versus tendency of sampling from regions with known good values. Here we explore the role of the acquisition in structured highdimensional spaces. In particular, the acquisition functions used in this paper include Thompson Sampling (TS) (Thompson [1933], Chapelle and Li [2011]), Expected Improvement (EI), Probability of Improvement (PI) and Lower Confidence Bound (LCB). (Snoek et al. [2012])

Optimization bounds of the acquisition function
Unlike in the regular low dimensional BO, the selection of the optimization bounds of the acquisition function is a difficult design choice. Normally optimization bounds for the acquisition function are selected based on expert knowledge or physical constraints. As the latent space formed by the VAE is an abstraction and as such is not tied to the business domain of the problem at hand, selecting the bounds for it is much harder. So far, the de facto method has been to bound the optimization of the acquisition function by a hypercube containing the projection of the training data in the latent space.
To the best of our knowledge, no alternatives to this method have been explored.
We demonstrate three methods for restricting the optimization space. An easy and scalable approximation is to find a minimum volume n-ellipsoid, (x − x 0 ) T A(x − x 0 ) = 1, that contains (the means of) all the training data. An easy and scalable way of doing this is via the Khachiyan algorithm Moshtagh et al. [2005]. Another easy, but not as scalable way is to find a set of hyperplanes restricting the data and form a set of linear inequalities of the form Ax b. It is easy to find this set of inequalities by first finding a convex hull for the existing samples and then perform Delaunay triangulation for the edge points and use the edge points of each simplex to find the hyperplanes. The combined time complexity of these operations is O(n 2 ), where n is the total number of unlabelled points. The benefit of both these methods is that the optimization of the acquisition function can be performed in a convex set.
Third approach of limiting the optimization space is setting an upper limit to the allowed distance between a point in the latent space z and the expected value of the encoder of the expected value of the decoded z z − zp θ z xq φ (x|z) dx dz .
This approach guarantees that the uncertainty is reduced in the regions where the acquisition function is originally meant to be evaluated. A good strategy for selecting the upper bound of the Euclidean distance is to see how much all the points in the training data move and select e.g. the 90% percentile of these distances. The drawback of the approach is that the allowed region is not necessarily convex, making the optimization harder. All these strategies are visualized in figure 1 for a VAE trained with the Shape dataset (see details of the Shape data set in section 5).

Experimental Setup
All case studies are performed with the two variations of the high dimensional structured BO as described in 3.3. We used MXNet ( Figure 1: Different space restriction strategies for acquisition function optimization visualized from left to right ellipse, convex hull and distance method. In the distance method the maximum distance the point can move is computed as a 90% of the maximum of the move distances of latent space means of the train data. The white regions indicate the area where the acquisition function is optimized. The red continuous line visualizes the region which contains 99% of the data used to train the VAE.
learning rate 10 −3 . The GPLVM model has 150 inducing points and a squared exponential kernel with each latent dimension having its own length scale parameter. Learning rate 10 −1 is used for the GPLVM parameters. All parameters are trained using 'Adam'algorithm Kingma and Ba [2014]. Prior to starting the optimizations routine, the GPLVM model is initialized with 10 observations sampled uniformly at random from the training data. Following Tripp et al. [2020], the VAE parameters are allowed to change only every 10 iterations. The purpose of this is to save computation time and reduce overfitting.

Data sets
We design three structured high-dimensional optimization task based on the following datasets: Airline passenger dataset is a time-series data consisting of the number of monthly airline passengers from January 1949 to December 1960Box et al. [2011. The black-box function is the mean square error of a model fitted on 66% of the data on a test set consisting of 33% of the data. Following the experimental setting in Lu et al. [2018], the fitted model is a Gaussian Process whose kernel is generated by a grammar with four basis kernels (periodic, squared exponential, linear and rational quadratic) and two operators (+ and *) so that there are at maximum 4 kernel combined in the produced kernel.
2d shape area maximization dataset is an image dataset consisting of rotated black rectangles of a varying area on a background of 10 by 10 pixels. The black-box function is the area of the rotated shape. The dataset mimics the dSprites dataset (Matthey et al. [2017]) and is used for optimization in Tripp et al.
[2020]. The dataset simulates a situation, where the true optimum is not inside the training data and finding the optimum requires modifying the latent space.
Molecule dataset consists of valid molecules of carbon, oxygen, nitrogen and iron each containing up to 7 atoms in total and was first introduced by Fink and Reymond [2007]. The molecules in the dataset are described by the SMILES grammar (Weininger [1988]). SMILES molecules can be embedded in a low dimensional space using a VAEs as described by Kusner et al. [2017]. The black-box function to be optimized is the penalized water-octanol partition coefficient that mimics the drug-likeliness of a molecule Ertl and Schuffenhauer [2009] and is also used in Kusner et al. [2017].

Effect of the dimensionality of the latent space
To study the robustness of changing the dimensionality of the latent space d, we compare the optimization performance on the three datasets and two models when trying different dimensionalities d ∈ {3, 4, 5, 6}.
The acquisition space is restricted by a hypercube and the acquisition function is set to the Lower Confidence Bound. The results are visualized in figure 2. The results show a clear trend for all datasets and both methods. For Airline, Shape and SMILES the best performance is obtained with 4 dimensional latent space. This is true for both tested method, but for the Shape dataset, at later iterations 5 dimensional latent space performs better than 4 dimensional latent space. The results also show that if the dimensionality is increased, the performance increases until it plateaus or starts to decrease again. This is caused by one one hand the optimization problem becoming harder in high dimensions but on the other hand the latent space becoming more nuanced with more dimensions. The first property makes BO harder and the second property reduces how noisy the black-box function appears in the latent space.
The only dataset for which the joint training method outperforms the disjoint method is the Shape-dataset. The reason for this is that the joint training often leads to overfitting, which reduces its performance. However, the Shape dataset is optimal for joint training as the minimum of a black-box function is outside the data used to train the VAE. In other words, excellent performance requires big changes in VAE

Effect of different optimization space selection strategies
To study the effect of different acquisition space optimization strategies, we compare the optimization performance of the three datasets and two models using the three acquisition space restriction strategies described in Section 1, i.e. the hyperellipsoid, minimum convex hull and metric based on minimizing the extrapolation error. In addition to these, the simple hyper rectangle approach traditionally used in the BO literature is used as a baseline. The results are visualized in Figure 3.
The results are surprising as different optimization space restriction strategies seem to only have a very minimal effect on the results. To understand why we need to first understand how the VAE works on the edges of the search space. If a point z ∈ R d is selected from the corner of the hypercube in the latent space, it needs to be projected to xq φ (x | z )dx = x ∈ X to be evaluated. As the corner of the hypercube is outside the data (or on the edge if we are lucky), the decoder needs to extrapolate as it has not been trained using this kind of data. After evaluation, for x to be usable for the GPLVM, it needs to be projected back to the latent space R d (as a distribution p θ (z * | x )).
Since the decoder extrapolates, the point projected back to the latent space is not the same as z . figure 4 visualizes the Euclidean distance between the original point in the latent space X and the point that has first been projected to the original space x = xp(x | z )dx and then back to the latent space z * = z * q φ (z | x )dz. The figure shows that the further away the point is from the data used to train the VAE, the bigger the distance is. The figure does not show it, but the points sampled from outside the training data travel closer to the training data.
The poor extrapolation of VAE also changes how the BO works. Figure 5 visualizes this on the airline dataset with the acquisition function LCB. The figure visualizes how the acquisition function looks in the latent space of the VAE using two different optimization space restriction strategies in different phases of the BO loop. As the optimization space restriction methods are different, the maximums of the acquisition functions are in different locations. However, for the hyperrectangle approach, as the point is outside the training data, the decoder has to extrapolate and when the extrapolated and evaluated point is projected back to the latent space, it has traveled towards the data set (the distance between the red '*' on the first row and blue '+' on the second row). When comparing to the convex hull approach, the evaluated point has not traveled that much. After all, the effect of both methods is very similar. Counter intuitively, the poor extrapolation of Darker colors mean shorter distance and brighter colors mean larger distance. The VAE which latent space is visualized is trained using the Shape-data.
the VAE causes similar behaviour as more sophisticated optimization space restriction strategies.

Effect of different acquisition strategies
To study the effect of different acquisition functions, we compare the optimization performance of the three datasets and two models using four different acquisition functions using four dimensional latent space. The results are visualized in figure 6.
The different acquisition strategies show a behaviour that is comparable to their expected behaviour in regular, low dimensional, BO settings. This means that there is no single best acquisition strategy that would rule them all. For Airline data, exploitative strategies EI and PI perform better than the explorative strategies. For Shape data, explorative strategies perform better than the exploitative strategies, which is natural as the minimum is outside the training data. For SMILES data, explorative strategies perform slightly better. There is no significant difference between different methods for the same acquisition function. The order of the performance is similar for both the tested methods. Also, as demonstrated in the previous experiments, the performance of the joint training for Airline and SMILES datasets is affected by overfitting.

Conclusion
In this paper, we demonstrated how to design an optimization task in high dimensional structured spaces. Acquisition low Acquisition high Optimization bounds Initial observations New observation Candidate location Figure 5: The figure shows, on the example of the Airline dataset, how the acquisition function changes during one iteration in a BO for two separate optimization techniques (columns) a) space bounded with a hyper rectangle bounding the training data b) space bounded with a smallest convex hull bounding the training data. On iteration 0, the only difference between the two techniques is the location of the maximum of the acquisition function (marked as *). Even though the acquisition function surfaces are identical the maximums within the optimization area are different. On iteration 1, the maximum of the previous iteration has been projected to the original space to be evaluated and the evaluated point has been projected back to the latent space (marked as +). Since for the hyperrectangle approach, the maximum of the acquisition function is outside the data used to train the VAE, the distance between the maximum of the acquisition function (*) at iteration 0 and the evaluated point (+) at iteration 1 is large. As the maximum of the acquisition function is within the training data for the convex hull approach, the distance is much smaller. Since the maximums of the acquisition function are close to each other at iteration 0 for both restriction strategies, the points are evaluated at the same location and both methods have identical outcome after the first iteration.
We concentrated on methods combining deep generative models, such as Variational Auto Encoders, and Gaussian processes so that the regression model and optimization are performed in the smooth Euclidean low dimensional space found by the generative model.
We demonstrated the effect of dimensionality, optimiza- tion space restriction strategy and acquisition functions on the optimization performance. The results show that there is an optimum dimension for the deep generative model in which the optimization yields the best performance. We also demonstrated that there is no need to restrict the optimization space of the acquisition function. In addition to this, we demonstrated that acquisition functions have similar performance in structured problems as they do in the case of regular, low dimensional BO.
We also demonstrated that tuning the latent space by jointly learning the latent space and the GP model might lead to overfitting and an overall drop in performance. This can be seen in the case of Airline and Shape datasets, where the disjoint training approach performed better.
The remaining open question and potential topic for further research is how to avoid the apparent overfitting of the VAE parameters when training the VAE and GPLVM models jointly. Another topic for further research is to use sparse Variational Auto Encoders (Tonolini et al. [2020]) in BO to avoid the problem of having to decide the optimal latent space dimensionality (as demonstrated in Section 4.1).