EXPObench: Benchmarking Surrogate-based Optimisation Algorithms on Expensive Black-box Functions

Surrogate algorithms such as Bayesian optimisation are designed especially for black-box optimisation problems with expensive objectives, such as hyperparameter tuning or simulation-based optimisation. In the literature, these algorithms are usually evaluated with synthetic benchmarks which are well established but have no expensive objective, and only on one or two real-life applications which vary wildly between papers. There is a clear lack of standardisation when it comes to benchmarking surrogate algorithms on real-life, expensive, black-box objective functions. This makes it very difficult to draw conclusions on the effect of algorithmic contributions and to give substantial advice on which method to use when. A new benchmark library, EXPObench, provides first steps towards such a standardisation. The library is used to provide an extensive comparison of six different surrogate algorithms on four expensive optimisation problems from different real-life applications. This has led to new insights regarding the relative importance of exploration, the evaluation time of the objective, and the used model. We also provide rules of thumb for which surrogate algorithm to use in which situation. A further contribution is that we make the algorithms and benchmark problem instances publicly available, contributing to a more uniform analysis of surrogate algorithms. Most importantly, we include the performance of the six algorithms on all evaluated problem instances. This results in a unique new dataset that lowers the bar for researching new methods, as the number of expensive evaluations required for comparison is significantly reduced.


I. INTRODUCTION
Unlike other black-box optimisation algorithms, surrogate-based optimisation algorithms such as Bayesian optimisation [1] are designed specifically to solve problems with expensive objective functions. Examples are music generation [2], materials science [3], temperature control [4], optics [5], and computer vision [6]. By making use of a surrogate model that approximates the objective function, these algorithms achieve good results even with a low number of function evaluations. However, training and using the surrogate model is more computationally intensive than the use of typical black-box optimisation heuristics such as local search or population-based methods. This complicates thorough benchmarking of surrogate algorithms.
The current way of benchmarking surrogate algorithms does not give complete insight into the strengths and weaknesses of the different algorithms, for a variety of reasons. The most important reason is the lack of a standard benchmark set of problems that come from real-life applications and that also have expensive objective functions. Another reason is the lack of insight in the computational efficiency of surrogate algorithms.
In this work, we compare several surrogate algorithms on the same set of expensive optimisation problems from real-life applications, resulting in a public benchmark library that can easily be extended with new surrogate algorithms as well as with new problems. Our other contributions are:
• the creation of a meta-algorithmic dataset of surrogate algorithm performance on real-life expensive problems,
• insight into the strengths and weaknesses of existing surrogate algorithms and verifying existing knowledge from literature,
• investigating how algorithm performance depends on the available computational resources and the cost of the expensive objective,
• separating the effects of the choice of surrogate model and the acquisition step of the different algorithms.
We furthermore show that continuous models can be used on discrete problems and vice versa. The main insights that we obtained are that the accuracy of a surrogate model, and the choice between a continuous or discrete model, are less important than the evaluation time of the objective and the way the surrogate algorithm explores the search space.

II. BACKGROUND AND RELATED WORK
This section starts by giving a short explanation of surrogate-based optimisation algorithms, or surrogate algorithms for short. We then describe some of the shortcomings in the way surrogate algorithms are currently benchmarked: the lack of standardised benchmarks and the lack of insight in computational efficiency. Finally, we give an overview of related benchmark libraries and show how our library fills an important gap.

A. Surrogate-based optimisation algorithms
In surrogate optimisation, the goal is to minimise an expensive black-box objective function f : X → R, where X ⊆ R^d is the d-dimensional search space with d the number of decision variables. The objective can be expensive for various reasons, but in this work we assume f is expensive in terms of computational resources, as it involves running a simulator or algorithm. Optimising f using standard black-box optimisation algorithms such as local search methods or population-based techniques may require too many evaluations of the expensive objective. Surrogate algorithms reduce the number of required objective evaluations by iterating over three steps at every iteration i:
1) (Evaluation) Evaluate y_i = f(x_i) for a candidate solution x_i.
2) (Training) Update the surrogate model g : X → R by fitting the new data point (x_i, y_i).
3) (Acquisition) Use g to determine a new candidate solution x_{i+1}.
Usually, in the first R iterations x_i is chosen randomly and therefore the acquisition step is skipped for these iterations. The training step consists of machine learning techniques such as Gaussian processes or random forests, where the goal is to approximate the objective f with a surrogate model g. For the acquisition step, an acquisition function α is used that indicates which region of the search space is the most promising by trading off exploration and exploitation:

x_{i+1} = argmax_{x ∈ X} α(g(x)).    (1)

Example acquisition functions are Expected Improvement, Upper Confidence Bound, or Thompson sampling [1]. By far the most common surrogate algorithm is Bayesian optimisation [1], [7], which typically uses a Gaussian process surrogate model. Other common surrogate models are random forests, as used in the SMAC algorithm [8], and Parzen estimators, as used in HyperOpt [6]. Our own earlier work uses random Fourier features as surrogate models in the DONE algorithm [5] and piece-wise linear surrogate models in the IDONE and MVRSM algorithms [9], [10]. An overview of different methods and their surrogate models is given in Table I. Details about which methods are included in the comparison are given in Section III-B.
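To make the three steps concrete, the following minimal Python sketch implements a generic surrogate loop for minimisation. It uses a scikit-learn Gaussian process and a simple confidence-bound acquisition purely as placeholders; it is an illustrative sketch, not the implementation of any of the algorithms discussed in this paper.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def surrogate_minimise(f, lower, upper, n_iter=50, n_random=10, seed=0):
        """Generic surrogate-based minimisation loop (illustrative sketch only)."""
        rng = np.random.default_rng(seed)
        lower, upper = np.asarray(lower, float), np.asarray(upper, float)
        d = lower.size
        X, y = [], []
        for i in range(n_iter):
            if i < n_random:
                # First R iterations: random candidates, acquisition step skipped.
                x_next = rng.uniform(lower, upper)
            else:
                # Training step: fit the surrogate g on all evaluated points.
                g = GaussianProcessRegressor(normalize_y=True).fit(np.array(X), np.array(y))
                # Acquisition step: trade off exploration and exploitation via a
                # lower confidence bound, optimised over random candidates
                # (a crude stand-in for a proper inner optimiser).
                cands = rng.uniform(lower, upper, size=(2048, d))
                mu, sigma = g.predict(cands, return_std=True)
                x_next = cands[np.argmin(mu - 2.0 * sigma)]
            X.append(x_next)
            y.append(f(x_next))  # the expensive evaluation
        best = int(np.argmin(y))
        return np.array(X)[best], y[best]

    # Example usage on a cheap stand-in objective:
    # x_best, y_best = surrogate_minimise(lambda x: float(np.sum(x ** 2)), [-5, -5], [5, 5])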
Since the training and acquisition steps above are relatively expensive even if f would not be expensive to evaluate, surrogate algorithms are usually not compared on large numbers of synthetic benchmarks, as is common for other black-box optimisation methods that require less computational resources per iteration. The most common way of benchmarking in surrogate-algorithm literature at this moment is to test a new or variant surrogate algorithm and compare it with similar algorithms on some synthetic functions, and then on one or two real-life applications, which vary wildly across papers. While this approach makes sense given the computational resources required to run surrogate algorithms, it does not give enough insight into which algorithm to use in practice. In this section we briefly describe two shortcomings in the current way of benchmarking surrogate algorithms.
B. Shortcoming 1: lack of standardised real-life benchmarks
As mentioned, surrogate models have been applied to expensive objective functions in many different areas. A questionnaire on real-life optimisation problems [14] confirms that this type of objective function often appears in practice: "For example, we find that work specialising in handling expensive optimisation problems (such as e.g. surrogate-assisted optimisation) is highly relevant, as several responses to the questionnaire report objective/constraint evaluation times of more than one day." Since most surrogate algorithms are developed with the goal of being applicable to many different problems, these algorithms should be tested on multiple benchmark functions. Preferably, these benchmarks are standardised, meaning that they are publicly available, easy to test on, and used by a variety of researchers. For synthetic benchmarks, standardised benchmark libraries such as COCO [15] have been around for several years now, and these types of benchmark functions are often used for testing surrogate algorithms as well. However, benchmarks from real-life applications are much harder to find, even though they are common in practice. As noted in [16], "unfortunately, despite its importance, studies to compare various optimisation algorithms on real-world problems are still limited, mainly because such problems are typically not publicly available." Simply taking the benchmarking results on synthetic functions and applying them to expensive real-life applications, or adding a delay to the synthetic function, is not enough [16]-[20]. An example is the ESP benchmark discussed later in this paper. For this benchmark we have noticed that changing only one of the variables at a time leads to no change in the objective value at all, meaning that there are more 'plateaus' than in typical synthetic functions used in black-box optimisation. Another example is that of hyperparameter tuning, a problem that is known to sometimes contain properties that are not present in common synthetic benchmark functions [21]. In general, expensive objectives are often expensive because they are the result of some kind of complex simulation or algorithm, and the resulting fitness landscape is therefore much harder to analyse or model than that of a synthetic function. In contrast, synthetic functions can simply be described with a mathematical expression.
It is clear that there is a need for a more standardised approach in benchmarking real-life optimisation problems, especially for surrogate algorithms. Even though this lack of expensive benchmarks already holds for the typical problems assumed in surrogate algorithms, namely continuous optimisation of expensive objectives, the need is particularly high for expensive discrete problems, as noted in [22]: "While benchmarking is still not resolved for continuous model-based optimisation, the situation is even less settled in the discrete domain. Of the few published, real-world, expensive, combinatorial problems, most are not openly accessible. Even in case of availability, the benchmark set would be rather small and the expense of computation would hinder broader experimental studies."

TABLE I: Overview of the different methods and their surrogate models.
Algorithm | Surrogate model
SMAC [8], [11] | Random forest
HyperOpt [6] | Parzen estimator
Bayesian Opt. [1], [12] | Gaussian process (GP)
CoCaBO [13] | GP + multi-armed bandit
DONE [5] | Random Fourier features
IDONE [9] | Piece-wise linear
MVRSM [10] | Piece-wise linear

C. Shortcoming 2: lack of insight in computational efficiency
In many works on surrogate algorithms, computation times of the algorithms are not taken into consideration, and are often not even reported. This is because of the underlying assumption that the expensive objective is the bottleneck. However, completely disregarding the computation time of the surrogate algorithm leads to the development of algorithms that are too time-consuming to be used in practice. In some cases, the algorithms are even slower than the objective function of the real-life application, shifting the bottleneck from the expensive objective to the algorithm. Computation times should be reported, preferably for problems of different dimensions so that the scalability of the algorithms can be investigated. This also helps answer the open question posed in [18]: "One central question to answer is at what point an optimisation problem is expensive "enough" to warrant the application of surrogate-assisted methods." Since many surrogate algorithms have a computational complexity that increases with every new function evaluation [10], it is even more preferable to report the computation time used by the surrogate algorithm at every iteration, to gain more insight into the time it takes to run surrogate algorithms for different numbers of iterations. Besides the computation time used by the algorithms, different real-life applications have different budgets available that put a limit on the number of function evaluations or total computation time. Taking this computational budget into account is a key issue when tackling real-world problems using surrogate models [16]. Yet for most surrogate algorithms, it is not clear how they would perform under different computational budgets.

D. Related benchmark environments
From the way surrogate algorithms are currently benchmarked and the shortcomings that come with it, we conclude that we do not sufficiently understand their performance, regarding both solution quality and run-time, on realistic expensive black-box optimisation problems. A benchmark library can help in gaining more insight, as algorithms are compared on the same set of test functions. In the context of black-box optimisation, such a library consists of multiple objective functions and their details (such as the search space and problem dimension, whether variables are continuous or integer, etc.) and possibly of baseline algorithms that can be applied to the problems. For non-expensive problems, many such libraries exist [15], [23], [24], particularly with synthetic functions. Some of these libraries also contain real-life functions that are not expensive [25]-[27]. See Table II for an overview of related benchmark environments.
The real-life problems to which surrogate algorithms are usually applied can roughly be divided into computer science problems and engineering problems, or digital and physical problems. Examples of the former are automated algorithm configuration [8] and hyperparameter tuning for machine learning [28], while the latter deal with (simulators of) a physical problem such as aerodynamic optimisation [29]. Even though surrogate models are used in both problem domains, these two communities often stay separate: most benchmark libraries that contain expensive real-life optimisation problems only deal with one of the two types, for example in automated machine learning [30]- [34] or computational fluid dynamics [35]. A notable exception is Nevergrad [36], which contains a wide variety of problems varying from power plant simulation to neural control of robots. The problem with focusing on only one of the two domains is that domain-specific techniques such as early stopping of machine learning algorithms [37] or adding gradient information from differential equations [38], [39] are exploited when designing new surrogate algorithms, making it difficult to transfer the domain-independent scientific progress in surrogate algorithms from one domain to the other.
Most of the benchmark libraries do not contain any benchmark solutions in the form of surrogate algorithms, and often not even any type of solution at all. Besides Nevergrad, benchmark libraries that contain more than one surrogate algorithm and more than one expensive problem are computer science libraries such as HPOlib, BayesMark, AMLB, and AClib, which do not contain any engineering or physical problems. One exception is SUMO [40], a commercial toolbox that contains many surrogate models, and a wide variety of applications. Unfortunately, this Matlab tool is over 10 years old, and only a restricted version is available for researchers, making it less suitable for benchmarking. It only supports low-dimensional continuous problems, and newer surrogate algorithms that were developed in the last decade are not implemented.
What is currently missing is a modern benchmark library that is aimed at real-life expensive benchmark functions, not just from computer science but also from engineering, and that also contains baseline surrogate algorithms that can easily be applied to these benchmarks, such as SMAC, HyperOpt, and Bayesian optimisation with Gaussian processes.

III. PROPOSED BENCHMARK LIBRARY: EXPOBENCH
In this section we introduce EXPObench: an EXPensive Optimisation benchmark library. We propose a benchmark suite focusing on single-objective, expensive, real-world problems, consisting of many integer, categorical, and continuous variables or mixtures thereof. The problems come from different engineering and computer science applications, and we include seven baseline surrogate algorithms to solve them. See Table II for details on how EXPObench compares to related benchmark environments (CoCo [15], HeuristicLab [23], ParadisEO [24], HyFlex [25], SOS [26], IOHprofiler [27], GBEA [34], CFD [35], CompModels [41], NAS-Bench [31]-[33], DAC-Bench [42], and RBFopt [43]). The simple framework of this benchmark library makes it possible for researchers in surrogate models to compare their algorithms on a standardised set of real-life problems, while researchers with expensive optimisation problems can easily try a standard set of surrogate algorithms on their problems. This way, our benchmark library advances the field of surrogate-based optimisation.
It should be noted that synthetic benchmark functions are still useful, as they are less time-consuming and have known properties. We therefore still include synthetic benchmarks in our library, though we do not discuss them in this work. We encourage researchers in surrogate models to use synthetic benchmarks when designing and investigating their algorithm, and then use the real-life benchmarks presented in this work as a stress test to see how their algorithms hold up against more complex and time-consuming problems.
In the remainder of this section, we describe the problems and the approaches to solve these problems that we have added to EXPObench.

A. Included Expensive Benchmark Problems
The problems that were included in EXPObench were selected in such a way that they cover a variety of applications, dimensions, and search spaces. To encourage the development of surrogate algorithms for applications other than computer science, we included several engineering problems, one of which was first introduced in the CFD benchmark library [35]. The problem dimensions were chosen to be difficult for standard surrogate algorithms: Bayesian optimisation is typically applied to problems with fewer than 10 variables. Two of our problems have 10 variables, though it is possible to scale them up, while the other problems contain tens or even over 100 variables. This is in line with our view of designing surrogate algorithms using easy, synthetic functions, and then testing them on more complicated real-life applications. Since discrete expensive problems are also an active research area, we included one discrete problem as well as a problem with a mix of discrete and continuous variables.
The problems were carefully selected to have expensive objectives that take longer to evaluate than synthetic functions, but not so long that benchmarking becomes impossible. On our hardware (see Section IV-A), the time it takes to evaluate the objective function varies between 2 and 60 seconds depending on the problem.
We now give a short description of the four real-life expensive optimisation problems that are present in EXPObench.

1) Wind Farm Layout Optimisation (Windwake):
This benchmark utilises a wake simulator called FLORIS [49] to determine the amount of power a given wind farm layout produces. The wake effect of wind turbines can cause other turbines on a wind farm to produce less power due to turbulence. Compared to other simulators, FLORIS does not model complex effects and is therefore computationally cheaper to run.
The original simulator calculates the power of the wind farm for a given layout and randomly generated wind rose data. To make the layout more robust to different wind conditions, we decided to use as output instead the power averaged over multiple scenarios, where each scenario uses randomly generated wind rose data, generated with the same distribution. That is, we look at a Monte Carlo simulation to compute the average power, in line with applications in for example logistics [9], [50], [51]. This causes the objective to be stochastic, something that many traditional optimisation algorithms struggle with.
The computational cost of the objective depends on the number of wind turbines w, as well as the number of generated scenarios s. A solution is represented by a sequence of w pairs of (x, y) coordinates, one pair per wind turbine, which can take on continuous values. The output is −1 times the power averaged over multiple scenarios, which takes about 15 seconds to compute on our hardware for w = 5, s = 5.
It should be noted that this particular problem has constraints besides upper and lower bounds for the position of each wind turbine: turbines are not allowed to be located within a factor of two of each other's radius. As the goal of this work is not to compare different ways to handle constraints, we use the naive approach of incorporating the constraint in the objective: the objective simply returns 0 when constraints are violated.
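To illustrate how the Monte Carlo averaging and the naive constraint handling fit together, the sketch below mimics the structure of the Windwake objective. The functions generate_wind_rose and simulate_power, as well as the turbine radius, are hypothetical toy stand-ins for the FLORIS-based simulation, not the actual EXPObench code.

    import numpy as np

    TURBINE_RADIUS = 60.0  # hypothetical rotor radius in metres, for illustration only

    def generate_wind_rose(rng):
        # Stand-in for random wind rose data drawn from a fixed distribution.
        return {"direction": rng.uniform(0.0, 360.0), "speed": rng.uniform(5.0, 15.0)}

    def simulate_power(layout, wind_rose):
        # Toy stand-in for the FLORIS wake simulation: rewards spread-out layouts.
        dists = [np.linalg.norm(a - b) for i, a in enumerate(layout) for b in layout[i + 1:]]
        return wind_rose["speed"] * float(np.mean(dists))

    def windwake_objective(x, w=5, s=5, seed=None):
        """Sketch of the Windwake objective: -1 times the power averaged over s
        Monte Carlo scenarios, or 0 when the spacing constraint is violated."""
        rng = np.random.default_rng(seed)
        layout = np.asarray(x, float).reshape(w, 2)  # (x, y) coordinates per turbine
        # Naive constraint handling: turbines within twice each other's radius
        # make the objective return 0 immediately.
        for i in range(w):
            for j in range(i + 1, w):
                if np.linalg.norm(layout[i] - layout[j]) < 2 * TURBINE_RADIUS:
                    return 0.0
        powers = [simulate_power(layout, generate_wind_rose(rng)) for _ in range(s)]
        return -float(np.mean(powers))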
2) Pipe Shape Optimisation (Pitzdaily): One of the engineering benchmark problems proposed in the CFD library [17], called PitzDaily, is pipe shape optimisation. This benchmark uses a computational fluid dynamics simulator to calculate the pressure loss for a given pipe shape. The pipe shape can be specified using 5 control points, giving 10 continuous variables in total. The time to compute the pressure loss varies from 2 to 60 seconds on our hardware. Although the search space is continuous, there are constraints to this problem: violating these constraints returns an objective value of 2, which is higher than the objective value of feasible solutions.
3) Electrostatic Precipitator (ESP): This engineering benchmark contains only discrete variables. The ESP is used in industrial gas filters to filter pollution. The spread of the gas is controlled by metal plates referred to as baffles. Each of these baffles can be solid, porous, angled, or even missing entirely. This categorical choice of configuration for each baffle constitutes the search space for this problem.
There are 49 baffle slots in total, each of which has 8 categorical options. The output is calculated using a computational fluid dynamics simulator [52], which takes about 28 seconds to return the output value on our hardware.

4) Hyperparameter Optimisation and Preprocessing for XGBoost (HPO):
This automated machine learning benchmark is a hyperparameter optimisation problem: the approach, namely an XGBoost [53] classifier, has already been selected, but contains a significant number of configuration parameters of various types, including parameters on the pre-processing step. Variables are not only continuous, integer or categorical, but also conditional: some of them remain unused depending on the value of other variables. In total, there are 135 variables, most of which are categorical.
The configuration is evaluated by 5-fold cross-validation on the Steel Plates Faults dataset (http://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults), and the objective returns this value multiplied by −1. Since there can be a trade-off between accuracy and computation time for different configurations, we set a time limit of 8 seconds, as this was roughly equal to twice the time it takes to use a default configuration on our hardware. Configurations for which the time limit is violated return an objective value of 0.
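A rough sketch of this evaluation logic is given below, assuming the features X and labels y of the Steel Plates Faults dataset are already loaded and that params is a dictionary of valid XGBoost hyperparameters. The preprocessing and conditional parameters, and the exact enforcement of the time limit, are simplified compared to the benchmark itself.

    import time
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    TIME_LIMIT = 8.0  # seconds, roughly twice the default-configuration time on our hardware

    def hpo_objective(params, X, y):
        """Sketch of the HPO objective: -1 times the 5-fold cross-validation
        accuracy, or 0 when evaluating the configuration exceeds the time limit."""
        clf = XGBClassifier(**params)  # configuration under test (preprocessing omitted)
        start = time.time()
        scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        if time.time() - start > TIME_LIMIT:
            return 0.0  # time limit violated
        return -float(np.mean(scores))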

B. Approaches
In this section we show the approaches that are considered in the benchmark library. We limit ourselves to popular single-objective surrogate algorithms that are easily implemented and open-source, and that do not focus on extensions of the expensive optimisation problem such as a batch setting, a multi-fidelity setting, highly constrained problems, etc. These include a Bayesian optimisation algorithm [1], [12], which uses Gaussian processes with a Matérn 5/2 kernel, SMAC [8], and HyperOpt [6]. We also include our own earlier work [5], [9], [10], with the DONE, IDONE and MVRSM algorithms. These make use of either random Fourier features or piece-wise linear functions as the surrogate model. A modern variant of Bayesian optimisation, namely CoCaBO [13], is also included in the benchmark library, but not presented in this work due to the required computation time. The baseline with which all algorithms are compared is random search [54], for which we use HyperOpt's implementation. We also include several local and global search algorithms in our library (Nelder-Mead, Powell's method and basin-hopping, among others), but these failed to outperform random search on all of our benchmark problems, and are therefore not presented in this work.
Not all of these algorithms can deal with all types of variables, although often naive implementations are possible: discretisation to let discrete surrogates deal with continuous variables, rounding to let continuous surrogates deal with discrete variables, and/or ignoring the conditional aspect of a variable entirely. Table I shows the types of variables that are directly supported by the surrogate models used in each algorithm.
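These naive conversions can be realised as thin wrappers around the expensive objective. The sketch below is a generic illustration under the assumption of box-bounded variables, not the exact wrappers used in EXPObench.

    import numpy as np

    def with_rounding(f, is_integer):
        """Let an algorithm with a continuous surrogate handle discrete variables:
        round the integer dimensions before every expensive evaluation."""
        is_integer = np.asarray(is_integer, bool)

        def f_rounded(x):
            x = np.asarray(x, float).copy()
            x[is_integer] = np.round(x[is_integer])
            return f(x)

        return f_rounded

    def with_discretisation(f, lower, upper, steps=64):
        """Let an algorithm with a discrete surrogate handle continuous variables:
        map integer grid indices 0..steps-1 back to the continuous box [lower, upper]."""
        lower = np.asarray(lower, float)
        upper = np.asarray(upper, float)

        def f_grid(k):
            k = np.asarray(k, float)
            x = lower + (upper - lower) * k / (steps - 1)
            return f(x)

        return f_grid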

IV. RESULTS
The different surrogate algorithms are objectively compared on all four different real-life expensive benchmark problems of EXPObench. The goals of the experiments are three-fold:
• gain insight into the strengths and weaknesses of existing surrogate algorithms and verify existing knowledge from literature,
• investigate how algorithm performance depends on the available computational resources and the cost of the expensive objective,
• separate the effects of the choice of surrogate model and the acquisition step of the different algorithms.
The results of comparing the different surrogate algorithms on the problems of EXPObench provide a new dataset that we use for these three goals, and that we make available publicly. This dataset includes the points in the search space chosen for evaluation by each algorithm, the resulting value of the expensive objective, the computation time used to evaluate the objective, and the computation time used by the algorithm to suggest the candidate point. The latter includes both the training and acquisition steps of the algorithms, as it was not easy to separate these two for all algorithms. Although we perform some initial analysis on this meta-algorithmic dataset, it can also be used by future researchers in, for example, instance space analysis [55], meta-learning [56], or building new surrogate benchmarks from this tabular data [33].
We start this section by giving the experimental details, followed by the results on the four benchmark problems. We then investigate the influence of the computational budget and cost of the expensive objective, followed by a separate investigation of the choice of surrogate model.
A. Experiment details
1) Hardware: We use the same hardware when running the different surrogate algorithms on the different benchmark problems. All these experiments are performed in Python, on an Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz with 32 GB of RAM. Each approach and evaluation was performed using only a single CPU core.
2) Hyperparameter settings: All methods use their default hyperparameters with the exception of SMAC, which we set to deterministic mode to avoid repeating the exact same function evaluations, which drastically decreased performance in our experience. For the MVRSM method, we set the number of basis functions in purely continuous problems to 1000. We have not adapted IDONE for continuous or mixed problems.
3) Normalisation: To make comparison between benchmarks easier, we normalise the best objective value found by each algorithm at each iteration in the figures shown in this section. This is done as follows: using the best objective value found by random search as a baseline, let r_0 be the average of this baseline after 1 iteration, and let r_1 be the average of this baseline after the number of random initial guesses R that each algorithm used. Then all objective values f are normalised as

f' = (r_0 − f) / (r_0 − r_1),

meaning that r_0 corresponds to a normalised objective of 0, r_1 corresponds to a normalised objective of 1, and a higher normalised objective is better. Note that this is only used in Figure 1. This normalisation is possible since all surrogate algorithms start with the same number of random evaluations R, which we omit from the figures. Other visualisation tools that are popular in black-box optimisation, such as ECDF curves, are less suitable for our results since the optimum is unknown for our benchmarks and we look at only one benchmark at a time.

4) Software environment: EXPObench is available as a public GitHub repository (https://github.com/AlgTUDelft/ExpensiveOptimBenchmark) and is implemented in the Python programming language. To stimulate future users to add their own problems and approaches to this library, we have taken care to make this as easy as possible and provide documentation to achieve this. We also provide an interface that can easily run one or multiple approaches on a problem in the benchmark suite using the command line interface in run_experiment.py. An example is the following command:

python run_experiment.py --repetitions=7 --out-path=./results/esp --max-eval=1000 --rand-evals-all=24 esp randomsearch hyperopt bayesianoptimization

This runs random search, HyperOpt and Bayesian optimisation on the ESP problem for 1000 iterations, of which the first 24 iterations are random, repeats this seven times, and outputs the results to the given folder.
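Returning to the normalisation described in item 3 above, it amounts to the following small helper; this is a sketch, and the argument names are ours.

    import numpy as np

    def normalise(f_values, r0, r1):
        """Normalise best-so-far objective values as used in Figure 1 (sketch).

        r0: average best value found by random search after 1 iteration.
        r1: average best value found by random search after the R random initial guesses.
        r0 maps to 0 and r1 maps to 1; higher normalised values are better."""
        return (r0 - np.asarray(f_values, float)) / (r0 - r1)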

B. Benchmark results
We now share the results of applying all algorithms in EXPObench to the different benchmark problems. The IDONE algorithm is only applied to the ESP problem since it does not support continuous variables. To investigate statistical significance of the results, we also report p-values of a pairwise Student's t-test at the last iteration on unnormalised data.

1) Wind Farm Layout: For the wind farm layout optimisation problem, Figure 1a shows the normalised best objective value found at each iteration by the different algorithms, as well as the computation time used by the algorithms at every iteration. All algorithms started with R = 20 random samples not shown in the figure. None of the algorithms use more computation time than the expensive objective itself, which took about 15 seconds per function evaluation. While random search is the fastest method, it fails to provide good results, as is expected for a method that does not use any model or heuristic to guide the search. Interestingly, Bayesian optimisation (BO) does not outperform random search on this problem (p > 0.6) and is outperformed by all other methods (p < 0.01), even though it is designed for problems with continuous variables. In contrast, MVRSM and SMAC both perform quite well on this problem even though they are designed for problems with mixed variables, though they both take up more computational resources. DONE, another algorithm designed for continuous problems, performs similarly to MVRSM and SMAC (p > 0.1).
2) Pitzdaily: Figure 1b shows the results of the Pitzdaily pipe shape optimisation problem with R = 20. It can be seen that DONE fails to provide meaningful results. Upon inspection of the proposed candidate solutions, it turns out that the algorithm gets stuck on parts of the search space that violate the constraints. This happens even despite finding feasible solutions early on and despite the penalty for violating the constraints. SMAC, HyperOpt (HO) and MVRSM are the best performing methods on average, outperforming the other three methods (p < 0.05) but not each other (p > 0.6).
3) ESP: In this discrete problem, algorithms that only deal with continuous variables resort to rounding when calling the expensive objective. This is considered suboptimal in the literature; however, earlier work shows that this is not necessarily the case for the ESP problem [57]. Indeed, we see in Figure 1c that Bayesian optimisation is the best performing method on this problem, outperforming all methods (p < 0.03) except MVRSM and SMAC (p > 0.2). This counters the general belief that Bayesian optimisation with Gaussian processes is only adequate on low-dimensional problems with only continuous variables. Another observation is that MVRSM performs much better than IDONE (p < 0.01), which fails to significantly outperform random search (p > 0.9) even though IDONE is designed for discrete problems. The surrogate algorithms also use less computation time than the expensive objective, which took about 28 seconds per iteration to evaluate.

4) XGBoost Hyperparameter Optimisation: As in the previous benchmark, the algorithms that only deal with continuous variables use rounding for the discrete part of the search space in this problem. For dealing with conditional variables with algorithms that do not support them, we use a naive approach: changing such a variable simply has no effect on the objective function when it disappears from the search space, resulting in a larger search space than necessary. Figure 1d shows the results for this benchmark. This time, the results are less surprising: SMAC and HyperOpt, two algorithms designed for hyperparameter optimisation with conditional variables, give the best performance. Though they perform similarly to each other (p > 0.1), they outperform all other methods (p < 0.03). MVRSM is designed for mixed-variable search spaces like the one in this problem, but not for conditional variables, and outperforms random search (p < 0.03) but not BO and DONE (p > 0.05). BO and DONE both fail to outperform random search (p > 0.1). If we also consider computation time, HyperOpt appears to be a better choice than SMAC, being faster by more than an order of magnitude.

C. Varying time budget and function evaluation time
In real-life optimisation problems that have limitations on the time to solve the task, the computation time of the optimisation algorithms is important to take into consideration. Hence, we investigate how the algorithms perform with various time budgets and different objective evaluation times. More specifically, instead of restricting the number of evaluations as done up until now, the algorithms are stopped if their runtime exceeds a fixed time budget. The runtime includes both the total function evaluation time as well as the computation time required for the training and acquisition steps of the algorithm.
This experiment extends the results of the benchmark by also putting emphasis on the computation time of the algorithm in addition to their respective sample efficiency. On top of that, it provides information that can be used to decide which algorithm is suitable given a time budget and how expensive the objective function is.
To investigate this in practice, we use the data gathered in the experiments shown in this section, artificially changing the time budget and evaluation time of the expensive objective functions as in earlier work [57]. Because we know the algorithms' computation times for each iteration in the experiments, it is possible to simulate what the total runtime would be if the function evaluation time is adjusted. Then, we report which algorithm returns the best solution when the time budget has been reached, for various time budgets and evaluation times.
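A minimal sketch of this re-simulation is shown below; the per-iteration algorithm times and objective values are taken from the logged dataset, and the argument names are illustrative rather than the names used in our analysis scripts.

    import numpy as np

    def best_within_budget(objective_values, algo_times, eval_time, budget):
        """Best objective value an algorithm would have reached within `budget`
        seconds if every expensive evaluation took `eval_time` seconds instead of
        its measured time (sketch of the re-simulation)."""
        objective_values = np.asarray(objective_values, float)
        algo_times = np.asarray(algo_times, float)
        # Simulated cumulative runtime per iteration:
        # algorithm (training + acquisition) time plus the artificial evaluation time.
        runtime = np.cumsum(algo_times + eval_time)
        completed = runtime <= budget
        if not completed.any():
            return None  # budget exhausted before the first evaluation finished
        return float(objective_values[completed].min())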
The evaluation time ranges from 0.12 ms to 36 hours, while the time budget ranges from 0.49 ms to 36 hours. In case the time budget is not reached within the maximum number of iterations that we have observed in the other experiments, for at least one of the algorithms, no results are reported. Figure 2 displays which algorithm returns the best solution for each problem for a variety of time budgets (x-axis) and function evaluation times (y-axis). Each algorithm has a different marker, and the colour indicates the objective value of the best found solution (without normalisation, so lower is better). As expected, we observe that the objective value decreases when the time budget increases and the evaluation time remains fixed. However, it appears that different algorithms perform well in regions with certain time budgets and evaluation times.
For the Windwake problem we see that BO, HyperOpt, SMAC, DONE and random search all perform best in different settings. BO seems to perform best when the number of iterations is low, regardless of the time budget; SMAC performs best for larger time budgets and evaluation times; and random search performs best for low evaluation times. HyperOpt and DONE perform well on semi-expensive objective functions in the 10 to 1000 ms range.
The observations are similar for the PitzDaily and ESP problems, except that DONE had a poor performance on the PitzDaily problem and SMAC gets outperformed by BO on the ESP problem.
Lastly, for the hyperparameter optimisation problem, it can be seen that HyperOpt is favoured over SMAC due to its computational efficiency, though SMAC performs well with cheaper objective functions. Given a low enough time budget, random search gives the best results, even for expensive objective functions.

D. Offline learning of surrogates
As a final experiment we investigate the choice of surrogate model in the different surrogate algorithms. Though an extensive investigation would require significant adaptations to the algorithms and their implementations, as the choice of surrogate model is heavily intertwined with the choice of the acquisition step in each algorithm, we show how the dataset generated in this work can be used in a simple offline supervised learning framework. This is achieved by training and testing different models on the data and considering the resulting errors, to discover how well different models fit the data for different problem domains.
We limit the scope to the Pitzdaily and ESP problems here, and generate different training sets for each (more data, as well as standard deviations, can be found in the appendix). Each training set consists of the first 500 candidate solutions and objective function values gathered by one run of a specific algorithm, including the first random iterations. We then train a variety of machine learning models on this dataset, with the goal of predicting the (unnormalised) objective function value corresponding to the candidate solution. Using a quadratic loss function, this results in a number of machine learning models equal to the number of algorithms times the number of runs, for each type of machine learning model. The models we used are taken from the Python scikit-learn library [58], and we also add XGBoost and the piece-wise linear model used by the IDONE and MVRSM algorithms, giving seven models in total: a linear regression model (Linear), polynomial regression with degree 2 (Quadratic), the piece-wise linear model used by MVRSM and IDONE (PWL), a random forest with default hyperparameters (RF), an XGBoost model with default hyperparameters (XGBoost), the Gaussian process used by Bayesian optimisation (GP), and a multi-layer perceptron with default hyperparameters (MLP).
As a test set, for each problem we concatenate all the candidate points and function evaluations that were evaluated by each surrogate algorithm for every run, and keep the 1000 points with the best objective value. As the global optimum is unknown in our benchmark problems, this shows how the different models would perform in good regions of the search space.
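A condensed sketch of this offline experiment, using scikit-learn and XGBoost regressors with default hyperparameters and a quadratic (mean squared error) loss, is given below; the set of models shown is a subset of the seven listed above and the data loading is omitted.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from xgboost import XGBRegressor

    def offline_surrogate_errors(X_train, y_train, X_test, y_test):
        """Fit several surrogate models offline on the points logged by one
        algorithm run and report (train error, test error) per model (sketch)."""
        models = {
            "Linear": LinearRegression(),
            "RF": RandomForestRegressor(),
            "XGBoost": XGBRegressor(),
            "GP": GaussianProcessRegressor(normalize_y=True),
        }
        errors = {}
        for name, model in models.items():
            model.fit(X_train[:500], y_train[:500])  # first 500 logged points
            errors[name] = (
                mean_squared_error(y_train[:500], model.predict(X_train[:500])),
                mean_squared_error(y_test, model.predict(X_test)),  # 1000 best points
            )
        return errors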
If we only train on data gathered with random search, we can already see that some models are prone to overfitting. Table III shows that the models with the lowest training error are not necessarily the most accurate near the optimum, and may even be outperformed by a simple linear regression model there. For example, the quadratic model has more parameters than training data points for the ESP problem, which causes this expected behaviour as it does not make use of regularisation. Furthermore, discrete models such as random forest and XGBoost with their default hyperparameters have a good generalisation performance, not just on the discrete ESP problem but also on the continuous Pitzdaily problem, even though their training error is a bit higher than that of other models.
If we train models on data gathered by a surrogate algorithm that uses that model or an approximation thereof, we get the results shown in Table IV. The piece-wise linear model (PWL) used for training is exactly the same surrogate model used by the IDONE and MVRSM algorithms, and the Gaussian process (GP) is exactly the same as the one used by Bayesian optimisation (BO). The random forest (RF) is not exactly the one used by SMAC, but uses default hyperparameters, and the model used by DONE is only an approximation of a Gaussian process. We see that the training error on data gathered by DONE can get very low, but this does not mean that DONE is a good surrogate algorithm, as we saw it perform very poorly on the Pitzdaily problem. A likely explanation is that the acquisition is not leading to the right data points. More interesting are the test errors: though a Gaussian process trained on data gathered by a surrogate algorithm that uses this model (BO) receives a low test error, an XGBoost model trained on data gathered by random search can get an even lower test error as seen in Table III. The test error for XGBoost trained on data gathered by BO, not shown in these tables,  is 0.997 for the Pitzdaily problem and 0.701 for the ESP problem. Lastly, surrogate models do not always achieve a lower training error on data gathered by a method that uses that model than on data gathered by random search.

V. DISCUSSION
Based on the observations of the previous section, we highlight the most important insights that were obtained. First of all, the type of variable a surrogate model is designed for is not necessarily a good indicator of the performance of the surrogate algorithm on a real-life problem: discrete surrogates can perform well on continuous problems, and vice versa. We saw this on the wind farm layout optimisation problem, a continuous problem where a discrete surrogate model (SMAC's random forest) had the best performance, and on the ESP problem, a discrete problem where the continuous Gaussian process surrogate model had the best performance, even though it was unable to outperform random search on the wind farm layout optimisation problem. Part of these insights was known from previous work [57], but we extended them to continuous problems and to more benchmark problems and surrogate algorithms. The experiments using offline learning of machine learning models also showed that discrete models such as XGBoost can have a lower generalisation error than continuous models, even on data coming from a continuous problem like Pitzdaily.
Second, our observations lead us to believe that exploration is more important than model accuracy in surrogate algorithms. The offline learning experiments showed that surrogate models trained on data gathered by an algorithm that uses that model are not necessarily more accurate than surrogate models trained on data gathered by random search, a high-exploration method. The use of random search should also not be underestimated, as the experiment where we artificially change the evaluation time of the objective shows that for all considered benchmarks there are situations where random search outperforms all surrogate algorithms, mainly when the objective evaluation time is low. Furthermore, on the ESP problem, MVRSM had a much better performance than IDONE, even though they use exactly the same piece-wise linear surrogate model on that problem. The only difference between the two algorithms is that MVRSM has a higher exploration rate. The low training error of the piece-wise linear surrogate model shows that a highly accurate model does not necessarily lead to a better performance of the surrogate algorithm using that model.
Finally, the available time budget and the evaluation time of the objective strongly influence which algorithm is the best choice for a certain problem. This can be seen from the experiment where we artificially change the function evaluation time, where the best performing algorithm changes depending on the available time budget and function evaluation time. Given that most algorithms used less computation time than the function evaluation time in our experiments, there is room to spend additional computational resources on improving the exploration strategy or on using an ensemble method. Using an ensemble of surrogate models or acquisition strategies should also help with robustness, as many algorithms failed to outperform random search on at least one problem. Alternatively, speeding up the surrogate algorithms should make them more suitable for less expensive problems than the benchmarks in this paper. Having a fixed computation time for every iteration, which is the case for the DONE, IDONE and MVRSM algorithms, should help in this respect.

VI. CONCLUSION AND FUTURE WORK
We have benchmarked six surrogate algorithms on four real-life expensive benchmark problems, which gave rise to:
• a public benchmark library called EXPObench that contains real-life expensive benchmark functions and baseline algorithms that can solve these benchmarks;
• insights into how different aspects of the problems and algorithms influence the algorithm performance;
• a public dataset containing the algorithm performance on these benchmarks.
The benchmark library fills an important gap in the current landscape of optimisation benchmark libraries, which mostly consists of cheap-to-evaluate benchmark functions or of expensive problems with no or limited baseline solutions from the surrogate model literature. A first analysis of the dataset showed how the best choice of algorithm for a certain problem depends on the available time budget and the evaluation time of the objective, and we provided a method to extrapolate such results to real-life problems that contain expensive objective functions with different costs. The dataset also allowed us to train surrogate models offline rather than online, giving insight into the generalisation capabilities of the surrogate models and showing the potential of models such as XGBoost to be used in new surrogate algorithms in the future. Finally, we showed how continuous models can work well for discrete problems and vice versa, and we highlighted the important role of exploration in surrogate algorithms. In future work we will focus on methods that can deal with the constraints present in some of the benchmark problems from this work, as well as make a comparison with surrogate-assisted evolutionary methods.