Olympus: a benchmarking framework for noisy optimization and experiment planning

Research challenges encountered across science, engineering, and economics can frequently be formulated as optimization tasks. In chemistry and materials science, recent growth in laboratory digitization and automation has sparked interest in optimization-guided autonomous discovery and closed-loop experimentation. Experiment planning strategies based on off-the-shelf optimization algorithms can be employed in fully autonomous research platforms to achieve desired experimentation goals with the minimum number of trials. However, the experiment planning strategy that is most suitable to a scientific discovery task is a priori unknown, while rigorous comparisons of different strategies are highly time and resource demanding. As optimization algorithms are typically benchmarked on low-dimensional synthetic functions, it is unclear how their performance would translate to noisy, higher-dimensional experimental tasks encountered in chemistry and materials science. We introduce Olympus, a software package that provides a consistent and easy-to-use framework for benchmarking optimization algorithms against realistic experiments emulated via probabilistic deep-learning models. Olympus includes a collection of experimentally derived benchmark sets from chemistry and materials science and a suite of experiment planning strategies that can be easily accessed via a user-friendly Python interface. Furthermore, Olympus facilitates the integration, testing, and sharing of custom algorithms and user-defined datasets. In brief, Olympus mitigates the barriers associated with benchmarking optimization algorithms on realistic experimental scenarios, promoting data sharing and the creation of a standard framework for evaluating the performance of experiment planning strategies.


I. INTRODUCTION
Optimization tasks are ubiquitous across science, engineering, and economics. They typically involve the identification of specific choices for controllable parameters under which a system of interest yields a desired response. The development of efficient strategies that lead to the discovery of such optimal parameter choices is of significant importance and has long been of interest to many scientific communities. Selecting an appropriate optimization strategy for a problem with unknown structure is non-trivial given that a single, overall superior strategy does not exist. 1,2 Specifically, the qualities of a given optimization strategy, including its convergence behavior, computational demand, and requirements on the function to be optimized, could be ideal for some applications but render the same strategy inapplicable to other tasks. Understanding the challenges of optimization tasks in specific domains, and the behavior of different algorithms on such tasks, is essential to the development of efficient search strategies that are suitable to the considered application. Empirical assessments of the performance of different optimization strategies on realistic and domain-relevant scenarios are thus of paramount practical relevance.
One aspect where optimization has recently gained increased attention is the digitization of scientific discovery with autonomous platforms. [3][4][5] The emergence of ever more sophisticated and reliable automated experimentation equipment in chemistry and materials science over the last decades has increasingly allowed for the formulation of scientific discovery as an optimization task. [6][7][8] In this formulation, the compositions of candidate materials and the processing conditions used to fabricate multi-component materials are optimized to reach desired goals with respect to the physical and chemical properties of the synthesized material. Key missions in these fields relate to the discovery of functional molecules and advanced materials to tackle societal challenges such as climate change, renewable energy, sustainability, or clean water, which can be directly approached by modifying the structures and compositions of candidate materials to optimize their physical and chemical properties. 9 Automated instrumentation is now being combined with data-driven optimization strategies to enable autonomous molecule and materials development in self-driving laboratories. 10 Autonomous experimentation leverages these data-driven strategies to suggest molecule or material candidates that are synthesized and characterized by robotic platforms, [10][11][12][13] with real-time feedback on the suggested candidates being collected in the form of physical or chemical measurements. In this vision, the experimentation process requires minimal human intervention once the experimental campaign has been defined. The integration of algorithmic experiment planners with robotic hardware into an autonomous platform has already been shown to substantially lower the development costs of organic photovoltaic materials, 14 identify novel chemical reactions, 15 yield unexpected findings for thin film technologies, 16 enable the discovery of photocatalysts for hydrogen production from water, 17 and advance mechanical design, 18 amongst other applications. Several different optimization strategies have already been used for automated scientific discovery. While some of these optimization algorithms have been designed for broad applicability across general optimization tasks, other approaches have been developed with the more specific goal of planning laboratory experiments and are based on assumptions about the expected experimental response surfaces. For example, design of experiments (DoE) constitutes a frequently employed strategy to identify optimal conditions for chemical reactions, 19,20 where the system of interest is probed on a grid of different parameter choices. Chemical reactions have also been optimized with the Snobfit algorithm, [21][22][23] variants of the Simplex method, [24][25][26] or even gradient-based strategies. 26 Bayesian optimization frameworks have been demonstrated on materials science applications, most often realized using Gaussian processes [27][28][29] or random forests. 30 While the experiment planning strategies deployed in the aforementioned examples enabled autonomous workflows, it is not clear whether they are the most efficient ones for the considered tasks. In fact, it has recently been reported that ill-chosen planners can increase the budget requirements for scientific discovery in the context of materials science by up to an order of magnitude. 31
Without comprehensive benchmarks, availability and ease of use might be the primary considerations behind the choice of an experiment planning strategy, while other factors such as the speed of convergence or the computational demand are neglected. The inability to evaluate the effectiveness of different experiment planning strategies thus poses a major obstacle to the development of autonomous research platforms.
To resolve this challenge, we propose to benchmark experiment planning strategies on probabilistic models. These models can emulate noisy experimental responses after being trained on experimental data, as previously demonstrated in the context of multi-objective optimization with autonomous research platforms. 32 In particular, we suggest using Bayesian neural networks (BNNs) due to their robustness, scalability, and non-local generalization capabilities. The outcome (e.g., reaction yield, solubility, etc.) of a specific set of experimental parameters (e.g., concentration, temperature, etc.) can be emulated by drawing a predictive sample from the BNN, conditioned on these parameters. This approach provides a viable avenue to benchmarking experiment planning algorithms in the presence of noise and on realistic, experimentally derived response surfaces.
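Conceptually, and anticipating the interface detailed in Sec. IV, querying such an emulator amounts to a single call conditioned on the chosen parameters. The sketch below assumes that the Emulator class exposes a run method returning a sampled measurement; the parameter values are hypothetical.

# a conceptual sketch; the run method and its return format are
# assumptions about the interface detailed in the documentation
from olympus import Emulator

emulator = Emulator(dataset="snar", model="BayesNeuralNet")
measurement = emulator.run([0.1, 60.0, 1.0, 0.5])  # hypothetical parameter values
# repeated calls with the same parameters yield different samples,
# mimicking experimental noise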
Following this idea of emulating experimental responses based on real data, we introduce Olympus, a comprehensive software package for probing the performance of experiment planning strategies on emulated response surfaces derived from experiments in chemistry and materials science. Olympus implements a common interface to 18 different experiment planning strategies and thus simplifies the implementation of closed-loop autonomous workflows. Olympus further provides a collection of 10 experimental datasets, for which emulators have been trained to serve as a standard set of benchmarks, and a collection of 23 analytical surfaces which can be modulated by different sources of stochastic noise. An automated benchmarking process that determines the most efficient planner for a given application is also available. As such, Olympus provides the means to run comprehensive comparisons of novel optimization algorithms and planning strategies against existing ones, allowing users to identify the strengths and limitations of individual tools for various scientific discovery tasks. Its capacity to construct probabilistic approximations of experimental surfaces from collected data, modeling both the expected response and the noise modulations, makes Olympus a realistic benchmark suite that avoids excessive and resource-demanding experimentation.
In the following, we summarize the datasets and emulators available through Olympus as well as the experiment planning strategies for which intuitive yet flexible interfaces have been implemented. We further highlight the application programming interface of Olympus and demonstrate how individual planners can be accessed and comprehensive benchmarks constructed with only a few lines of code. We conclude by providing a performance baseline based on uniform random search, and invite the community to develop and demonstrate more efficient experiment planning strategies on the Olympus benchmarks.

II. BACKGROUND AND RELATED WORKS
While the goal of an optimization task is usually well defined, the setting in which this task is approached might differ from application to application. Thus, the applicability of optimization strategies to certain tasks can be assessed based on multiple criteria, which are designed to highlight the strengths and shortcomings of individual strategies on the considered application. For example, GuacaMol 48 and Moses 49 offer benchmarking functionalities for de novo molecular design. These examples provide datasets which aim to model realistic abstractions of the targeted applications on which optimization algorithms can be benchmarked. Yet, the requirements of comprehensive benchmarking frameworks go beyond realistic use cases and also include: (i) intuitive interfaces to interact with these datasets, (ii) interfaces to established algorithms to benchmark, (iii) tools to store and analyze collected results, and (iv) the flexibility to allow the community to extend the framework with additional datasets and algorithms.
Tab. I reports a set of currently available benchmarking toolkits for different applications. Coco is a platform for the systematic comparison of real-parameter global optimizers. 33 It provides benchmark function testbeds, experimentation templates which are easy to parallelize, and tools for processing and visualizing data generated by one or several optimizers. Coco focuses on runtime as the central performance measure and on optimization tasks on continuous domains with dimensionalities beyond those typically encountered in chemistry and materials science. The OpenAI Gym offers a series of environments and tasks to test reinforcement learning algorithms. 34 Sherpa is a Python toolkit for hyperparameter tuning of machine learning models. 35 As such, Sherpa offers the automated optimization of hyperparameters via a choice of hyperparameter optimization algorithms, including Bayesian optimization, evolutionary approaches, and bandit/early-stopping schemes. Sherpa orchestrates the entire optimization process, and results can be visualized in a comprehensive dashboard. However, Sherpa does not provide synthetic or noisy benchmark cases. Optuna is another toolkit that focuses on the optimization of hyperparameters for machine learning models. 36 In contrast to Sherpa, Optuna implements a define-by-run interface for the dynamic construction of search spaces. However, it also does not provide benchmark cases. Pygmo is a library for massively parallel optimization, which provides a unified interface to a number of gradient- and heuristic-based optimization algorithms, as well as to synthetic benchmark problems. Pygmo also provides algorithms and benchmarks for constrained and multi-objective optimization problems.
The aforementioned software packages have been developed with machine learning applications in mind. Summit, however, provides a selection of chemically motivated virtual benchmarks together with a selection of experiment planning strategies. Although the application space of Summit is heavily focused on reaction optimization, it targets a realistic modeling of its use cases via physical and statistical models. In contrast, Olympus is tailored to the needs of optimization in a broader range of experimental disciplines, including self-driving laboratories and autonomous experimentation workflows. Specifically, it constitutes a framework to assess the algorithmic performance of data-driven experiment planning strategies in the context of autonomous experimentation for chemistry and materials science, where the number of parameters to optimize is typically smaller than 10. To serve this purpose, Olympus provides interfaces to optimization algorithms commonly used for experiment planning tasks and offers interfaces to noisy emulators of experimental optimization tasks. In addition, the benchmarking capabilities of Olympus are open to be extended by the community, who can contribute their own datasets (see Sec. IV.F).

III. PACKAGE OVERVIEW
Olympus is a modular software package that allows user interactions at different levels and can be used for data-driven experimentation as well as for benchmarking experiment planning strategies. With this modularity, Olympus allows for both beginner and expert use and enables performing several optimization and benchmarking tasks in a few lines of code. Some common use-case scenarios are detailed in Sec. IV, including (i) the use of different experiment planners for an autonomous workflow, (ii) benchmarking an experiment planner on an emulator, and (iii) constructing an emulator from a user-provided dataset. At the heart of Olympus are four modules, planners, surfaces, datasets, and emulators, which are highlighted conceptually in this section and in Fig. 1.

FIG. 1. High-level overview of Olympus and its four core modules: (i) planners, which provide interfaces to common or custom optimization algorithms for experiment planning, (ii) surfaces, which constitute standardized interfaces to established synthetic benchmark surfaces, (iii) emulators, which describe a set of ML models trained to reproduce experimentally derived response surfaces encountered in chemistry and materials science, and (iv) datasets, which form a collection of experimental campaigns. All four core modules offer the possibility to implement and add custom methods and data.
The planners module (see Sec. III.A) provides a consistent interface to 18 different experiment planning strategies via its core Planner class. Olympus maps a standardized access protocol onto the interfaces of the individual planners, making it easy to switch the experiment planning strategy of an autonomous workflow. This module also provides the basis for integrating custom algorithms into the package. Available planners are listed in Table II. The surfaces module provides a set of synthetic response surfaces, which are functions commonly used to evaluate and compare the performance of optimization algorithms. Similar to the planners module, a convenient Surface class allows the user to easily retrieve the desired analytical surface. Available surfaces are listed in Table IV. While these surfaces return deterministic function evaluations by default, it is possible to pass a noise object that results in stochastic evaluations, as shown in Sec. IV.B.
The datasets module in Olympus offers 10 core experimentally derived datasets from chemistry and materials science. These datasets vary in size and represent optimization tasks with dimensionalities from 3 to 6. The core class of this module is Dataset, which allows for the retrieval and manipulation of the desired dataset. Available datasets are listed in Table III. Users can also load their own datasets, which can then be used to benchmark experiment planning strategies for the specific problem of interest. Furthermore, users can share their datasets with the community by uploading them to the Olympus repository of user-provided datasets at https://github.com/aspuru-guzik-group/olympus_datasets. Any user can then download these additional datasets and use them in Olympus via the same interface used for the core datasets.
The emulators module provides access to probabilistic models trained on the core Olympus datasets, reproducing the experimental responses of the corresponding experiments. With its Emulator class, this module also offers a high-level interface for training such probabilistic models on user-provided datasets. In this spirit, emulators constitute stochastic response surfaces resembling those encountered in real-life applications, thus allowing users to benchmark experiment planning strategies on close-to-reality optimization tasks.

A. Summary of included planners
This section details the types of experiment planning strategies and algorithms available in Olympus and listed in Table II. More information about each specific planner can be found in the online documentation.
Gradient approaches use derivative information (gradient or Hessian) at the current point to determine the location of the next point to be evaluated. Such strategies are efficient on convex optimization problems, but are not guaranteed to find the global optimum on non-convex surfaces. 75,76 Most gradient-based approaches condition both the stepping direction and the step size on the local gradient. The numerical approximation of gradients generally poses a challenge in the context of experimentation, where the response surface is subject to noise. Nevertheless, gradient-based search strategies have been reported for the optimization of some chemical processes. 77

Grid-like searches constitute a more common approach to experiment planning. [68][69][70] These strategies define a set of selected points in the parameter space to be evaluated at the start of the optimization campaign. At every step of the campaign, the next point to be evaluated is chosen deterministically. Although grid-like searches mitigate the locality issue of gradient approaches and can reliably identify global optima, their cost scales exponentially with the dimensionality of the parameter space. Alternatives to standard full-grid approaches involve the use of low-discrepancy sequences, such as Latin hypercube sampling or the Sobol sequence, to sample high-dimensional spaces more effectively. The discrepancy of a sequence is considered to be low if the proportion of points falling into an arbitrary subset of the considered parameter domain is roughly proportional to the measure of this subset (see the sketch after this section). Low-discrepancy sequences are also known as quasi-random sequences and are commonly used for finding characteristic functions of probability density functions, computing higher-order moments of statistical distributions, and integrating and sampling high-dimensional deterministic functions. Random search reduces the correlation between consecutive proposals even further and has been shown to be particularly effective in higher-dimensional search spaces. 52,78,79

Evolutionary algorithms are population- and heuristic-based approaches inspired by biological evolution. [80][81][82][83][84] Each individual in the population represents a point in the search space, and its fitness corresponds to the objective evaluated at that point. Evolutionary strategies, like CMA-ES, particle swarms, and differential evolution, evolve a population of candidate solutions simultaneously and generate new candidates based on heuristics, with better candidates frequently replacing worse-performing ones. 85 Genetic algorithms constitute a subclass of evolutionary strategies which mimic mechanisms such as reproduction, mutation, recombination, and selection to iteratively improve the fitness of a population.
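To illustrate the notion of discrepancy mentioned above, the following minimal sketch (using SciPy, not part of Olympus) compares the discrepancy of uniform random samples with that of a Sobol sequence on the unit hypercube.

# minimal sketch: uniform random sampling vs. a Sobol low-discrepancy sequence
import numpy as np
from scipy.stats import qmc

n, d = 64, 4  # number of samples and dimensionality

random_points = np.random.default_rng(seed=0).uniform(size=(n, d))
sobol_points = qmc.Sobol(d=d, seed=0).random(n)

# lower values indicate a more even coverage of the unit hypercube
print("random:", qmc.discrepancy(random_points))
print("Sobol: ", qmc.discrepancy(sobol_points))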
Other heuristic-based approaches are not inspired by biological evolution specifically. For example, Basin Hopping is a two-step approach that combines local and global searches and is inspired by the energy landscapes of atomic clusters. 73 Snobfit also combines local and global approaches, and was designed to address a number of practical challenges. 72 Finally, the Simplex algorithm by Nelder and Mead exploits the geometry of simplices to define an update rule that proposes new points in a downhill direction. 74

Bayesian optimization methods are sequential, model-based approaches for the global optimization of black-box functions. [86][87][88] The function to be optimized is approximated by a surrogate model that is refined as more data are collected. Based on this model, an acquisition function that evaluates the utility of candidate points can be defined, leading to a balanced exploration and exploitation of the search space of interest (see the sketch after this section). Similar to evolutionary strategies, Bayesian optimization requires no gradient information. Individual Bayesian optimization approaches are primarily distinguished by their surrogate models and acquisition functions: GPyOpt uses a Gaussian process to model the objective function, 50 Phoenics adopts a mixture of Gaussian kernels, 54 and HyperOpt uses a tree-structured Parzen estimator. [51][52][53]

We note that the algorithms available in Olympus exhibit different computational scalings with respect to the number of samples collected and the dimensionality of the optimization domain. These algorithmic and implementation aspects result in different runtimes and memory requirements, which ultimately affect the applicability of each algorithm to different problems. However, given the problem dimensionalities (< 10 parameters) and runtimes (> 10 min per experiment) typically encountered experimentally, the computational cost of any of the algorithms described above is not expected to be of significant importance in autonomous workflows.
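The following minimal sketch (not part of Olympus) illustrates the Bayesian optimization loop described above, using a Gaussian process surrogate from scikit-learn and an expected improvement acquisition function to minimize a one-dimensional stand-in for an expensive experiment.

# minimal sketch of a Bayesian optimization loop (minimization)
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):  # stand-in for an expensive experiment
    return np.sin(3 * x) + 0.1 * x ** 2

X = np.random.uniform(0, 5, size=(3, 1))  # initial observations
y = objective(X).ravel()
candidates = np.linspace(0, 5, 500).reshape(-1, 1)

for _ in range(20):
    # refine the surrogate model with all data collected so far
    surrogate = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, sigma = surrogate.predict(candidates, return_std=True)
    # expected improvement over the best observation so far
    improvement = y.min() - mu
    z = improvement / (sigma + 1e-9)
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("best observed:", X[np.argmin(y)], y.min())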

B. Summary of included datasets
Olympus ships with a total of ten core datasets collected from experiments spanning chemistry and materials science. The datasets were either collected from the literature or generated in-house. With these datasets, Olympus can construct experiment emulators using probabilistic machine learning models, notably BNNs, to emulate the overall response surface of the considered experiment for an arbitrary choice of parameter values (see Sec. III.C for details). As such, the provided datasets constitute the basis for realistic benchmarks of experiment planning strategies. Table III summarizes core information about each dataset, and further details are provided in the supplementary information (see Sec. VII.A). All datasets are collected from experimental campaigns with three to six independently controllable parameters and one property of interest, and contain from a few tens to more than 1,000 data samples. Five datasets are related to the optimization of organic chemistry reactions, one is derived from the calibration of analytical chemistry instrumentation, two address the identification of polymer blends of photovoltaic materials with favorable photodegradation properties, and two are related to the identification of the colorant mixture displaying a chosen target color. This core set of datasets can be extended by community datasets contributed by individual research groups. Details are provided in Sec. IV.F.

C. Summary of included emulators
The experiment emulators offered through Olympus provide a core functionality for benchmarking data-driven experiment planning strategies: the opportunity to query the response of a quasi-experimental surface inexpensively, within milliseconds. Balancing robustness and prediction accuracy on the data-scarce datasets reported in Sec. III.B, we construct Olympus emulators from feed-forward, fully connected Bayesian neural networks (BNNs). BNNs are probabilistic machine learning models which, contrary to standard neural networks, define a distribution of possible target values conditioned on the input features. To this end, the conventional weight and bias parameters of standard neural networks are themselves modeled as distributions, and the BNN is trained via Bayesian inference. While, in theory, weights and biases can be modeled by any valid distribution, in practice the distributions are often explicitly modeled via a small set of parameters, such as the location and scale of a normal distribution (see the sketch below). This approximation can greatly accelerate inference computations and makes the training of a BNN computationally tractable overall. In addition to probabilistic, BNN-based emulators, we also provide deterministic, NN-based ones, which return the same target value for a given set of input features.
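To illustrate this parameterization, the following is a minimal numpy sketch (not Olympus's implementation) of a single Bayesian linear layer: each weight is modeled by a normal distribution with location mu and scale sigma, and predictions are obtained by sampling the weights. In a trained BNN, mu and sigma would be learned via Bayesian inference; here they are set to arbitrary values.

# minimal sketch of a Bayesian linear layer with normal weight distributions
import numpy as np

rng = np.random.default_rng(seed=0)
n_in, n_out, n_samples = 3, 1, 100

# variational parameters (in practice, learned via Bayesian inference)
w_mu = rng.normal(size=(n_in, n_out))
w_log_sigma = np.full((n_in, n_out), -2.0)
b_mu = np.zeros(n_out)
b_log_sigma = np.full(n_out, -2.0)

def sample_prediction(x):
    # draw one set of weights and biases from the learned distributions
    w = w_mu + np.exp(w_log_sigma) * rng.normal(size=w_mu.shape)
    b = b_mu + np.exp(b_log_sigma) * rng.normal(size=b_mu.shape)
    return x @ w + b

x = np.array([[0.2, 0.5, 0.1]])
predictions = np.concatenate([sample_prediction(x) for _ in range(n_samples)])
print(predictions.mean(), predictions.std())  # predictive mean and uncertainty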
Emulators are trained on 80% of the data and tested on 20%, using a random split. The training set is furthermore split into training and validation sets for cross-validation. By default, 5 folds are used, but users can choose the number of folds when creating their own emulators. The test set is used to probe the generalizability of the emulator. Model performances on both training and test sets are shown in Fig. 2 for BNN-based emulators and in Fig. 10 for NN-based emulators. Emulators are constructed with different choices of hyperparameters, including the number of layers, the number of neurons per layer, and the activation functions, which can be accessed directly through the emulator objects. Activation functions for the output layer have been chosen to satisfy physical constraints, such as the positivity of the property of interest. Other hyperparameters have been selected manually to achieve promising prediction accuracies. We consider an emulator accurate if it achieves a Spearman's rank correlation coefficient above 0.90 on both training and test sets, given that a monotonic relationship between predicted and measured values preserves the relative ranking of all extrema. Typical evaluations of trained emulators take less than 1 ms on a standard laptop. This cheap emulation approach enables the large-scale querying of experimental responses and the rigorous benchmarking of data-driven experiment planning strategies.
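This accuracy criterion can be illustrated with a short, self-contained sketch, independent of the Olympus API; the synthetic data and the random forest stand-in for the BNN emulator are hypothetical.

# minimal sketch of the Spearman-based accuracy criterion on an 80/20 split
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor  # stand-in for the emulator
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
X = rng.uniform(size=(200, 4))                    # hypothetical parameters
y = X.sum(axis=1) + 0.05 * rng.normal(size=200)   # hypothetical measurements

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

rho_train, _ = spearmanr(y_train, model.predict(X_train))
rho_test, _ = spearmanr(y_test, model.predict(X_test))
# the emulator is considered accurate if both coefficients exceed 0.90
print(rho_train > 0.90 and rho_test > 0.90)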

D. Summary of included analytical surfaces
In addition to experimentally derived benchmarks, Olympus provides a suite of analytical functions traditionally used to evaluate optimization algorithms (Table IV). These include 11 analytic and smooth functions, such as the Branin and Rosenbrock functions, 5 piecewise constant functions, and 6 Gaussian mixture model functions derived from a parent Gaussian mixture generator. This generator takes the number of dimensions as an argument and draws random means and covariances. By default, a full covariance matrix is drawn, but a diagonal matrix can also be requested. By fixing the random seed, the Gaussian mixture generator creates reproducible surfaces. In fact, the six available Gaussian mixture models have been obtained by fixing the random seed of each mountain-named surface (e.g., Everest) to the height of the respective mountain peak in meters (e.g., 8848).
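The following minimal sketch (not Olympus's generator) illustrates this idea: fixing the seed fixes the drawn means and covariances, making the surface reproducible. For brevity, this sketch draws diagonal covariances, whereas Olympus draws full covariance matrices by default.

# minimal sketch of a reproducible random Gaussian mixture surface
import numpy as np
from scipy.stats import multivariate_normal

def make_mixture(dim, n_components=3, seed=8848):  # e.g., Everest's height
    rng = np.random.default_rng(seed)
    means = rng.uniform(0, 1, size=(n_components, dim))
    scales = rng.uniform(0.01, 0.05, size=n_components)
    components = [multivariate_normal(mean=m, cov=s * np.eye(dim))
                  for m, s in zip(means, scales)]
    return lambda x: sum(c.pdf(x) for c in components)

surface = make_mixture(dim=2)
print(surface(np.array([0.5, 0.5])))  # identical on every run with this seed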
For all analytical surfaces in Olympus, it is possible to specify noise to be added to the evaluations, such that the output of these toy surfaces becomes stochastic. A few commonly used noise functions, including Gaussian, uniform, and gamma-distributed noise, are implemented and readily available in Olympus. However, custom types of noise can also be defined by users and provided to the surface of interest, which will then return noisy evaluations. Note that noise modulations are currently only supported for experimental responses. However, realistic experiments might also be subject to noise in the preparation of the experimental parameters, such as when dispensing desired amounts of chemicals or controlling the temperature of an experimental setup. In Olympus, users may add noise to the experimental parameters by taking advantage of the Planners interface.
Tab. IV summarizes the synthetic benchmark functions available in Olympus. Further details as well as illustrations of the surfaces are reported in the supplementary information (see Sec. VII.C).

IV. USING OLYMPUS
In this section, we detail the usage of Olympus on selected applications and use cases. A detailed documentation of the package is provided on GitHub. 92

A. Installation and dependencies
Olympus is available for download on GitHub 92 and can be installed via pip or conda. The installation requires Python 3.6+ with support for numpy and pandas. However, to access specific features of the package, such as running an emulator, using specific experiment planners, or plotting the results of completed campaigns, the installation of additional packages might be required. Details are provided in the documentation. 92

B. Evaluate analytical surfaces
The analytical surfaces in Olympus can be accessed via the olympus.surfaces module or the Surface function, with the latter loading a surface with default arguments.

from olympus.surfaces import Michalewicz
surface = Michalewicz(param_dim=2, m=12)

# or, to load with default arguments
from olympus import Surface
surface = Surface("Michalewicz")

The above example defines a surface with deterministic output. However, noise can be added to obtain a surface instance that returns stochastic evaluations.

from olympus.surfaces import Dejong
from olympus.noises import GaussianNoise

noise = GaussianNoise(scale=0.5)
surface = Dejong(param_dim=2, noise=noise)

Surfaces can then be evaluated sequentially or in batches, as sketched below.
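The following is a minimal sketch of this usage; the run method and its call signature are assumptions about the surface interface, and the online documentation 92 should be consulted for the exact API.

# sketch of surface evaluation; the run method and its signature are assumed
value = surface.run([0.5, 0.5])                  # single evaluation
values = surface.run([[0.2, 0.8], [0.7, 0.3]])   # batched evaluation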

C. Run a simulated campaign
The datasets available in Olympus can be accessed via the Dataset class, using the keyword associated with each dataset.
# load an Olympus dataset
from olympus import Dataset
dataset = Dataset("snar")

Neural network (NN) or Bayesian neural network (BNN) based emulators are already available in Olympus for all datasets provided. However, the user also has the freedom to train new emulators by customizing the models provided in the olympus.models module.

from olympus import Emulator
emulator = Emulator(dataset="snar", model="BayesNeuralNet")

# or customize the model
from olympus.models import BayesNeuralNet
model = BayesNeuralNet(hidden_depth=4, out_act="sigmoid")
emulator = Emulator(dataset="snar", model=model)

All algorithms described in the previous section can easily be accessed from the olympus.planners module or via the Planner function. While the former allows the user to choose specific settings for each planner, the latter loads them with default arguments.

from olympus.planners import Gpyopt
planner = Gpyopt(goal="minimize", model_type="GP_MCMC",
                 acquisition_type="EI_MCMC")

# or, to load with default arguments:
from olympus import Planner
planner = Planner("Gpyopt", goal="minimize")

Once a planning algorithm and an emulator have been defined, it is possible to start a simulated optimization campaign using the optimize method.

emulator = Emulator(dataset="snar", model="BayesNeuralNet")
planner = Planner("Phoenics", goal="minimize")
campaign = planner.optimize(emulator=emulator, num_iter=50)

D. Train custom emulator
With Olympus you can create an Emulator in order to generate a custom emulated response surface for a new dataset. For instance, if you have data for a chemical reaction of interest, for which you would like to optimize the yield, you can load the dataset from a table as follows.
# load a custom dataset
from olympus import Dataset
import pandas as pd

mydata = pd.read_csv("mydata.csv")
dataset = Dataset(data=mydata, target_ids=['yield'])

After this step, you can load one of the available models from the olympus.models module and pass it to a new Emulator instance, which will allow you to cross-validate and train the emulator. Users can override default model hyperparameters by passing custom values as arguments to the olympus.models module. Once you obtain an emulator with satisfactory performance, you can save it for later use.


E. Create a custom planner

Custom experiment planning strategies can be integrated into Olympus by inheriting from the CustomPlanner class and implementing an ask method that proposes the next set of parameters, as sketched below. The ask method takes advantage of the param_space attribute present in CustomPlanner, which is a list of dictionaries defining the parameter space over which to optimize. In addition, the params and values attributes contain the parameters and associated merits for all previous observations, respectively. These attributes are needed for any algorithm in which the proposed parameters depend on the previous observations. Finally, note that ask returns a ParameterVector object, which can be instantiated with an array or dictionary of parameters.
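As an illustration, the following is a minimal sketch of a custom planner performing uniform random search. The attribute and method names follow the description above and the online documentation; 92 in particular, the 'low'/'high' keys of the parameter dictionaries and the param_space keyword of ParameterVector are assumptions made for this sketch.

# sketch of a custom random search planner (names partly assumed)
from olympus.planners import CustomPlanner
from olympus import ParameterVector
import numpy as np

class RandomSearch(CustomPlanner):

    def ask(self):
        # draw one uniform random value per dimension of the parameter
        # space (bounds format assumed for illustration)
        vals = [np.random.uniform(low=p['low'], high=p['high'])
                for p in self.param_space]
        return ParameterVector(array=vals, param_space=self.param_space)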
In the above example, an __init__ method is not specified, because the following default one is inherited from CustomPlanner.

def __init__(self, goal='minimize'):
    AbstractPlanner.__init__(**locals())

If you would like to initialize your own Planner with more options, you can expand upon the above __init__ method. Note that it is required to keep the argument goal and to initialize AbstractPlanner as above. A tutorial on the creation of custom Planner classes, with further details on possible customization, is available as part of the online documentation.

F. Download/upload community datasets
In addition to the set of core datasets distributed with Olympus, we allow users to share their own datasets with the community. These additional datasets are stored on GitHub and provide an extended set of benchmarks built by the autonomous experimentation community. Olympus provides intuitive command line tools to upload and download these datasets. For instance, to download the excitonics dataset and make it available to your local Olympus installation:

>> olympus download -n excitonics
After the download, the excitonics dataset will be available and you will be able to load it in the same way as the core datasets.
from olympus import Dataset
dataset = Dataset("excitonics")

Note that, for community-provided datasets, trained Emulator instances are not readily available, so you will need to train the relevant Emulator yourself and then save it for later use.
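A minimal sketch of this step, reusing the Emulator interface from Sec. IV.C (the train method and the model keyword used here are assumptions based on that interface), followed by the save call:

from olympus import Emulator

# train a neural-network emulator on the downloaded dataset
emulator = Emulator(dataset="excitonics", model="NeuralNet")
emulator.train()  # method name assumed; see the online documentation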

emulator.save("excitonics_nn_emulator")
If you have a dataset that you think would be a useful benchmark for the community, you can upload it to the Olympus pool of datasets using the Olympus command line tools; the upload command mirrors the download command shown above, and its exact syntax is described in the online documentation. 92

G. Plotting benchmark results
Results collected in several campaigns can be plotted automatically via a comprehensive plotting interface. Plots are generated from a Database object, like the one created by Olympus when running a benchmark campaign. An example of a plot generated from the results of an executed benchmark is shown in Fig. 3.
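As a sketch of this interface, the snippet below plots the results stored in a Database object; the Plotter class and its plot_from_db method are assumptions about the plotting API, and the online documentation 92 should be consulted for the exact interface.

from olympus import Baseline, Plotter  # Plotter name assumed

# retrieve a Database object, here from the random search baseline (Sec. V)
database = Baseline().get('snar', kind='db')
plotter = Plotter()                 # plotting class assumed
plotter.plot_from_db(database)      # method name assumed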

V. A RANDOM SEARCH BASELINE
The promise of data-driven strategies to identify desired parameter choices for experimental setups in closed-loop workflows is based on their capacity to condition design choices on feedback collected from previous experiments. The performance of a data-driven experiment planning strategy can, for example, be quantified via the number of experiments required to locate parameter values which yield the desired experimental outcomes. In this section, we provide a baseline for the performance of data-driven experiment planning strategies, which can indicate the degree of difficulty that each constructed emulator poses to a planner.
We construct the baseline by probing the performance of the random search strategy on each of the emulators. Random search can be considered a naive experiment planning strategy, as it does not leverage any feedback collected in previous experiments. Parameter choices for future experiments are generated by drawing random samples from uniform distributions supported on the allowed ranges of each of the parameters. As such, the suggested parameter values are independent of one another and are not influenced by any past measurements. Data-driven strategies for experiment planning that do condition their design choices on previous feedback are therefore expected to outperform the random search baseline. The margin by which random search is outperformed can be used as a proxy to quantify the efficiency of a planner on each emulator.
The provided baseline consists of 100 independent campaigns with 10,000 emulator evaluations per campaign for each of the emulators. Note that results from the random baselines are not shipped with the software package and need to be downloaded separately. Random baselines are available on GitHub 92 and can be downloaded from there or via the Olympus command line interface.

# download random baseline
>> olympus baseline --get
All parameters are generated with the random search planner. The results of these baseline calculations can be accessed through Olympus as follows.

# load the baseline
from olympus import Baseline
base = Baseline()
summary = base.get('snar', kind='summary')
campaigns = base.get('snar', kind='campaigns')
database = base.get('snar', kind='db')

While the full traces of the random search baselines are available through Olympus, we suggest comparing the achieved feedback after a specified set of emulator evaluations, for which we propose [1, 3, 10, 30, 100, 300, 1000, 3000, 10000]. This choice is inspired by the fact that most experimental campaigns reported for autonomous experimentation platforms are limited to about 100 experiments. This set of evaluation numbers makes it possible to estimate the performance of each planner in the regimes of little data (∼10 evaluations), medium data (∼100 evaluations), abundant data (∼1,000 evaluations), and asymptotic behavior (∼10,000 evaluations).
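The suggested comparison can be computed with a few lines of plain numpy, independent of the Olympus API; the random trace below is a stand-in for the objective values of one campaign.

# best objective value achieved after a given number of evaluations
import numpy as np

rng = np.random.default_rng(seed=0)
values = rng.uniform(size=10000)             # stand-in for one campaign trace
best_so_far = np.minimum.accumulate(values)  # cumulative best (minimization)

for n in [1, 3, 10, 30, 100, 300, 1000, 3000, 10000]:
    print(n, best_so_far[n - 1])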
The results of random search on all core Olympus datasets are illustrated in Fig. 4. Based on these results, we can identify a subset of the emulated surfaces for which random search reaches near-optimal property values within a small number of evaluations. This subset includes photobleaching_pce10, photobleaching_wf3, colormix_bob, colormix_n9, snar, and hplc_n9. Given that random search does not leverage any feedback from collected measurements for future decisions, these emulated surfaces may be considered the simpler cases for a more sophisticated experiment planner. The remaining surfaces, including alkox, fullerenes, n_benzylation, and suzuki, might pose a bigger challenge to experiment planners, given that asymptotic property values are only achieved after a significant number of random evaluations, or are not reached at all within 10,000 evaluations. Numerical values for the baseline are available through the Olympus package. 92 We hope that the results from this random search can serve as a baseline against which to compare the performance of different experiment planning strategies, both those already included in Olympus and new strategies developed by the community.

VI. CONCLUSION
Standardized and challenging benchmarks are necessary to facilitate precise comparisons between different approaches and to quantify scientific and technological advances. Widely used benchmark sets like MNIST and CIFAR-10/100, 39,40 which are comprised of images of hand-written digits and of various objects and animals, respectively, have made it possible to measure steady advances in machine vision, providing clear feedback to the community on the most promising research directions. MoleculeNet, a collection of quantum mechanical, physical, biophysical, and physiological molecular properties, provides a similar example in the field of chemistry and biophysics. 41 Olympus constitutes an orthogonal set of benchmarks, with a focus on optimization and experiment planning in chemistry and materials science, as opposed to prediction. It provides a framework with the potential to spark and streamline the development of powerful algorithms and data-driven approaches aimed at efficient experiment planning. To this end, Olympus also provides intuitive interfaces to a variety of experiment planning strategies to simplify their implementation, deployment, and testing in autonomous discovery workflows. With every user being able to supply their own datasets through our standardized interfaces, Olympus also encourages the free exchange of data across the community and promotes the establishment of standard, reproducible optimization challenges. In summary, Olympus provides a unified framework for the deployment and testing of experiment planning strategies. We thus invite the community to take advantage of Olympus in the implementation and testing of novel approaches to autonomous workflows, as well as to share experimental data that can prove valuable in moving this exciting new field forward.


VII. APPENDIX

A. Datasets

In this section, we provide a brief summary of each dataset available in Olympus, along with its parameters, objectives, and optimization goal.

Alkoxylation
This dataset contains 104 measurements of the biocatalytic oxidation of benzyl alcohol by a copper radical oxidase (AlkOx). The effects of enzyme loading, cocatalyst loading, and pH on both the initial rate and the total conversion were assayed. Stock solutions were prepared daily and stored over crushed ice. Additional dilutions were prepared as required using sodium phosphate buffer and immediately discarded after use. The assays were initiated by the addition of CgrAlcOx and H2O2 to a well-mixed HPLC vial containing all other reaction components. The initial rate was obtained by fitting the concentration of aldehyde to a linear function and reporting the slope. Conversion was calculated by fitting the percent conversion of aldehyde to a linear function and reporting the value at twenty minutes.

Buckminsterfullerene adducts
This dataset is based on the reported production of o-xylenyl adducts of Buckminsterfullerene (Fig. 6). 90 Three process conditions can be varied to maximize the mole fraction of the desired products X1 and X2: the temperature, the reaction time, and the ratio of sultine to C60. Experiments were based on a full factorial design with three factors and six levels, totaling 246 samples.

The parameters of the fullerenes dataset (kind, range, and description, along with the objective), including the reaction time (continuous, range [3, 31]), are summarized in the corresponding table.

HPLC
This dataset reports the peak response of an automated high-performance liquid chromatography (HPLC) system for varying process parameters. 89 The dataset includes 1,386 samples with six parameters and one objective.


N-benzylation

This dataset reports the yield of undesired product (impurity) in an N-benzylation reaction. 91 The undesired product is the tertiary amine shown in Fig. 7. The dataset includes 73 samples with four parameters and one objective.


Photobleaching (PCE10 and WF3)

These two datasets report the degradation of polymer blends for organic solar cells under exposure to light. Individual data points encode the ratios of the individual polymers in one blend, along with the measured photodegradation of this blend. 14 Each dataset includes 1,040 samples with four parameters and one objective.


SNAr

This dataset reports the environmental factor (E-factor) for the nucleophilic aromatic substitution (SNAr) reaction shown in Fig. 8. 91 The E-factor is defined as the ratio of the mass of waste to the mass of product. The dataset includes 66 samples with four parameters and one objective.


Suzuki coupling

High-throughput reactions were carried out on the palladium-catalyzed Suzuki cross-coupling between 2-bromophenyltetrazole and an electron-deficient aryl boronate (see Fig. 9). Cross-couplings of aryl halide electrophiles bearing non-protected ortho tetrazole substituents are typically carried out under harsh conditions due to the metal-chelating nature of the tetrazole moiety, but we have found that the use of electron-rich bidentate phosphines, such as dtbpf, in alcohol solvents facilitates milder reaction conditions. A wide range of continuous factors were explored in the microscale optimization run, including the reaction temperature, the Pd(dtbpf)Cl2 loading, and the equivalents of base. This resulted in product yields spanning from 2 to 97 mol %, generating a robust dataset for modeling. Parameters and ranges, including the reaction temperature (continuous, [75, 90]), are summarized in Tab. XIV.


B. Emulator performance

The datasets in Olympus can be emulated using probabilistic Bayesian neural network (BNN) models or deterministic neural network (NN) models. Fig. 2 and Fig. 10 show the correlation between predicted and measured target values for BNN and NN models, respectively. All emulators display a Spearman's rank coefficient above 0.90 for both training and test sets. Train/test splits were performed at random, but using a fixed random seed for reproducibility; 80% of the data was used for training and 20% for testing. Hyperparameter optimization was performed manually using 5-fold cross-validation. The details of all hyperparameters used in these models are stored in the respective Emulator objects that can be loaded from Olympus. In Fig. 2 and Fig. 10, performance on the training set (80% of the data; blue markers) is shown in the top-left corner of each plot on a blue background, while performance on the test set (20% of the data; pink markers) is shown in the bottom-right corner on a pink background. R^2 denotes the coefficient of determination, ρ the Spearman's rank correlation coefficient, and RMSE the root-mean-square error. Emulators trained on datasets introduced in this study are indicated with *.


C. Analytical surfaces

Fig. 11 illustrates the analytical benchmark surfaces available in Olympus. With the exception of Branin, which is restricted to a two-dimensional input space, all other surfaces may be defined in any dimension. Note that all surfaces operate on the unit hypercube, X ∈ [0, 1]^d; Olympus internally scales the inputs to be in agreement with the natural domains of these functions.