Efficient hyperparameter tuning for kernel ridge regression with Bayesian optimization

Machine learning methods usually depend on internal parameters -- so-called hyperparameters -- that need to be optimized for best performance. Such optimization poses a burden on machine learning practitioners, requiring expert knowledge, intuition or computationally demanding brute-force parameter searches. We here address the need for more efficient, automated hyperparameter selection with Bayesian optimization. We apply this technique to the kernel ridge regression machine learning method, using two different descriptors for the atomic structure of organic molecules, one of which introduces its own set of hyperparameters to the method. We identify optimal hyperparameter configurations and infer entire prediction-error landscapes in hyperparameter space, which serve as visual guides for the hyperparameter dependence. We further demonstrate that for an increasing number of hyperparameters, Bayesian optimization becomes significantly more efficient in computational time than an exhaustive grid search -- the current default standard hyperparameter search method -- while delivering equivalent or even better accuracy.


I. INTRODUCTION
With the advent of data science 2,10 , data-driven research is becoming ever more popular in physics, chemistry and materials science 3,11,16,30,35 . Concomitantly, the importance of machine learning as a means to infer knowledge and predictions from the collected data is rising. Especially in molecular and materials science, machine learning has gained traction in recent years and now frequently complements other theoretical or experimental methods 4,[6][7][8][13][14][15][16][23][24][25][26]35 .
The effective use of machine learning usually requires expert knowledge of the underlying model and the problem domain. A particular difficulty that is sometimes overlooked in current machine learning applications is the optimization of internal model parameters, so-called hyperparameters. Non-expert data scientists often spend a long time exploring countless hyperparameter and model configurations for a given dataset before settling on the best one. However, the best settings for these hyperparameters change with different datasets and dataset sizes. The resulting machine learning model is frequently only applicable to one specific problem setting, and new data requires hyperparameter re-optimization. Thus, expert use of machine learning is often a costly endeavor, both in terms of time and computational budget.
Optimal machine learning is achieved with the set of hyperparameters that optimizes some score function f, such as the mean absolute error (MAE), the root mean squared error (RMSE) or the coefficient of determination (R²). The score function reflects the quality of learning given the dataset composition and training set size, and is itself an unknown function of all n hyperparameters, f = f({z}). Hyperparameter tuning comprises a set of strategies to navigate the n-dimensional phase space of hyperparameters and pinpoint the parameter combination that brings about the best performance of the machine learning model.
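For concreteness, the three score functions named above can be written down in a few lines. The sketch below is illustrative (the function name is ours, not from any particular library):

```python
import numpy as np

def scores(y_true, y_pred):
    """Compute MAE, RMSE and the coefficient of determination R^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))                 # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))          # root mean squared error
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mae, rmse, r2
```

Lower MAE and RMSE and higher R² indicate better learning; any one of them can serve as the objective f({z}) of the tuning procedure.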
The most commonly used form of automated hyperparameter tuning is grid search. Here the hyperparameter search space is discretised and an exhaustive search is launched across this grid. This brute-force technique is widely employed in machine learning libraries: the algorithm is easy to parallelise and is guaranteed to find the best solution. However, the number of possible parameter combinations to be explored grows exponentially with the number of hyperparameters n. Grid search becomes prohibitively expensive if more than 2 or 3 hyperparameters need to be optimized simultaneously.
One strategy to overcome this problem is to decompose the n-dimensional hyperparameter space into n separate 1-dimensional spaces around each parameter, assuming no interdependence between them. The set of 1-dimensional grid searches can then be solved in turn, while keeping the values of other hyperparameters fixed. This approach is rarely pursued since hyperparameters are typically codependent: optimal solutions in each dimension depend on the fixed values and there is no guarantee that the correct overall solution can be found.
Hyperparameter tuning is a classic complex optimization problem, where the objective score function f ({z}) has an unknown functional form, but can be evaluated at any point. An algorithm designed to address such tasks is Bayesian optimization 27 . It is widely applied in the machine learning community to tune hyperparameters of commonly used algorithms, such as random forest, deep neural network, deep forest or kernel methods 17,18,[32][33][34] , which are evaluated on a wide range of standard datasets from the UCI machine learning repository 5 . However, it is not yet common to apply Bayesian optimization to machine learning problems in the natural sciences, where high-dimensional hyperparameter spaces are frequently encountered.
In this study, we demonstrate the advantages of applying Bayesian optimization to a hyperparameter optimization problem in machine learning for computational chemistry. Our test case is a kernel ridge regression (KRR) machine learning model that maps molecular structures to their molecular orbital energies 29 . We represent the molecular structures with two different descriptors, the Coulomb matrix 22 and the many-body tensor representation 12 . The KRR method itself requires the optimization of two hyperparameters and one kernel choice. The Coulomb matrix is hyperparameter-free, but the many-body tensor representation adds up to 14 more hyperparameters to the model. Through pre-testing, we reduce this number to the 2 hyperparameters that affect the model the most. The largest hyperparameter space we encounter in this approach is therefore 4-dimensional.
In previous work, we established the optimal kernels for the Coulomb matrix and the many-body tensor representation 29 . We set the KRR hyperparameters by means of grid search for three different molecular datasets, after we had optimized the hyperparameters of the molecular descriptors manually beforehand 29 . Here, we take a more rigorous approach and combine the optimization of descriptor and model parameters into a hyperparameter search of up to four dimensions. Higher-dimensional searches are easily feasible with the Bayesian Optimization Structure Search (BOSS) package, but four dimensions already illustrate the efficiency of Bayesian optimization over grid search.
The objective of this manuscript is to evaluate the effectiveness and accuracy of BOSS against grid search in optimization problems with up to four dimensions. We show that already in 4 dimensions, BOSS outperforms grid search in terms of efficiency. In addition, we present score function landscapes generated across the hyperparameter phase space, and analyse them as a function of dataset size and composition. These landscapes provide insight into how the behavior of the machine learning model changes across a range of possible model configurations. Such insight helps machine learning practitioners to choose possible starting points for similar optimization problems.
The manuscript is organized as follows. Section II introduces the basic principle of machine learning with kernel ridge regression and illustrates how molecules are represented to the algorithm. The concept of hyperparameter tuning with grid search and Bayesian optimization is explained. In Section III, these two methods are applied to adjust the hyperparameters for our kernel ridge regression model which predicts molecular energies of three molecular datasets. We visualize and discuss our results. Conclusions and outlook are presented in the last section.

II. METHODS

A. Machine learning model
We employ kernel ridge regression (KRR) to predict molecular orbital energies for three distinct datasets of organic molecules: the QM9 dataset of 134k small organic molecules 19 , the AA dataset of 44k conformers of proteinogenic amino acids 21 and the OE dataset of 62k organic molecules 28 . The three datasets have been described in detail in our previous KRR work and we refer the interested reader to Ref. 29 or the original references for each dataset for more information.
In KRR, a scalar target property, here the energy of the highest occupied molecular orbital (HOMO), is expressed as a linear combination of kernel functions:

E(M) = Σ_i w_i k(M, M_i),   (1)

where M_i is the descriptor for molecule i and the sum runs over all training molecules. The w_i are the regression weights that need to be learned.
In the scope of this work, we employ two kernel functions: the Gaussian kernel

k(M, M′) = exp(−‖M − M′‖₂² / 2γ²),   (2)

which is a function of the Euclidean distance between two molecules M and M′, and the Laplacian kernel

k(M, M′) = exp(−‖M − M′‖₁ / γ),   (3)

which is based on the 1-norm to measure the similarity between two molecules. In both cases, γ is the kernel width that determines the resolution in molecular space. The regression parameters w_i are obtained from the minimization problem

min_w Σ_i (E_i − E_i^ref)² + α wᵀKw,   (4)

where E_i^ref are the known reference HOMO energies in the dataset, K is the kernel matrix (K_ij := k(M_i, M_j)) and w is the regression weight (w_i) vector. The scalar α controls the size of the regularization term and penalizes complex models with large regression weights over simpler models with small regression weights. Equation 4 has an analytic solution that determines w:

w = (K + αI)⁻¹ E^ref.   (5)

We here explicitly distinguish between the regression weights w_i and the hyperparameters of the machine learning model. The regression weights grow in number with increasing training data size and are given in closed mathematical form by eq. 5. Conversely, the hyperparameters are finite in number. In KRR, for example, the number of hyperparameters is fixed to two: α and γ. These two hyperparameters can assume any value within certain sensible ranges; their optimal values have to be determined by an external hyperparameter tuning procedure. In addition, there are model-specific choices, which could be interpreted as special hyperparameters, that can only assume certain values. For KRR, this is the choice of kernel.
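The analytic solution of eqs. 4 and 5 can be sketched in a few lines of NumPy. This is an illustrative implementation (with γ treated as a kernel width, matching the description above), not the scikit-learn code used later in this work:

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """Gaussian kernel exp(-||M - M'||^2 / (2 gamma^2)) between rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * gamma ** 2))

def krr_fit(X_train, y_train, alpha, gamma):
    """Regression weights from the analytic solution w = (K + alpha I)^(-1) E_ref."""
    K = gaussian_kernel(X_train, X_train, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(X_train)), y_train)

def krr_predict(X_train, w, X_new, gamma):
    """Target property as a linear combination of kernel functions (eq. 1)."""
    return gaussian_kernel(X_new, X_train, gamma) @ w
```

With a vanishingly small regularization α, the model interpolates the training targets; increasing α trades training accuracy for smaller regression weights.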

B. Molecular representation
One important aspect in machine learning is the representation of the input data to the machine learning algorithm. Here we employ the Coulomb matrix (CM) 22 and the many-body tensor representation (MBTR) 12 . We use the DScribe package 1 to generate both descriptors for the datasets in this work.
The entries of the CM are given by

M_ij = 0.5 Z_i^2.4            for i = j,
M_ij = Z_i Z_j / ‖R_i − R_j‖   for i ≠ j.

The CM encodes the nuclear charges Z_i and corresponding Cartesian coordinates R_i of all atoms i in molecule M . The off-diagonal elements represent a Coulomb repulsion between atom pairs and the diagonal elements have been fitted to the total energy of the corresponding atomic species in the gas phase. To enforce permutational invariance, the rows and columns of the CM are sorted with respect to their ℓ2-norm.

The MBTR encodes molecular structures by decomposing them into a set of many-body terms (species, interatomic distances, bond angles, dihedral angles, etc.), as outlined for the example of a CO2 molecule in Figure 1. Each many-body level is represented by a set of fixed-size vectors. The symbol k enumerates the many-body level; we here include terms up to k=3. One-body terms (k=1) encode all atom types (species) present in the molecule. Two-body terms (k=2) encode pairwise inverse distances between any two atoms (bonded and non-bonded). Three-body terms (k=3) add angular distributions for any triple of atoms. A geometry function g_k is used to transform each configuration of k atoms into a single scalar value. These scalar values are then Gaussian broadened into continuous representations D_k:

D_k(x) = 1/(σ_k √(2π)) exp(−(x − g_k)² / 2σ_k²).

The σ_k's are the feature widths for the different k-levels and x runs over a predefined range [x_min^k, x_max^k] of possible values for the geometry functions g_k. For k = 1, 2, 3, the geometry functions are given by the atomic number g_1(Z_l) = Z_l, the inverse distance g_2(R_l, R_m) = 1/‖R_l − R_m‖ and the cosine of the angle spanned by three atoms g_3(R_l, R_m, R_n). For each possible combination of chemical elements present in the dataset, a weighted sum of distributions D_k is generated. For k = 1, 2, 3, these final distributions are given by

MBTR_1(x) = Σ_l w_1^l D_1^l(x),
MBTR_2(x) = Σ_{l,m} w_2^{l,m} D_2^{l,m}(x),
MBTR_3(x) = Σ_{l,m,n} w_3^{l,m,n} D_3^{l,m,n}(x),

where the sums for l, m, and n run over all atoms with atomic numbers Z_1, Z_2 and Z_3. The w_k are weighting functions that balance the relative importance of different k-terms and/or limit the range of inter-atomic interactions. For k = 1, usually no weighting is used (w_1^l = 1).
For k = 2 and k = 3, the following exponential decay functions are implemented in DScribe:

w_2^{l,m} = exp(−s_2 ‖R_l − R_m‖),
w_3^{l,m,n} = exp(−s_3 (‖R_l − R_m‖ + ‖R_m − R_n‖ + ‖R_l − R_n‖)).

The parameter s_k effectively tunes the cutoff distance. The functions MBTR_k(x) are then discretized with n_k points in the respective intervals [x_min^k, x_max^k].
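As a toy illustration of the broadening step (not DScribe's actual implementation; the inverse distances, weights and width below are made up), the k = 2 term for a single element pair can be assembled by summing weighted, normalized Gaussians on a fixed grid:

```python
import numpy as np

def mbtr_k2_term(inv_distances, weights, sigma, x_min=0.0, x_max=1.0, n=200):
    """Gaussian-broaden the scalar values g_2 = 1/|R_l - R_m| onto an
    n-point grid, each contribution scaled by its weighting function."""
    x = np.linspace(x_min, x_max, n)
    D = np.zeros(n)
    for g2, w in zip(inv_distances, weights):
        D += w * np.exp(-(x - g2) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
    return x, D

# Two equal C-O inverse distances (a CO2-like pair list), unit weights:
x, D = mbtr_k2_term([0.85, 0.85], [1.0, 1.0], sigma=0.05)
```

The feature width σ_2 directly controls how sharply individual distances are resolved on the grid, which is why it is one of the hyperparameters optimized below.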

C. Number and choice of hyperparameters
In this section we review the hyperparameter types in our CM- and MBTR-based KRR models and motivate our choice of which hyperparameters to investigate in more detail. Table I gives an overview of all hyperparameters in this work. In total, there would be 3 hyperparameters to optimize for CM-KRR and 17 for MBTR-KRR. Previous work has shown that some hyperparameters have little effect on the model. They can be pre-optimized and set as defaults for the optimization of the remaining hyperparameters. We explain this choice in more detail in the following. The KRR method has 3 hyperparameters: γ and α are continuous variables and need to be optimized, whereas the kernel choice can only assume certain finite values (0 and 1 in our case). We found in previous work 29 that the Laplacian kernel is more accurate for the CM representation and the Gaussian kernel for the MBTR. We therefore fix this choice also in this work and only optimize the 2 parameters γ and α.
The CM has no hyperparameters. Conversely, the MBTR introduces many. This indicates that the MBTR offers a more complex representation that could lead to faster learning for the same machine learning algorithm. This is indeed what we observed in our previous work comparing CM-KRR and MBTR-KRR 29 . However, the learning improvement comes at the price of a large number of hyperparameters that need to be optimized to achieve a good model.
As Tab. I illustrates, the MBTR introduces a total of 14 hyperparameters. In previous work 1,29 , we found that our KRR models were not sensitive to the grid discretization parameters. We therefore fix the grids to the range [0, 1] for k = 2 and [-1, 1] for k = 3, with 200 discretization points each, and leave them unchanged for the rest of this study. We also found that the k=1 term does not improve the learning for the three datasets under investigation here 29 and therefore omit the σ1 hyperparameter.
The molecules in our datasets are relatively small. We therefore do not need to limit the range of the MBTR and set s2 = s3 = 0. This leaves us with 2 MBTR hyperparameters, the two feature widths σ2 and σ3. The minimum and maximum values of all hyperparameters in this study are listed in Table II.

D. Hyperparameter tuning
Let z be a set of n hyperparameters z = (z_1, z_2, ..., z_n), the boundaries of which define the hyperparameter search domain Z, such that z ∈ Z. The score function f(z) ∈ ℝ is a black-box function defined on the phase space Z. The aim of hyperparameter optimization for a given machine learning model is to find the set of hyperparameters ẑ that provides the best model performance ŷ, as measured on a validation set:

ẑ = argmin_{z ∈ Z} f(z).

The search for ẑ requires sampling the phase space Z through repeated f(z) evaluations. Unfortunately, computing the objective function can be expensive. For each set of hyperparameters, it is necessary to train a model on the training data, make predictions on the validation data, and then calculate the validation metric. With an increasing number of hyperparameters, large datasets and complex models, this process quickly becomes intractable to do by hand. Therefore, automated hyperparameter tuning methods are indispensable tools for model building in machine learning.

TABLE II. Search ranges of the hyperparameters optimized in this study.
Hyperparameter   lower bound   upper bound
α                1e-10         1
γ                1e-10         1e-3
σ2               1e-6          1
σ3               1e-6          1
In this study, we compare two approaches for hyperparameter tuning: grid search and Bayesian optimization. The former is guaranteed to find the optimal solution ẑ; the latter is a statistical method with a high probability of finding ẑ. Our score function f(z) is the mean absolute error (MAE) of the predicted HOMO energies, in units of eV.
FIG. 2. Working principles of hyperparameter tuning methods. a) In grid search, the hyperparameter space is mapped onto a grid, and the score function is evaluated for each combination of hyperparameter values. b) In Bayesian optimisation, a Gaussian process model is built to simulate the MAE landscape. The landscape is refined by sampling the score function across the hyperparameter space.

Grid Search
Grid search employs a grid of evenly spaced values for each hyperparameter z to discretise the entire phase space Z, as illustrated in Figure 2 a). The train-predict-evaluate cycle is then run automatically in a loop to evaluate the MAE for all hyperparameter configurations on the grid. Here, we rely on the scikit-learn implementation of KRR, but we eschew its native grid search function 'sklearn.model_selection.GridSearchCV' in favour of our own algorithms, designed specifically for explicit evaluation of computational cost. Algorithms 1 and 2 demonstrate how the 2D and 4D grid searches were performed with the different materials descriptors.
When the CM is used as molecular descriptor, we first compute a CM for each molecule in the dataset, as described in Algorithm 1. We shuffle the CMs and distribute them into five equally sized groups, along with their corresponding HOMO energies. We then set up a grid for the KRR hyperparameters α and γ, with 11 values for each hyperparameter, resulting in 121 grid points. A 5-fold cross-validated KRR routine is then performed for each possible combination of α and γ. As usual in cross-validation, one of the split-off groups is defined as the validation set and the remaining four groups combined serve as the training set. A KRR model with the Laplacian kernel and the current combination of α and γ is trained on the training set, and the prediction error MAE (f(α, γ)) is computed on the validation set. For the next round of cross-validation, we define another group as the validation set and the remaining four groups as the training set to obtain a second MAE. In the same manner, further rounds of cross-validation are performed until each of the 5 groups has been used as validation set once and as training set 4 times, resulting in 5 MAEs. Their average serves as the measure of how well the current combination of KRR hyperparameters performs for that grid point. Once we obtain an average MAE for each combination of α and γ on the grid, we pick the combination that results in the lowest MAE.
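A condensed sketch of this cross-validated 2D grid search (not the exact code of Algorithm 1; the random features below merely stand in for actual Coulomb matrices) could look as follows, using scikit-learn's KernelRidge with the Laplacian kernel:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def grid_search_2d(X, y, alphas, gammas, n_folds=5, kernel="laplacian"):
    """Exhaustive search over (alpha, gamma); returns the grid of
    cross-validated average MAEs and the best parameter combination."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    mae = np.zeros((len(alphas), len(gammas)))
    for i, a in enumerate(alphas):
        for j, g in enumerate(gammas):
            fold_maes = []
            for train_idx, val_idx in kf.split(X):
                model = KernelRidge(alpha=a, gamma=g, kernel=kernel)
                model.fit(X[train_idx], y[train_idx])
                fold_maes.append(
                    mean_absolute_error(y[val_idx], model.predict(X[val_idx])))
            mae[i, j] = np.mean(fold_maes)   # average over the 5 folds
    best = np.unravel_index(mae.argmin(), mae.shape)
    return mae, alphas[best[0]], gammas[best[1]]
```

The total cost is the number of grid points times the number of folds, which is what makes exhaustive search expensive as the grid dimensionality grows.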
For the MBTR, we investigate 2D and 4D hyperparameter optimizations. The 2D case proceeds analogously to the CM KRR hyperparameter search, after we pick values for σ2 and σ3 and hold them fixed. The algorithm for the 4D case is depicted in Algorithm 2. We loop over σ2 and σ3 on a logarithmic grid of six points each and build the MBTR for the dataset for those values. Like the CM, this MBTR is then split into 5 subsets for cross-validation. We then enter the α and γ optimization as in the 2D grid search.

Bayesian Optimization
With Bayesian optimization, we build a surrogate model for the MAE across the search domain, then iteratively refine it until convergence 9,20,27 . Once the MAE surrogate landscape is known, it can be minimized efficiently to find the optimal choice of hyperparameters at its global minimum location. In this work, we rely on the Bayesian Optimization Structure Search (BOSS) package 31 for simple and robust Bayesian optimization in physics, chemistry and materials science.
The Bayesian optimization algorithm, illustrated in Figure 2 b), features a two-step procedure of Gaussian process regression (GPR) followed by an acquisition function. In the GPR step, the MAE surrogate model is computed as the posterior mean of a Gaussian process (GP), given MAE data. This step produces an effective landscape of the MAE in hyperparameter space, which can be viewed and analysed. While the posterior mean is the statistically most likely fit to the MAE data, the computed posterior variance (uncertainty) indicates which regions of the hyperparameter space are less well known. Both the mean and variance are then used to compute the eLCB acquisition function. The global minimum of the acquisition function points to the combination of hyperparameters in phase space to be tested next. Once this point is evaluated, the resulting MAE is added to the dataset and the cycle repeats. With each additional data point, the MAE surrogate model improves. The method features variance and lengthscale hyperparameters encoded in the radial basis function (RBF) kernel of the Gaussian process, but these are refined autonomously along with the GPR model.
In this active learning technique, the data is collected at the same time as the model training is performed. The acquisition strategy combines exploitation (searching near known minima) and exploration (searching previously unvisited regions of phase space) to quickly identify the regions of hyperparameter phase space where the MAE is low. This allows us to identify the optimal combination of hyperparameters with relatively few MAE evaluations.
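The exploit/explore cycle can be sketched with generic tools. The code below is a minimal stand-in using scikit-learn's Gaussian process and a plain lower-confidence-bound acquisition evaluated on random candidate points, whereas BOSS employs the eLCB acquisition and its own GP refinement:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bayesian_minimize(f, bounds, n_init=5, n_iter=25, beta=2.0, seed=0):
    """Minimize a black-box score function f over a box-shaped domain.
    A GP surrogate is fit to all observations; the next acquisition is
    the minimizer of the lower confidence bound mu(z) - beta*std(z)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))   # initial data
    y = np.array([f(x) for x in X])
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                  alpha=1e-6, normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        # exploration/exploitation trade-off via the LCB acquisition
        cand = rng.uniform(lo, hi, size=(512, len(bounds)))
        mu, std = gp.predict(cand, return_std=True)
        z_next = cand[np.argmin(mu - beta * std)]
        X = np.vstack([X, z_next])
        y = np.append(y, f(z_next))
    best = int(np.argmin(y))
    return X[best], y[best]
```

Each iteration costs one evaluation of f plus a cheap GP fit, so the number of expensive score function evaluations stays small compared to an exhaustive grid.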
BOSS requires only the ranges of the hyperparameters as input to define the phase space domain, after which it launches a fully automated n-dimensional search for the best combination of hyperparameters. Acquisitions are made according to Algorithms 3 and 4, which serve as the evaluation function. For each new acquisition, the molecular descriptor (either CM or MBTR) is computed and KRR with 5-fold cross-validation is performed. The average MAE is returned to BOSS to refine the MAE surrogate model and to perform the next acquisition.
Once the n-dimensional MAE surrogate models are converged, we can evaluate model accuracy qualitatively and model predictions quantitatively. Model predictions are summarised by the location of the global minimum in hyperparameter space ẑ and its value in the surrogate model µ(ẑ). Although µ(z) values should be close to the true f(z) score function values, we additionally evaluate f(ẑ) to validate the match throughout the convergence cycle. This way we can determine that the model does converge, and that it converges to the true function f(z).

III. RESULTS

In this section, we examine the performance of both grid search and BOSS in tuning the hyperparameters of KRR-based machine learning models for predicting molecular HOMO energies from molecular structures. An important objective is to establish whether BOSS is capable of finding similar hyperparameter solutions to the grid search algorithm, which is guaranteed to succeed. The BOSS solutions ẑ, f(ẑ) and µ(ẑ) are presented in Table V alongside equivalent results from the grid search.
In this study we consider the case of CM and MBTR descriptors, which changes the dimensionality and complexity of the search. Concurrently, we analyze the effect of dataset type and training set size on the hyperparameter tuning procedure and optimal solutions. Finally, we compare timings of grid search and BOSS to estimate which approach is more efficient in each of the above described settings.

A. KRR-CM hyperparameter tuning
The CM materials descriptor has no parameters and the KRR kernel choice is clear. The MAE is thus a 2-dimensional function of the KRR hyperparameters: regularization strength α and kernel width γ. Figure 3 shows the 2-dimensional MAE landscapes of a CM-KRR model for the QM9 dataset and a training set size of 2k. Panel a) depicts the grid search and panel b) the BOSS results, with logarithmic axes for clarity. All BOSS searches in this and subsequent sections are converged with respect to the number of acquisitions; the detailed convergence analysis will be presented in Section III D. It is clear that BOSS and grid search produce qualitatively similar MAE landscapes. The grid search landscape is naturally "pixelated", because it only has a 10×10 resolution in Fig. 3. Conversely, BOSS is not constrained to a grid, and the Gaussian process in BOSS interpolates the MAE between the BOSS acquisitions.
Visualizing the MAE landscape tells us that the optimal parameter region has a complex and at first sight non-intuitive shape. The lowest MAE values in both methods lie on the diagonal of the hyperparameter landscape and along a horizontal line at the bottom of the landscape (for which γ ∼ 10⁻³). Thus, the optimal parameter space has two parts: a co-dependent part, in which the choice of α and γ is equally important for the KRR accuracy, and a quasi-one-dimensional part, in which only the choice of γ matters and α can assume any value. We will return to the analysis of this hyperparameter behavior in Section III C.
Grid search and BOSS locate almost identical MAE minima: 0.277 eV for grid search and 0.270 eV for BOSS. The optimal hyperparameter solutions ẑ found by BOSS and grid search lie in the same hyperparameter region of low log(α) values and high log(γ) values. We conclude that BOSS is capable of reproducing grid search solutions to 1% accuracy in this case. Since BOSS and grid search produce qualitatively and quantitatively similar results, we will only consider BOSS MAE landscapes in the remaining discussion.

Figure 4 illustrates BOSS MAE landscapes for increasing training set sizes of 1k, 2k, 4k and 8k molecules taken from the QM9 dataset. Tables IV and V summarize the optimal hyperparameter search. We find that the MAE landscapes become more homogeneous with increasing dataset size and that the optimal MAE decreases (as expected). The optimal hyperparameter solutions ẑ vary somewhat, but are always found within the horizontal region at large γ values. The slight variation is an indication of the flat MAE landscape in the triangular solution space, on which many hyperparameter combinations yield low MAE values.
Next, we compare the hyperparameter landscapes across the three different molecular datasets QM9, AA and OE, as shown in Fig. 5. The model was trained on a subset of 4k molecules for each dataset. The dependence of the MAE on the two KRR hyperparameters α and γ is the same for all three datasets, with the triangular region of optimal hyperparameters and flat MAE minima. However, the detailed dependence varies qualitatively with the dataset. For AA, the triangle is filled and we find a wide range of optimal hyperparameters in the landscape. This indicates that the KRR model is not overly sensitive to α and γ for KRR-CM learning of the AA dataset. Conversely, for OE, the diagonal is more pronounced than the large-γ solution that dominates for QM9. The horizontal line of optimal hyperparameters has not yet developed for this training set size, indicating that broad feature widths γ only lead to optimal learning when the regularization α is large. The optimal combination ẑ of hyperparameters differs across the three datasets (see Tables IV-VII for the optimal solutions ẑ and corresponding MAEs f(ẑ) and µ(ẑ)). The optimal MAE values also vary considerably across the three datasets. This is in accordance with our previous work, which revealed that the predictive power of KRR inherently depends on the complexity of the underlying dataset 29 .

B. KRR-MBTR 4D hyperparameter tuning
We now consider the results of the 4D optimization problem, where the MBTR is used as molecular descriptor. BOSS builds a 4-dimensional surrogate model MAE(α, γ, σ2, σ3), which we compare to the reference 4-dimensional MAE landscape produced by grid search. Both 4D landscapes can be analyzed by considering 2-dimensional cross-sections. In Figure 6, we compare the BOSS surrogate model to the grid search result. Figures 6 a) and b) illustrate the (log(α), log(γ)) cross-section of the four-dimensional MAE landscapes, extracted at the global minimum ẑ with optimal σ2, σ3 values. Similar to the MAE landscapes of the 2D optimization problem, the optimal values lie on a diagonal and on a horizontal line at the bottom of the map. A notable difference to the previously discussed 2D case are the lower overall prediction errors. This is in line with our previous finding 29 that the MBTR encodes the atomic structure of a molecule better than the CM.
In Figure 6 c) and d), the 4D MAE landscapes are cut through the (log(σ2), log(σ3)) plane, while α and γ are held constant at their optimal values. Here, the optimal MAEs are found only within a small region. In contrast to the KRR hyperparameters, however, the MAE is barely sensitive to σ2 and σ3, varying only in the second decimal place across the map (about 10% of the value). All combinations of σ2 and σ3 are reasonably good choices for learning in this case.
FIG. 6. MAE landscapes from the 4D hyperparameter optimization problem with the MBTR descriptor. Panels a) and b) show 2D slices through the logarithmic (α, γ) plane, while panels c) and d) show 2D slices through the logarithmic (σ2, σ3) plane. Panels a) and c) present the grid search MAE; panels b) and d) present the BOSS MAE surrogate model µ(z). Both grid search and BOSS were applied to a subset of 2k molecules taken from the QM9 dataset. Optimal hyperparameters are shown as red stars.
As in the two-dimensional CM case, BOSS and grid search reveal qualitatively consistent hyperparameter landscapes and optimal solutions ẑ, f(ẑ) and µ(ẑ) (see Table V). Therefore, we will only discuss MAE landscapes produced by BOSS in the following.
In Figure 7, we present BOSS MAE landscapes for the three different molecular datasets, where a subset of 2k molecules of each dataset was used for KRR. The comparison of the three MAE landscapes reveals that the QM9 dataset of small organic molecules is easiest to learn among all three datasets, since the MAE values are lowest. This is in accordance with the previous case of KRR-CM. We observe the highest MAEs for the AA dataset, which is most difficult to learn due to the higher chemical complexity of the molecules.
Panels a)-c) show MAE landscapes in the logarithmic (α, γ) plane and reveal the familiar diagonal pattern containing the lowest MAE. Compared to the 2D case with the CM descriptor, the landscapes are more homogeneous and the location ẑ of optimal hyperparameters lies within the same region for all three datasets. The choice of α and γ seems to be independent of the dataset. Panels d)-f) show MAE landscapes in the logarithmic (σ2, σ3) plane. All three datasets feature a cross-like shape of low MAE values. For QM9 and OE, the optimal MAE lies roughly on the crossover point. For these two datasets, the σ2 and σ3 values do not have a significant influence on the MAE range. For the AA dataset, in contrast, the choice of σ2 and σ3 dramatically affects the quality of learning and must be set correctly.

C. Interpretation of MAE landscapes
We now discuss the distinctly different optimal regions of the different hyperparameter classes. Figure 8 schematically depicts the shapes of these optimal hyperparameter regions for the KRR parameters (α-γ plane in panel a)) and the MBTR feature widths (σ2-σ3 plane in panel b)).
To understand the triangular shape in panel a), we have to recall the two kernels (eqs. 2 and 3) and the KRR regularization (eq. 4). For large feature widths γ, the abstract molecular space in which we measure distances between molecules M and M′ is filled with broad Laplacians or Gaussians. The kernel expansion in eq. 4 then picks up a contribution from almost any molecular pair. For smaller values of the feature width γ, the Laplacians or Gaussians in molecular space become narrower, until, in the limit of an infinitely small γ, we obtain delta functions. The narrower the expansion functions are in molecular space, the larger the expansion coefficients need to be to give finite target properties. When the expansion coefficients increase in size, however, the regularization parameter needs to decrease to keep the penalty term small. γ and α become co-dependent, which explains the diagonal line in the triangle.
For the MBTR hyperparameters, the situation is qualitatively different. Although σ₂ and σ₃ are also feature widths, they control the broadening of features in the structural representation of molecular bond distances and angles. For large broadenings (i.e. large σ₂ and σ₃), peaks associated with individual features in the MBTR may merge, and the MBTR loses resolution. For very small broadenings (i.e. very small σ₂ and σ₃), features are represented by very narrow peaks, which may not be captured by the MBTR grids; the MBTR again loses resolution. The hyperparameter sweet spot therefore lies in a roughly circular region of moderate σ₂ and σ₃ values. For our molecules and datasets, the optimal σ₂ and σ₃ values lie between 10⁻² and 10⁻¹, i.e. closer to the bottom left corner of the hyperparameter landscape.
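This resolution argument can be made concrete with a toy one-dimensional, MBTR-like broadening; the grid, bond distances and σ values below are invented for illustration.

```python
import numpy as np

def broadened_spectrum(values, sigma, grid):
    """Toy MBTR-like 1D term: a Gaussian of width sigma centred at each
    value (e.g. each bond distance), summed on a fixed grid."""
    s = np.zeros_like(grid)
    for v in values:
        s += np.exp(-((grid - v) ** 2) / (2.0 * sigma ** 2))
    return s

def count_peaks(s):
    # interior local maxima as a crude measure of resolution
    return int(np.sum((s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])))

grid = np.linspace(0.0, 3.0, 300)   # grid spacing ~0.01
bonds = [1.0, 1.2]                  # two nearby (invented) bond distances

print(count_peaks(broadened_spectrum(bonds, sigma=0.05, grid=grid)))  # 2 -> resolved
print(count_peaks(broadened_spectrum(bonds, sigma=0.5, grid=grid)))   # 1 -> merged

# With sigma far below the grid spacing, the narrow peaks fall between
# grid points and the features all but vanish:
print(broadened_spectrum(bonds, sigma=0.001, grid=grid).max())
```

Both failure modes, merged peaks at large σ and undersampled peaks at small σ, degrade the descriptor, which is why the sweet spot sits at moderate broadenings.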
This section demonstrates that visualizing hyperparameter landscapes greatly facilitates our understanding of hyperparameter behavior in KRR and in machine learning in general. The BOSS method provides an efficient way of generating easily readable landscapes that enable a deeper analysis of machine learning models.
D. Convergence, scaling and computational cost

1. Convergence

Figure 9 illustrates the convergence of BOSS as a function of iterations, using the CM and the MBTR as molecular descriptors, for different training set sizes. Here, we track the global minimum location ẑ in the landscape as the surrogate model improves, compute its true MAE value f(ẑ) and record the lowest value observed so far. At the predefined maximum number of iterations (100 for 2D and 300 for 4D), the model no longer changes, so we adopt the final MAE f(ẑ) as zero reference. We then subtract this reference from the sequence of lowest observed MAE values to obtain the bare convergence ∆f(ẑ) of the MAE with BOSS iteration steps. Figure 9 shows that ∆f(ẑ) drops quickly with BOSS iterations. Since the best MAE resolution we achieved with grid search was 0.02 eV, we define the BOSS convergence criterion as ∆f(ẑ) ≤ 10⁻². We find that in the 2D case (KRR-CM in Fig. 9a)), the BOSS solution converges in fewer than 20 iterations, regardless of training set size. In the 4D hyperparameter search (KRR-MBTR in Fig. 9b)), BOSS reaches convergence in fewer than 50 iterations for dataset sizes of 1k and 2k. For a dataset size of 4k, it takes almost 100 iterations to reach convergence. Some variation in convergence behaviour is expected, and averaging over repeated runs would provide more robust statistics in future work. Nonetheless, the best hyperparameter solutions were clearly found within 100 iterations in all scenarios.
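The bookkeeping behind ∆f(ẑ) can be sketched as follows; the MAE sequence is invented for illustration, only the procedure mirrors the text.

```python
import numpy as np

# Hypothetical true MAE values f(z_hat) (in eV) at the predicted optimum,
# one per BOSS iteration; the numbers are invented for illustration.
mae_per_iter = np.array([0.45, 0.31, 0.34, 0.25, 0.26, 0.24, 0.24, 0.23, 0.23])

best_so_far = np.minimum.accumulate(mae_per_iter)  # lowest MAE observed so far
reference = best_so_far[-1]                        # final MAE as zero reference
delta_f = best_so_far - reference                  # bare convergence of the MAE

# Converged once delta_f drops to the grid-search MAE resolution (10^-2)
converged_at = int(np.argmax(delta_f <= 1e-2))
print(converged_at)  # 5
```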

2. Formal computational scaling
Formally, the total computational time of a KRR run for a fixed training set size can be estimated as t_total = n_desc · t_desc + n_KRR · t_KRR + t_process. Here, t_desc is the average time to build the molecular descriptor for all molecules and n_desc is the number of times the descriptor has to be generated. t_KRR is the average time to perform the 5-fold cross-validated KRR step that determines the regression coefficients w, and n_KRR is the number of times this step has to be performed. Finally, t_process is the extra time used by the BOSS method to refine the surrogate model and determine the location of the next data point acquisition. To collect run time information, we added timing statements to our KRR implementation. Figure 10 depicts the average time the BOSS and grid search algorithms need to build the molecular descriptor and to run cross-validated KRR, as a function of training set size for KRR-CM and KRR-MBTR. Panels a) and b) show the timings for grid search, and panels c) and d) for BOSS. We observe that our formal scaling estimates are confirmed: the average time for building the CM (t_CM) or the MBTR (t_MBTR) grows linearly with training set size, while t_KRR grows cubically. Hence, for the smallest training set size of 1k, performing KRR may take less time than building the molecular descriptor, while for training set sizes of 2k and larger the cubic scaling of the KRR step has already overtaken the descriptor building.
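As a sketch, the cost model reads (the timings are hypothetical placeholders, not measured values):

```python
def total_time(n_desc, t_desc, n_krr, t_krr, t_process=0.0):
    """Cost model from the text:
    t_total = n_desc*t_desc + n_KRR*t_KRR + t_process."""
    return n_desc * t_desc + n_krr * t_krr + t_process

# Hypothetical per-call timings for a 1k training set; descriptor building
# scales roughly linearly with set size, the KRR step roughly cubically.
t_desc_1k, t_krr_1k = 5.0, 3.0
t_desc_4k, t_krr_4k = 4 * t_desc_1k, 4 ** 3 * t_krr_1k

# At 1k the descriptor step can dominate; by 4k the cubic KRR step has
# clearly overtaken it.
print(t_desc_1k > t_krr_1k, t_krr_4k > t_desc_4k)  # True True
```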

3. Computational time
The total computing time as a function of training set size is presented in Fig. 11. In the 2D case the grid search outperforms BOSS, while in the 4D case BOSS is significantly faster than grid search. Which approach is faster comes down to how often the descriptor has to be built (n_desc) and how often cross-validated KRR has to be performed (n_KRR) in each method. Table III shows both numbers for grid search and BOSS. In grid search, n_desc and n_KRR are fixed numbers (see Algorithms 1 and 2) that depend only on the size of the hyperparameter grid. In KRR-CM, the CM needs to be computed only once, since it carries no hyperparameters of its own. In BOSS, the molecular descriptor must be built and cross-validated KRR must be performed every single time the objective function is evaluated, as shown in Algorithms 3 and 4. This means that n_desc and n_KRR depend solely on the number of BOSS iterations required to converge the MAE landscapes.
Since t_KRR scales cubically, the critical number is n_KRR. As Tab. III illustrates, n_KRR is roughly the same for grid search and BOSS in the 2D search. Already for a 4D search, however, BOSS requires significantly fewer KRR evaluations than grid search and is therefore computationally much more efficient.
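The difference in n_KRR can be illustrated with back-of-the-envelope counts. The grid of 10 points per dimension below is an assumption for illustration; the actual grid sizes are those listed in Table III.

```python
def grid_search_n_krr(points_per_dim, n_dims):
    # Grid search runs cross-validated KRR once per grid point
    return points_per_dim ** n_dims

def boss_n_krr(n_iterations):
    # BOSS runs cross-validated KRR once per objective-function evaluation
    return n_iterations

# Hypothetical 10-point-per-dimension grids vs the BOSS iteration caps
n_grid_2d, n_grid_4d = grid_search_n_krr(10, 2), grid_search_n_krr(10, 4)
n_boss_2d, n_boss_4d = boss_n_krr(100), boss_n_krr(300)

print(n_grid_2d, n_boss_2d)   # 100 100   -> comparable in 2D
print(n_grid_4d, n_boss_4d)   # 10000 300 -> BOSS far cheaper in 4D
```

The exponential growth of the grid with dimensionality is what tips the balance toward Bayesian optimization beyond two hyperparameters.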

IV. CONCLUSION
In this work, we have used the Bayesian optimization tool BOSS to optimize the hyperparameters of a KRR machine learning model that predicts molecular orbital energies. We employed two different molecular descriptors, the CM and the MBTR. While the CM has no hyperparameters of its own, the MBTR introduces two extra hyperparameters to the optimization problem. We therefore performed BOSS searches in spaces of up to four dimensions. We compared MAE landscapes in hyperparameter space and the efficiency of the BOSS approach with the commonly used grid search approach for three different molecular datasets.
For CM as molecular descriptor, only the two KRR hyperparameters α and γ need to be optimized. The 2D landscapes in hyperparameter space produced by BOSS and grid search agree very well, with the lowest MAE values lying on a diagonal and a horizontal line. This is the case for all three datasets and for all training set sizes.
For MBTR as molecular descriptor, cuts of the MAE landscape through the (log(α), log(γ))-plane qualitatively correspond to the MAE landscapes of the 2D optimization problem, while the overall prediction errors are notably lower. In the (σ₂, σ₃)-plane, the optimal MAE values are confined to a small, roughly spherical region. The MAE is not very sensitive to σ₂ and σ₃, in contrast to α and γ. Hence, all combinations of σ₂ and σ₃ are reasonable choices for the MBTR.
[Displaced table caption: ... for the optimal set of hyperparameters found by BOSS and grid search (GS). Results are shown for the three molecular datasets QM9, AA and OE. For BOSS, the first value is the best observed true function value f(ẑ), evaluated at the predicted optimal point ẑ; the second value, in square brackets, is the global minimum µ(ẑ) predicted by the surrogate model at the maximum number of iterations. For GS, the value is the best model performance f(ẑ).]
In terms of efficiency, grid search outperforms BOSS in the 2D case, while in the 4D case BOSS is significantly faster than grid search. This paves the way for high-dimensional hyperparameter optimization with Bayesian optimization.