Bayesian optimization in ab initio nuclear physics


Abstract. Theoretical models of the strong nuclear interaction contain unknown coupling constants (parameters) that must be determined using a pool of calibration data. In cases where the models are complex, leading to time consuming calculations, it is particularly challenging to systematically search the corresponding parameter domain for the best fit to the data. In this paper, we explore the prospect of applying Bayesian optimization to constrain the coupling constants in chiral effective field theory descriptions of the nuclear interaction. We find that Bayesian optimization performs rather well with low-dimensional parameter domains and foresee that it can be particularly useful for optimization of a smaller set of coupling constants. A specific example could be the determination of leading three-nucleon forces using data from finite nuclei or three-nucleon scattering experiments.

Introduction
Mathematical optimization plays a central role in natural science. Indeed, most theoretical predictions are preceded by a calibration stage whereby the parameters of the model are optimized to reproduce a selected set of calibration data. In nuclear physics, the coupling constants of any theory of the strong interaction between protons and neutrons (nucleons) must be determined from experimental data before one can attempt to solve the Schrödinger equation to make quantitative predictions of the properties of atomic nuclei.
Typically, measured low-energy nucleon-nucleon (NN) cross sections and the properties of light nuclei with mass number A ≤ 4 have been used for calibrating the NN and three-nucleon (NNN) interaction sectors of the nuclear force, see e.g. Refs. [1,2,3] and references therein. However, it is still an open question, with several parallel and ongoing research efforts [4,5,6,7], how to best constrain the unknown coupling constants in theoretical descriptions of the nuclear interaction and how to incorporate the covariance structure of the experimental data and the model discrepancy [5,8,9,10]. Recent theoretical studies [7,11,12,13] indicate that NN scattering data and the properties of very light nuclei do not contain enough information to constrain all directions in the parameter domain sufficiently well. Instead, it has been proposed that the pool of fit data needs to be augmented with complementary data types, such as NNN scattering cross sections, the ground-state properties of nuclei with A > 4, or even empirical properties of infinite nuclear matter. However, modeling such observables is typically much more complex and requires substantial computational effort, ranging from hours to days for just one model evaluation, even on a supercomputer. Consequently, the optimization of the underlying model parameters becomes difficult. The main focus of the present work is to investigate a possible strategy for mitigating this computational challenge.
Inspired by recent progress in the optimization of hyperparameters of deep neural networks [14], we explore several Bayesian optimization (BayesOpt) strategies ‡ for maximizing the likelihood of objective functions based on complex models in nuclear physics. BayesOpt originated more than 50 years ago [16], was popularized in the 1990s, see e.g. [17,18], and has since been applied in various disciplines, from selecting experiments in material and drug design [19] to tuning event generators in particle physics [20]. The central idea in BayesOpt is to construct a probabilistic surrogate model, usually a Gaussian process (GP), to capture our prior beliefs and knowledge about the objective function f(x), and to iteratively confront the surrogate with actual samples of f(x), thereby refining our posterior belief about this function. The main advantage of BayesOpt is that we can incorporate prior beliefs in a straightforward way. The disadvantage lies in the arbitrariness and uncertainty of a priori information.
In the following we will be dealing with complex models in nuclear physics. The origin of the underlying physics model and its parameters is briefly introduced in this section, while more details and relevant references are provided in Appendix B. Nucleons are made of quarks and gluons, and it is well known that the strong interaction between these fundamental particles is described in detail by quantum chromodynamics (QCD), which is part of the standard model of particle physics. It is equally well known that QCD is not perturbative in the low-energy region relevant for nuclear structure physics. This prevents a straightforward application of, e.g., perturbation theory to compute atomic nuclei starting from QCD. Instead, chiral effective field theory (χEFT) [21,22,23] is constructed as a low-energy approximation to QCD. This framework shows promising signs of being an operational approach to analyzing atomic nuclei while maintaining a firm link with the more fundamental theory. In χEFT, the long-ranged part of the nuclear interaction is described in terms of pion exchanges, while the short-ranged part is parameterized by contact interactions. The unknown parameters of this description are known as low-energy constants (LECs). It is important to realize that any realistic description of the nuclear interaction has to introduce unknown coefficients, but it has been found that χEFT is able to capture the physics with a relatively small number of parameters. Depending on the level of detail included, there are typically ∼ 10-30 LECs.
‡ We employ the BayesOpt implementation provided through the Python package GPyOpt [15].
Clearly, if model predictions of physical observables, such as scattering cross sections, are computationally expensive, we face the challenge of optimization with a limited budget of evaluations of the objective function. For the overwhelming majority of well-developed research problems there exists no universal optimization strategy that is guaranteed to arrive at a global optimum of an objective function in a finite number of iterations. Instead, any information regarding the mathematical or computational structure of the objective function, perhaps guided by the physical nature of the underlying problem, should play an important role in the choice or design of the optimization algorithm.
In this paper, we systematically study the application of BayesOpt to optimization problems of increasing degree of complexity. The BayesOpt algorithm is presented in Sec. 2 with its main ingredients: the GP kernel and the acquisition function. The performance of different optimization algorithms can be compared using a data profile. This measure, as used in the present work, is introduced in Sec. 3. In order to benchmark the performance of BayesOpt with various settings under controlled problem conditions, we employ a selected set of six test functions in D = 2, 4, 8 dimensional parameter domains. This study is presented in Sec. 4, while the test functions themselves are listed in Appendix A. The main focus of this work is found in Sec. 5 with the application of BayesOpt to a real nuclear physics problem. We use BayesOpt to optimize the 12 LECs appearing at next-to-next-to-leading order (NNLO) in χEFT, using the proton-neutron scattering data in the Granada database [24] with laboratory energies of the incident proton beam below 75 MeV. This case is complex enough to constitute a non-trivial problem from a physics as well as an optimization perspective. However, it is still computationally straightforward such that we can easily afford a detailed analysis of 12 different BayesOpt algorithms. Indeed, each evaluation of the objective function at a specific point in the parameter domain only takes a couple of seconds on a standard desktop computer. Still, evaluating the objective function at all 2^12 = 4,096 corners of the corresponding parameter-domain hypercube would take a couple of hours, so the premise for BayesOpt is well justified. We will compare the optimization performance of BayesOpt with that of the POUNDERs algorithm [25]. This is a simulation-based optimization algorithm that has been successfully applied in non-linear least-squares optimization in nuclear physics before [26,27]. Finally, we end with a summary and outlook in Sec. 6.

Bayesian optimization
Without any loss of generality, we will frame the determination of the LECs in χEFT as a minimization problem. Global minimization of a function f : R^D → R, with input parameters x that are perhaps subject to some constraints c(x) ≤ 0 and typically belong to a compact input domain X ⊂ R^D, is a long-standing and central problem in science. Here, we also specialize the formalism to scalar-valued functions f. Mathematically, we are trying to find a global minimizer x*, i.e. a point that fulfills f(x*) ≤ f(x) for all x ∈ X. In general, this is an intractable problem unless we have detailed information about f or the parameter domain only contains a finite number of points. In reality, we are trying to find local minimizers of f(x), i.e. points x* for which f(x*) ≤ f(x) for all x ∈ X close to x*. A substantial complication arises if f(x) is computationally expensive to evaluate. Then it becomes even more important to adopt a well-founded strategy, based on all present knowledge about f(x), for carefully selecting a dataset of n sequential function queries, such that we increase our chances of a rapid convergence towards x*. A BayesOpt framework can provide this. Moreover, at each iteration BayesOpt will consider all insofar collected data points and thereby take full advantage of the history of the optimization run. Note that we refer to a set of function evaluations as data. This should not be confused with experimental data. There are two main components in a BayesOpt algorithm:
• A prior probabilistic belief p(f|D) for the function f given some data D. The prior is often a GP. This is updated in every iteration.
• An acquisition function A(x|D) given some data D, i.e. a heuristic that balances exploration against exploitation and determines where to evaluate the objective function f (x) next.
The next iterate, x_{i+1}, is selected where we expect to find the minimizer x*, based on some utility function. Below, we will define two different acquisition functions A(x|D), and show how to embed them in an iterative context for selecting sample points x_i. In the following we will also drop the explicit data dependence in the notation for the acquisition function and only write A(x). A pseudo-code for BayesOpt is listed in Algorithm 1 and a pictorial exposition of a handful of BayesOpt iterations of a simple univariate function is provided in Fig. 1.
Algorithm 1 (BayesOpt pseudo-code):
1: select k initial points x_1, . . . , x_k in the parameter domain X
2: evaluate the objective function f(x) to obtain y_i = f(x_i) for i = 1, . . . , k
3: initialize a data vector D_k = {(x_1, y_1), (x_2, y_2), . . . , (x_k, y_k)}
4: select a statistical model for f(x)
5: for n = k + 1, k + 2, . . .: select x_n by optimizing the acquisition function, evaluate y_n = f(x_n), augment the data vector to D_n, and update the statistical model

A GP is specified by a mean function µ(x) and a covariance function (kernel) k(x, x′) = E[(f(x) − µ(x))(f(x′) − µ(x′))] for all (x, x′) ∈ X. Any real-valued function µ(·) is permissible, but for k(·, ·) the corresponding covariance matrix K must be positive semidefinite. A GP is one example of a stochastic process that is very useful in statistical modeling [28]. In brief, it is a collection of function evaluations y_{1:n} at x_{1:n}, with mean µ_0 (often shifted to zero), that are jointly Gaussian, i.e.

y_{1:n} ∼ N(µ_0, K),
where y ∼ N(µ, σ²) denotes a normally distributed random variable y with mean µ and covariance σ². After conditioning this prior, with mean µ_0, on some data D_n = {(x_1, y_1), . . . , (x_n, y_n)}, we obtain the mean and covariance of the GP model according to

µ_n(x) = µ_0 + k(x, x_{1:n}) K⁻¹ (y_{1:n} − µ_0),
σ_n²(x) = k(x, x) − k(x, x_{1:n}) K⁻¹ k(x_{1:n}, x).

Figure 1. Two initial function evaluations f(x = 0.05) and f(x = 0.9) (filled markers) were randomly selected. In each iteration the next evaluation point (dashed gray line) in the parameter domain is determined from the maximum of the expected improvement acquisition function. For each iteration, the corresponding value of f(x) is indicated with an empty marker. The mean (green line) and 95% confidence interval (green region) of a GP with a squared-exponential kernel sequentially approach f(x). After iteration 3, the algorithm leaves the rightmost x-domain, and the associated local minimum, to explore the region containing the global minimum.
These explicit expressions follow from the fact that the conditional distribution of a multivariate Gaussian is also Gaussian. The mean and variance of the prior, i.e. the GP, for the schematic example in Fig. 1 are indicated with a green line and a green shaded band, respectively. In this work we will consider three common types of covariance functions, also referred to as kernels: the Matern 3/2, the Matern 5/2, and the squared exponential, each parameterized by a set of hyperparameters θ. The correlation lengths (θ_1, . . . , θ_D) indicate how far you need to move (along a particular direction in the parameter domain) for the function values to become uncorrelated. With automatic relevance determination (ARD), we optimize the vector of correlation lengths separately in each direction of the function domain X ⊂ R^D. Without ARD, the kernel hyperparameters are isotropic. In this work, we extract the hyperparameters using maximum likelihood estimation.
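As an illustration, the following minimal Python sketch (assuming the GPy package that underlies GPyOpt [15]; the variable names and toy data are ours) constructs the three kernels with ARD and fits their hyperparameters by maximizing the marginal likelihood of a small set of function evaluations.

# Minimal sketch (assumed GPy API): build the three kernels used in this work
# and estimate their hyperparameters by maximum (marginal) likelihood.
import numpy as np
import GPy

D = 2                                               # dimension of the parameter domain
X = np.random.uniform(-1, 1, (8, D))                # 8 sampled points x_1..x_8 (illustrative)
Y = np.sum(np.sin(3 * X), axis=1, keepdims=True)    # toy objective values y_i

kernels = {
    "Matern 3/2": GPy.kern.Matern32(input_dim=D, ARD=True),
    "Matern 5/2": GPy.kern.Matern52(input_dim=D, ARD=True),
    "Squared exponential": GPy.kern.RBF(input_dim=D, ARD=True),
}

for name, kern in kernels.items():
    model = GPy.models.GPRegression(X, Y, kern)
    model.optimize()                                # maximum-likelihood fit of variance and lengthscales
    print(name, kern.lengthscale.values)            # one correlation length per direction (ARD)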
The characteristic features of the GPs based on the three kernels listed above are illustrated in Fig. 2. Their smoothness, equivalent to the typical correlation length, is one of the key differences between them. This feature of the GP kernel affects the prior modeling of the objective function, and as such the ensuing performance of BayesOpt. We will see this clearly in the analyses in Secs. 4-5.

The acquisition function
The acquisition function determines where the most likely improvement over the currently best minimizer in the parameter domain is to be found. In Fig. 1 the mean of our posterior belief (green line) about the unknown values of the objective function f (black line) is sequentially augmented with one new data point (black dot) in each iteration. The best candidate for further minimizing f at iteration n is the parameter that maximizes the acquisition function (red curve). Although the acquisition function is also optimized in X ⊂ R^D, it is significantly faster to evaluate, compared to the underlying objective function f(x), since it only relies on draws from the prior GP. Still, the complexity of maximizing A(x) increases as we increase the dimensionality of the parameter domain X. This aspect should not be underestimated, and it is in fact one of the main challenges with BayesOpt. Another challenge, although unlikely, could emerge if the set of collected data points D_{1:n} becomes very large. Indeed, in each iteration the evaluation of the GP requires an inversion of an n × n matrix, and the complexity of that operation naively scales as O(n³). Cholesky decomposition reduces this cost somewhat to O(n³/6). In reality, however, this is rarely a limiting factor since we typically resort to BayesOpt when only a small number of function evaluations can be carried out in the first place.
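To make the cost argument concrete, the following NumPy sketch (our own illustration, not code from Ref. [15]) conditions a GP on n observations using a Cholesky factorization of the n × n covariance matrix instead of an explicit inverse.

# Sketch: GP posterior mean/variance at test points via Cholesky factorization.
# Solving with the Cholesky factor replaces the explicit O(n^3) inverse of K.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sqexp(A, B, sigma2=1.0, ell=0.3):
    """Squared-exponential kernel matrix between point sets A and B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sigma2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(Xn, yn, Xstar, noise=1e-8):
    K = sqexp(Xn, Xn) + noise * np.eye(len(Xn))       # n-by-n covariance matrix
    L = cho_factor(K)                                 # Cholesky factorization of K
    ks = sqexp(Xstar, Xn)                             # cross covariances k(x*, x_1:n)
    mean = ks @ cho_solve(L, yn)                      # mu_n(x*) with mu_0 = 0
    var = sqexp(Xstar, Xstar).diagonal() - np.sum(ks * cho_solve(L, ks.T).T, axis=1)
    return mean, var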
A very appealing feature of BayesOpt is its ability to select a new point x_n in a region of X where the prior model of f is exhibiting a large uncertainty. This means that the algorithm can be rather explorative and therefore escape a local minimum of the objective function f. Depending on the details of the acquisition function, the explorative nature is balanced with a certain degree of exploitation, i.e. evaluating points in the parameter domain where the prior model for f is exhibiting a low mean value. To study the exploration-exploitation balance we will consider two of the most common acquisition functions: the expected improvement (EI) and the lower confidence-bound (LCB). In the following, we denote by f_min the insofar lowest recorded value of f(x).

Figure 2. . . . , and full probabilistic model (right column) using the three kernels we employ in this paper: Matern 3/2 (first row), Matern 5/2 (middle row), and squared exponential (bottom row). For each kernel, the priors are confronted with two identical data points (black dots) and the two corresponding hyperparameters, covariance and correlation length, are optimized to maximize the marginal likelihood of the data. The resulting correlation lengths for each kernel are provided in the middle column. Clearly, the Matern 3/2, Matern 5/2, and squared exponential kernels are increasingly smooth.
The expected improvement acquisition function is defined as the expectation value of the rectifier max(0, f_min − f(x)), i.e. we reward any expected reduction of f in proportion to the reduction f_min − f(x). This expectation can be evaluated analytically,

A_EI(x) = E[max(0, f_min − f(x))] = (f_min − µ(x)) Φ(z) + σ(x) φ(z) = σ [zΦ(z) + φ(z)],

where we dropped the explicit dependence on x in the third step, and the standard normal cumulative distribution function and probability density are denoted Φ and φ, respectively. In the last step we write the result in terms of the standard normal variable z = (f_min − µ)/σ. BayesOpt will exploit regions of expected improvement when the term zΦ(z) dominates, while new, unknown regions will be explored when the second term φ(z) dominates. For the expected improvement acquisition function, the exploration-exploitation balance is entirely determined by the set of observed data D_{1:n} and the GP kernel.
The lower confidence-bound acquisition function introduces an additional parameter β that explicitly sets the level of exploration,

A_LCB(x) = −µ(x) + β σ(x).

The maximum of this acquisition function occurs at the maximum of the β-enlarged confidence envelope of the GP. We use β = 2, which is a very common setting. Larger values of β lead to even more explorative BayesOpt algorithms.
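As a concrete illustration of the two acquisition functions (our own sketch; the posterior mean mu and standard deviation sigma are assumed to come from the GP surrogate):

# Sketch: expected improvement (EI) and lower confidence-bound (LCB) acquisitions,
# evaluated from the GP posterior mean mu(x) and standard deviation sigma(x).
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_min):
    z = (f_min - mu) / sigma                    # standard normal variable
    return sigma * (z * norm.cdf(z) + norm.pdf(z))   # sigma*[z*Phi(z) + phi(z)]

def lower_confidence_bound(mu, sigma, beta=2.0):
    return -mu + beta * sigma                   # largest where the beta-enlarged envelope dips lowest

# The next evaluation point is the maximizer of the acquisition over a set of candidates:
# x_next = candidates[np.argmax(expected_improvement(mu, sigma, f_min))]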
Besides being derivative-free (although derivatives can in fact be incorporated [29]), it is in many ways the explorative nature of BayesOpt that is most attractive. This feature is most easily observed in the optimization of a rather complex two-dimensional function with several local minima, such as the Langermann function. In Fig. 3 we show the sequence {(x_i, y_i)}_{i=0}^{51} of 50 evaluation points following two initial points of BayesOpt with a Matern 3/2 kernel and the expected improvement acquisition function without ARD. It should be made clear that, given a limited computational budget, the success of BayesOpt hinges on the choice of kernel and acquisition function. In the example above with the Langermann function, the non-smooth nature of the Matern 3/2 kernel is advantageous compared to, e.g., the squared exponential kernel. The exploration-exploitation balance also affects the success rate of BayesOpt. To learn more about BayesOpt, it is obviously instructive to benchmark and compare different BayesOpt algorithms using well-known test functions. For this purpose we first need to select a performance measure for optimization algorithms.
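A run like the one shown in Fig. 3 can be set up with a few lines of GPyOpt [15]; the sketch below assumes GPyOpt's BayesianOptimization interface, and the objective, domain, and settings are illustrative stand-ins (a simple 2D surrogate rather than the Langermann function itself).

# Sketch (assumed GPyOpt interface): BayesOpt of a 2D function with a
# Matern 3/2 kernel, expected improvement, and 2 initial points + 50 iterations.
import numpy as np
import GPy
import GPyOpt

def objective(x):                        # stand-in for a 2D test function
    x = np.atleast_2d(x)
    return np.sum(np.sin(3 * x) + 0.1 * x ** 2, axis=1, keepdims=True)

domain = [{"name": "x1", "type": "continuous", "domain": (0.0, 10.0)},
          {"name": "x2", "type": "continuous", "domain": (0.0, 10.0)}]

bo = GPyOpt.methods.BayesianOptimization(
    f=objective,
    domain=domain,
    kernel=GPy.kern.Matern32(input_dim=2, ARD=False),
    acquisition_type="EI",               # expected improvement
    initial_design_numdata=2,            # two initial function evaluations
    exact_feval=True)

bo.run_optimization(max_iter=50)         # 50 additional evaluations of f
print(bo.x_opt, bo.fx_opt)               # best point and function value found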

Measuring the performance of optimization algorithms
To analyze the performance of derivative-free optimization algorithms we follow Ref. [30] and define a data profile

d_s(α) = (1/|P|) |{p ∈ P : t_{s,p} ≤ α}|.

The data profile enables direct comparison between a set of optimization algorithms S, all of which are applied to a set of well-defined optimization problems P. For each (s, p) ∈ S × P, the performance measure t_{s,p} > 0 denotes the number of function evaluations that are required for optimization algorithm s applied to a problem p to satisfy some convergence criterion. Thus, d_s(α) is the fraction of problems that can be solved within α function calls. The performance measure can be further normalized by D_p + 1, where D_p denotes the dimensionality of the parameter domain in problem p. This is an attempt to account for some of the complexity due to a larger number of parameters in the problem. This dimensional scaling is only approximate and motivated by the D_p + 1 function evaluations required to construct a simplex in D_p dimensions, i.e. the generalization of a triangle of function evaluations. Each combination of starting point and objective function (plus other possible constraints) constitutes a separate problem p. The data profile is a monotonically increasing function between zero and one, and a large value of d_s(α) for small values of α is better. In line with Ref. [30] we employ the convergence criterion

f(x) ≤ f_L + τ (f(x_0) − f_L),   (8)

which is fulfilled for any x where the initial function value f(x_0) has been reduced by at least (1 − τ) times the best possible reduction f(x_0) − f_L. We will set f_L to be the lowest value of f achieved by any solver s ∈ S. Although we will come close to the true solution x* in a few cases, it is highly unlikely that any derivative-free solver will arrive at f(x*). For that, one would typically have to resort to gradient-based optimization algorithms. With BayesOpt, we will consider a 90% reduction in the function value as a sign of convergence, i.e. set τ = 0.1. For the purpose of comparing BayesOpt with other derivative-free optimization algorithms we will occasionally use τ = 0.01. When the objective functions are represented by known test functions (described in Sec. 4 and Appendix A), we will set f_L = f(x*). Throughout this study we will only allow a maximum of α = 250 function calls per solver and problem, and set t_{s,p} = ∞ in case the convergence criterion is not fulfilled. This upper limit on α should cover most scenarios that would call for the use of BayesOpt.
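The data profile itself is straightforward to compute once the performance measures t_{s,p} have been collected; a minimal sketch (our own helper functions, not code from Ref. [30]) reads:

# Sketch: data profile d_s(alpha) from the performance measures t[s, p],
# where t[s, p] is the number of function evaluations solver s needed on
# problem p (np.inf if the convergence criterion was never met).
import numpy as np

def converged(f_history, f0, f_L, tau=0.1):
    """First evaluation count fulfilling f(x) <= f_L + tau*(f(x0) - f_L), else inf."""
    target = f_L + tau * (f0 - f_L)
    hits = np.nonzero(np.asarray(f_history) <= target)[0]
    return hits[0] + 1 if hits.size else np.inf

def data_profile(t, alphas):
    """Fraction of problems solved within alpha evaluations, for each solver."""
    t = np.asarray(t, dtype=float)               # shape (n_solvers, n_problems)
    return np.array([[np.mean(row <= a) for a in alphas] for row in t])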

Selecting parameter starting points in R^D
A multi-dimensional parameter domain X ⊂ R^D of the objective function leads to the increasingly likely existence of multiple points corresponding to local and/or global minimizers. In reality, the choice of initial point, i.e. where we start the optimization run, will determine the local minimum that the optimizer converges to. When benchmarking, it is therefore necessary to start each solver s from N > 1 starting points in X. In this work, we use an identical set of N = 2D starting points for all optimizers in a D-dimensional domain. This is somewhat motivated by Wendel's theorem [31], which states that for N random points on the D-sphere, the probability P(N, D) that they all lie on some hemisphere is given by

P(N, D) = 2^{−N+1} Σ_{k=0}^{D−1} (N−1 choose k).

Thus, for N = 2D random points on the D-sphere, P(2D, D) = 0.5. Although we do not distribute points on a hypersphere, we use this to motivate a rule of thumb. It should be pointed out that some authors argue for a slightly more expensive rule of thumb, selecting N = 10D starting points when initializing computer simulations in R^D [32]. A priori, we do not distinguish between different parts of the parameter domain, so we select the N starting points using a space-filling algorithm in the form of a quasi-random Sobol sequence. We employ a Python implementation of this algorithm. The mathematical underpinnings of the Sobol sequence are provided in the original paper [33] and a discussion related to the numerical implementation is given in e.g. Ref. [34]. This is a so-called low-discrepancy sequence, which in fact is the opposite of a random sequence. It is designed to generate each successive sample point as far away as possible from all existing sample points. This tends to sample the space more uniformly than pseudo-random numbers, at least for lower-dimensional domains. We will not go beyond D = 12 in any part of this work. Although the Sobol sequence can exhibit gaps in multi-dimensional spaces, it has several advantages. In addition to fast generation, a Sobol sequence is straightforward to augment with additional sample points. We remind the reader that Latin hypercube sampling (LHS) [35], which is a different kind of algorithm for generating space-filling samples, is in most cases not easy to augment while preserving the Latin hypercube structure. The space-filling differences between Sobol, LHS, and conventional pseudo-random numbers in R² are illustrated in Fig. 4.

Figure 4. Space-filling samples in R² generated with a Sobol sequence, Latin hypercube sampling, and Mersenne twister pseudo-random numbers.
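For example, the quasi-Monte Carlo module of SciPy (one possible Python implementation; the paper does not specify which one was used, and the bounds below are illustrative) generates N = 2D Sobol starting points in a box-shaped domain as follows.

# Sketch: N = 2D Sobol starting points in a D-dimensional box [lower, upper].
import numpy as np
from scipy.stats import qmc

D = 12
lower = -5.0 * np.ones(D)               # illustrative bounds, not the physics domain
upper = 5.0 * np.ones(D)

sobol = qmc.Sobol(d=D, scramble=False)  # low-discrepancy sequence in [0, 1]^D
unit_points = sobol.random(n=2 * D)     # easy to augment later with sobol.random(...)
starting_points = qmc.scale(unit_points, lower, upper)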

Analyzing a set of test functions
Before we tackle an optimization problem in nuclear physics, we will explore and benchmark the performance of BayesOpt on a set of test functions. To this end, we have selected a set of six multivariate and continuous functions f : R^D → R, each defined on some domain X ⊂ R^D. The functions are defined for any D > 0, but we will only consider D = 2, 4, 8. The set of test functions spans a range of spatial characteristics. Two-dimensional graphical representations are shown in Figs. 5 and 6, with explicit expressions given in Appendix A. A comparison of two or more optimization algorithms on a finite set of test functions is neither fair nor conclusive. Indeed, although one algorithm (or class of methods) may appear to be more successful in finding optima, it is clear from the no-free-lunch theorem in optimization that the averaged performance of any two algorithms on the set of all possible functions will be the same. Here, we merely set out to compare and analyze the performance of BayesOpt on a limited set of characteristically different continuous test functions.
We now turn our attention to the resulting data profiles when applying BayesOpt to find the minimizer of each of these test functions with parameter domains in D = 2, 4, 8 dimensions. In total, we analyze 12 BayesOpt algorithms composed by combining three kernels, two acquisition functions, and the use or non-use of ARD. We use a Sobol sequence to initiate each BayesOpt algorithm at N = 2D different starting points in each parameter domain in R^D. Remember that we refer to each specific combination of starting point and test function as a problem. Combined with the different solvers (optimization algorithms with specific settings), this is a rather large dataset and we analyze it from several different angles. The data profiles for different versions of BayesOpt applied to all six test functions in D = 2, see Fig. 7 (top row), indicate that d(50) ∼ 0.7, i.e. that ∼ 70% of the problems are converged at the τ = 0.1 level within 50 function evaluations. The performances of the expected improvement and the lower confidence-bound acquisition functions are very similar, although expected improvement exhibits a slightly better and more coherent performance across all GP kernels, and at 150 function evaluations more than 80% of the test-function problems in D = 2 are converged. ARD, i.e. learning each individual length-scale hyperparameter in the GP kernel, is often an efficient way of pruning irrelevant features. We find that exploiting ARD increases the performance of BayesOpt once data from a sufficient number of function evaluations has informed the GP kernel about the possible anisotropic structure of the objective function. This becomes even clearer if we enforce a stronger convergence criterion with τ = 0.01, see Fig. 7 (bottom row). There seems to be a slight advantage of using ARD with the expected improvement acquisition function.
In D = 2, and for this set of functions, the Matern kernels perform slightly better than the squared exponential. In general, the Matern kernels are better tailored to non-smooth objectives. This becomes even clearer when we study the performance of all BayesOpt algorithms on the Ackley function (one hole on a semi-flat surface with several periodic shallow local minima) in D = 2, 4, 8 dimensions, see Fig. 8. From the characteristics of the data profiles shown in Fig. 8, it is obvious that the exploratory nature of the lower confidence-bound acquisition function is highly advantageous for finding the global minimum hiding on a surface covered with local minima. Clearly, it is important to tailor the BayesOpt acquisition function and underlying GP kernels to the spatial structure of the objective function. This result also reflects the two fundamental and competing aspects of BayesOpt. If we incorporate prior beliefs, or even knowledge, about the objective function, then BayesOpt will perform rather well. On the other hand, the usefulness of BayesOpt will consequently be limited by the arbitrariness and uncertainty of a priori information. This is further complicated by the fact that we typically resort to BayesOpt when we know very little about the objective function in the first place, since it is computationally expensive to evaluate.

Data profiles for the test functions as we increase the dimensionality of the parameter domain from D = 2 (in Fig. 7) to D = 4 and D = 8 are shown in Fig. 9. We can conclude that a larger number of parameters provides a significant challenge to BayesOpt, as for all optimization algorithms. Indeed, in D = 4 dimensions it takes more than α = 150 function evaluations to reach a data profile value of d ∼ 0.5. The more exploratory lower confidence-bound acquisition function exhibits a slightly larger performance spread with respect to different GP kernels. This becomes even more prominent as we increase the dimensionality of the parameter domain to D = 8. For the set of test functions that we have employed, it is marginally advantageous to be more exploratory for higher-dimensional objective functions. We also note that the potential benefit of ARD requires data from more than α = 250 function evaluations. This is natural since ARD introduces more hyperparameters that need to be determined. For all problems in D = 4, 8 dimensions, BayesOpt with Matern kernels converges faster than BayesOpt with a squared exponential kernel. As above, the squared exponential kernel does not capture the high-frequency modes that are present in some of the functions.

Bayesian optimization of the nucleon-nucleon interaction
An EFT of the nuclear interaction essentially corresponds to a low-energy parameterization of QCD in a fashion that is consistent with the symmetries of the more fundamental theory. To devise a true EFT description of the nuclear interaction is a major intellectual challenge in ab initio nuclear theory, and several avenues are currently being explored. The physical underpinnings of the objective function we seek to minimize are described in somewhat more detail in Appendix B. Regarding the optimization of the LECs, the inclusion of calibration data from more complex atomic nuclei and low-energy nuclear processes, e.g. NNN scattering, in the objective function is currently being considered in the community. Such extensions of the calibration data clearly increase the information content, but do so at the expense of an increased complexity in the computer modeling.
A specific aim of this work is to analyze the performance of BayesOpt for determining the parameters x in an EFT model of the NN interaction. For this, we use the proton-neutron sector of an EFT at NNLO with 12 parameters and try to find the vector of model parameters x that is in agreement with existing experimental data on proton-neutron scattering cross sections. This type of observable is a measure of the probability for an incident particle (a proton) with a given kinetic energy to scatter into a certain solid-angle element due to the mutual interaction with the target particle (a neutron §). We have deliberately chosen this class of calibration data since it does not render particularly challenging model evaluations. One evaluation of the full objective function f(x) takes merely a couple of seconds on a standard desktop. Still, the complexity of the physical model provides a non-trivial testing ground for assessing nascent applications of BayesOpt in ab initio nuclear physics.
The experimental dataset is composed of N_G groups of measurements, where the g-th group consists of N_g measured cross sections with associated random measurement uncertainties, O^experiment_{g,i} ± σ_{g,i} for i = 1, . . . , N_g, together with a common normalization constant ν_g and corresponding experimental systematic error σ_{g,0}. We employ the measurement errors as reported by each experimenter. Each group of data originates from a specific experiment. We restrict the inclusion of experimental data in the present case to laboratory scattering energies below 75 MeV. This ensures a rather fast (seconds) evaluation of the objective function and therefore enables a more detailed analysis of BayesOpt. Experimental errors across measurement groups are considered independent and identically distributed. The normalization constant, together with its uncertainty, represents the systematic uncertainty of the measurement. For an absolute measurement, the normalization is given by ν_g = 1 ± 0. Usually this means that the statistical and systematic errors have been combined into the quoted point errors σ_{g,i}. Certain experiments are not normalized at all. Instead, only the angular or energy dependence of the cross section was determined. For these groups of data, ν_g is solved for in the optimization by minimizing the discrepancy between the model predictions O^model_{g,i}(x) and the experimental data points O^experiment_{g,i}. For such freely renormalized data, the optimal ν_g is easily obtained in closed form. For practical purposes, the normalization error can be considered infinite in these cases and will therefore not contribute to the objective function.
§ Since free neutrons decay within ∼ 10 minutes, the neutron target is a composite material containing neutrons.
In summary, we seek to find the parameter vector x that minimizes the deviation between the model and the experimental data, as measured by the objective function f defined as

f(x) = Σ_{g=1}^{N_G} [ Σ_{i=1}^{N_g} ( (O^model_{g,i}(x) − ν_g O^experiment_{g,i}) / σ_{g,i} )² + ( (1 − ν_g) / σ_{g,0} )² ].   (10)

This type of (non-linear) least-squares objective function, where we assume a normal model for the errors, appears naturally in most parameter estimation problems. In a setting defined by χEFT it would be natural to also incorporate a theoretical model discrepancy term motivated by the low-energy EFT expansion itself [5,8,9,36,37]. However, in this paper we focus on the challenges of mathematical optimization that are associated with a numerically costly objective function, and for simplicity therefore only incorporate the experimental errors of the data.
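To illustrate the structure of Eq. 10, the following sketch evaluates the objective for a set of measurement groups; the function names, data layout, and handling of the normalization constant are our own illustrative choices and not the actual implementation used in this work.

# Sketch (illustrative layout, not the production code): least-squares objective
# over groups of cross-section data with per-group normalization constants nu_g.
import numpy as np

def objective(x, groups, model):
    """groups: list of dicts with keys 'obs', 'err', and 'norm_err'
    (norm_err = None for freely renormalized data); optionally 'norm' (nu_g).
    model(x, group) returns the predicted cross sections for one group."""
    f = 0.0
    for g in groups:
        theory = model(x, g)
        w = 1.0 / g["err"] ** 2
        if g["norm_err"] is None:
            # freely renormalized data: closed-form nu_g minimizing the residuals;
            # the (infinite) normalization error gives no penalty term
            nu = np.sum(w * theory * g["obs"]) / np.sum(w * g["obs"] ** 2)
        else:
            nu = g.get("norm", 1.0)                      # nu_g = 1 for absolute data
            f += ((1.0 - nu) / g["norm_err"]) ** 2       # systematic-error penalty
        f += np.sum(w * (theory - nu * g["obs"]) ** 2)   # point-by-point residuals
    return f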
We define three different parameter domains for minimizing the objective function: small, medium, and large, see Tab. 1. They differ by the level of included prior knowledge regarding the range of plausible parameter values. The limits of the large parameter domain are based on the most naive estimate without any significant prior information. In contrast, the limits of the medium domain are partly informed by prior data. Specifically, the three parameters (LECs c_1, c_3, c_4) associated with the long-range part of the nuclear interaction are constrained to ranges determined from a separate analysis of pion-nucleon scattering data [38,39]. The small domain is further informed by previous experience of typical values for the LECs in the short-ranged contact potential.
The data profiles of the BayesOpt algorithms applied to all domains of the nuclear physics objective function in Eq. 10 are plotted in Fig. 10. Clearly, the expected improvement acquisition function performs slightly better than the lower confidence-bound acquisition function. It is difficult to draw any conclusions regarding an optimal choice of kernel. Compared to the test functions, the squared exponential kernel seems to work a little bit better than the Matern 5/2 kernel. This result is also somewhat surprising since the performance of the squared exponential kernel is nearly identical to that of the Matern 3/2 kernel. When we analyze the performance of BayesOpt in each of the three parameter domains separately (small, medium, large), see Fig. 11, we note that the medium and large domains stand out. In the medium domain, the expected improvement acquisition function with the Matern 3/2 ARD kernel achieves a 50% performance already after 50 iterations. This is the top-performing BayesOpt algorithm for the nuclear physics objective function. In the small and large domains the Matern 3/2 kernel performs best without ARD. Since BayesOpt works best with a sensible prior, this suggests rather short correlation lengths. In the large domain, the lower confidence-bound acquisition function shows a tendency to perform better than expected improvement. This is perhaps not too surprising since the parameter domain is large enough to benefit from substantial exploration. Clearly, the ARD kernels require much more data in larger spaces in order to prune irrelevant features of the objective function.

Table 1. Three different parameter domains (small, medium, large) of the 12 parameters (LECs) in the nuclear physics objective function we study here. The LECs govern the short-range contact potential (C_·) at leading order (unit: 10^4 GeV^−2), the short-range contact potential (C_·) at next-to-leading order (unit: 10^4 GeV^−4), and the sub-leading long-ranged pion potential (c_·) at next-to-next-to-leading order (unit: GeV^−1).

Figure 10. Data profiles for BayesOpt applied to the nuclear physics objective function in Eq. 10, over all parameter domains in Tab. 1. The expected improvement acquisition function, for all kernels except Matern 5/2, exhibits a better performance compared to the lower confidence-bound acquisition function. Improving the performance for lower confidence-bound would require further function evaluations.

Comparison with the POUNDERs algorithm
To facilitate a benchmark and further analysis of the strengths and weaknesses of BayesOpt, we have also optimized the nuclear physics objective function using the POUNDERs (Practical Optimization Using No Derivatives for sums of Squares) algorithm [25]. This is a well-tested derivative-free algorithm for optimizing non-linear least-squares problems. Indeed, it has been applied successfully in basic nuclear physics research for almost a decade [26,27]. The key feature of POUNDERs is that it assumes a least-squares structure of the objective function, i.e. that it consists of a sum of squared residuals R_i(x),

f(x) = Σ_i R_i(x)².   (11)

Tailoring an optimization algorithm to exploit this mathematical structure, i.e. that each term is a squared function of the parameters x, is very fruitful. A quadratic model for an objective function f at the current iterate x_k,

m_k(x_k + s) = f(x_k) + g_k^T s + (1/2) s^T H_k s,

with g_k = ∇f(x_k) and H_k = ∇²f(x_k), is a very common choice. If the corresponding derivatives are known, the subproblem of minimizing m_k can be solved quite efficiently. However, the derivatives ∇f and ∇²f are considered unavailable for the present problems, and only a set of function values f(y^(j)), and residual values R_i(y^(j)), for some set Y = {y^(1), y^(2), . . . , y^(n)} can be accessed. POUNDERs sets up a quadratic model for each residual i in Eq. 11, centered around the current iterate x_k ∈ X ⊂ R^D. The model for each residual i is defined by the parameters c^(i) ∈ R, g^(i) ∈ R^D, and H^(i) ∈ R^(D×D). These model parameters are determined by solving an interpolation problem in R^D, see Ref. [25] and references therein. Once the model parameters are obtained, they can be used to approximate the derivatives of the objective function. In principle, the first- and second-order derivatives of f(x) with respect to x are given by 2 Σ_i ∇R_i(x) R_i(x) and 2 Σ_i [∇R_i(x) ∇R_i(x)^T + R_i(x) ∇²R_i(x)], respectively. Consequently, POUNDERs sets up a master model M_k for the objective function. It should be noted that the master model itself does not interpolate the objective over the interpolation set Y.
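The way the residual models provide approximate derivatives can be sketched as follows; this is a generic Gauss-Newton-type construction, not the POUNDERs source code, and the residual values, gradients, and Hessians are assumed to come from the interpolation models described above.

# Sketch: approximate gradient and Hessian of f(x) = sum_i R_i(x)^2 from
# per-residual model data (values R_i, gradients g_i, Hessians H_i at x_k).
import numpy as np

def objective_derivatives(R, G, H):
    """R: (m,) residual values, G: (m, D) residual gradients, H: (m, D, D) residual Hessians."""
    grad = 2.0 * G.T @ R                                   # 2 * sum_i g_i R_i
    hess = 2.0 * (G.T @ G + np.tensordot(R, H, axes=1))    # 2 * sum_i (g_i g_i^T + R_i H_i)
    return grad, hess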
POUNDERs is a trust-region method. The master model M_k is believed to approximate f in a neighborhood B_k of the current iterate x_k, where

B_k = {x ∈ R^D : ||x − x_k|| ≤ δ_k}.

The master model is therefore minimized over B_k with radius δ_k > 0. The solution s_k to min {M_k(x_k + s) : ||s|| ≤ δ_k} determines the next iterate x_{k+1}, and a new trust-region radius δ_{k+1} is determined based on how good the model prediction M_k(x_k + s_k) was, see Ref. [25] for a full specification of the algorithm. We run POUNDERs in its default mode, meaning that we only have to input x_0 ∈ R^D, M_0, some initial trust-region radius 0 < δ_0 < 0.5, and lower and upper bounds l_i, u_i of the domain X. We also scale the problem such that the bounds correspond to a unit hypercube [0, 1]^D. This scaling is not performed in BayesOpt.
In Fig. 12 we present the data profiles for POUNDERs applied to the physics objective function in the small, medium, and large domains and compare with BayesOpt. The results are only compared with the expected improvement acquisition function, since it was shown in Fig. 11 to perform significantly better than the lower confidence-bound acquisition function for this optimization problem. We remind the reader that all algorithms are initiated from an identical set of 24 starting points {x_1, x_2, . . . , x_24} for each parameter domain. As seen clearly in Fig. 12, the choice of initial trust-region radius δ_0 determines the performance of POUNDERs. Setting the initial radius too small (δ_0 ≲ 0.15) hampers the POUNDERs algorithm by trapping it in a shallow local minimum. This is not an issue with BayesOpt, which continues to improve as more and more function values extend the data vector D. As we have already noted several times, the overall performance of BayesOpt depends crucially on the prior. We also note that when POUNDERs performs well, it does so within rather few function evaluations, but halts once it is trapped in a local minimum. The advantages of BayesOpt are most prominent when optimizing over the medium domain. Regardless of kernel, BayesOpt performs rather well even with few function evaluations. In the large domain, the good performance of POUNDERs clearly indicates the usefulness of encoding prior knowledge about the mathematical structure of the objective function. There is likely some large-scale structure in the objective function that the BayesOpt kernel would benefit from learning about. Therefore, it would probably be advantageous to amend the BayesOpt prior with a polynomial regression term fitted to the first few evaluations of the objective function. BayesOpt is not intended for pinpointing the exact location of an optimum. In the neighborhood of an optimum most objective functions can be well approximated by a quadratic polynomial. For this reason, POUNDERs will always outperform BayesOpt when decreasing τ in the convergence criterion given in Eq. 8. For τ = 0.01, i.e. a convergence criterion corresponding to a 99% reduction of the objective function, BayesOpt only reaches a performance of d(α < 250) ≈ 0.15, whereas POUNDERs can approach d(α) ≈ 0.35 for an optimal choice of the initial trust-region radius δ_0, see Fig. 13.

Figure 12. Legend: initial step (trust-region radius) 0.30, 0.25, 0.20, 0.15, 0.10.

Summary and Outlook
Some of the most interesting optimization problems in nuclear physics, as well as other fields of science, typically render computationally expensive objective functions defined on multi-dimensional parameter domains. Moreover, derivatives with respect to those parameters are usually not accessible.
In this paper we explore the potential benefits of BayesOpt (Sec. 2) for efficiently exploring the parameter space of a χEFT (Appendix B) in computationally complex circumstances. A local minimum, with realistic physical properties, in this parameter domain allows for numerical simulations of atomic nuclei and therefore improves our understanding of the origin, evolution, and structure of all matter in the universe. The underlying optimization problem is therefore well known in the nuclear physics research community and several classes of numerical optimization algorithms have already been employed; derivative-based [5] as well as derivative-free approaches [27].
BayesOpt presupposes a prior on the objective function, usually in the form of a GP. The original optimization challenge is transformed into a design problem that boils down to choosing an appropriate acquisition function to facilitate an exploration-exploitation balance. This choice is encoded in a utility function that decides where to collect training data for the GP. Several choices of kernels and utility functions exist. Our initial studies of BayesOpt applied to a set of six test functions with parameter domains in R^2, R^4, and R^8 clearly demonstrate the importance of a sensible prior assumption about the objective function, see Fig. 7. From this analysis it is also clear that BayesOpt performs rather well in low-dimensional (R^2 and R^4) parameter domains. It turns out that the choice of acquisition function is even more important than the choice of GP kernel, see Fig. 8. This is something we see also when we study the data profiles of BayesOpt applied to the nuclear physics objective, see Fig. 11.
Our main findings and conclusions can be summarized as follows:
• In general, BayesOpt will never find a narrow minimum nor be useful for extracting the exact location of any optimum. For that to work, in anything but a trivial case, it is necessary to have detailed information about the objective function and to successfully encode this prior knowledge into the algorithm. This is a situation that we typically do not have, since BayesOpt is applied to computationally expensive objective functions. Instead, BayesOpt might find use as a first stage in a hierarchical optimization protocol to identify an interesting region of the parameter domain. It might also be advantageous to design an acquisition function that is more explorative during the first iterations, and then switch to an acquisition function that exploits more than it explores.
• When we compare with the POUNDERs algorithm (Sec. 5.1), a derivative-free optimization algorithm that successfully incorporates the squared-sum structure of the objective function, we find that BayesOpt in ab initio nuclear physics would probably benefit from a prior with a polynomial regression term to efficiently capture the large-scale structure of the objective function.
• We find that the choice of acquisition function is more important than the specific form of the GP kernel. For the present case, the expected improvement acquisition function performed slightly better than the lower confidence-bound in smaller parameter domains, while more exploration, as achieved with the lower confidence-bound acquisition function, was shown to be beneficial in larger domains.
• The GP kernel can be improved with ARD tuning of the hyperparameters. However, this feature is only useful if a minimum number of iterations can be afforded. In fact, the ARD kernels require significantly more data in larger spaces in order to prune irrelevant features of the objective function.
• Although we employ an objective function consisting of independent scattering cross sections, all of them with similar and low computational cost, a multi-fidelity scenario is equally probable. Consider, e.g., calibrating an EFT model of the nuclear interaction using scattering cross sections and data on bound states in multi-nucleon systems such as isotopes of oxygen and calcium. In such scenarios, where the computational cost of solving the Schrödinger equation for bound states of a nucleus with A nucleons naively grows exponentially with A, it would be interesting to study the benefits of existing BayesOpt frameworks that can maximize information gathering across multiple functions with varying degrees of accuracy and computational cost, see e.g. Ref. [40] and references therein.
minimum. This function has 3^D − 1 local minima in R^D.
Global minimum: x* = (1, . . . , 1), and f(x*) = 0. (A.4)
• Schwefel: smooth surface with several local minima and a global minimum located far away in a corner, which in turn is far away from the second-best local minimum.

Appendix B.2. Proton-neutron scattering observables
Proton-neutron elastic scattering observables are calculated from the spin-scattering matrix M [41,42]. This is a 4 × 4 matrix in spin space that operates on the initial state to give the scattered part of the final state. M is related to the conventional scattering matrix S by M = (2π/ik)(S − 1), where k is the relative momentum between the nucleons. The S-matrix for a scattering channel with total angular momentum J can be parameterized by the Stapp phase shifts, S = e^{2iδ_J} [43]. The Stapp phase shifts are calculated from the potential V(p_f, p_i) by solving the Lippmann-Schwinger equation. This equation describes quantum-mechanical scattering and is an integral equation of the Fredholm type that can be solved as a matrix equation. In our application, and for each value of the on-shell momentum k, this amounts to inverting a 200-by-200 matrix followed by a matrix-vector multiplication. The matrix inversion prevents linearizing this particular EFT model in its parameters. Although the matrix inverse is not particularly time-consuming in the present case, it should be pointed out that the complexity of the corresponding quantum-mechanical equations for describing scattering states, or bound states, of more than two nucleons typically scales exponentially with the particle number.
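The matrix structure referred to above can be sketched as follows; this is only a structural illustration (toy potential matrix, schematic propagator, arbitrary units), not the actual partial-wave solver used in this work.

# Structural sketch: the discretized Lippmann-Schwinger equation T = V + V G0 T
# becomes a linear system (1 - V G0) T = V on an N-point momentum grid, i.e.
# one N-by-N matrix inversion (or linear solve) per on-shell momentum k.
import numpy as np

N = 200
p = np.linspace(0.01, 10.0, N)                 # momentum grid (arbitrary units)
V = np.exp(-np.add.outer(p, p) ** 2)           # toy symmetric potential matrix V(p_i, p_j)

def t_matrix(k, eps=1e-3):
    G0 = np.diag(1.0 / (k ** 2 - p ** 2 + 1j * eps))   # schematic free propagator
    A = np.eye(N) - V @ G0                              # kernel of the integral equation
    return np.linalg.solve(A, V)                        # T-matrix; a matrix-vector solve
                                                        # suffices if only one column is needed

T = t_matrix(k=1.0)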