Benchmark test of black-box optimization using a D-Wave quantum annealer

In solving optimization problems, objective functions generally need to be minimized or maximized. However, for complicated problem settings, objective functions cannot always be formulated explicitly in mathematical form. Although several regression techniques can infer approximate forms of objective functions, these functions are at times expensive to evaluate. In such scenarios, optimal points of "black-box" objective functions are computed while effectively using a small number of evaluations. Recently, an efficient method that performs inference with a sparse prior for a black-box objective function with binary variables has been proposed. In this method, a surrogate model is formulated as a quadratic unconstrained binary optimization (QUBO) problem and is iteratively solved to obtain the optimal solution of the black-box objective function. In the present study, we employ the D-Wave 2000Q quantum annealer, which can solve QUBO problems by driving the binary variables with quantum fluctuations. The D-Wave 2000Q does not necessarily output the ground state at the end of its protocol, owing to a freezing effect during the process. We investigate how the output of the D-Wave quantum annealer affects the performance of black-box optimization. We demonstrate a benchmark test that employs the sparse Sherrington-Kirkpatrick (SK) model as the black-box objective function, introducing a parameter that controls the sparseness of the interaction coefficients. Comparing the results of the D-Wave quantum annealer to those of simulated annealing (SA) and semidefinite programming (SDP), we find that both the D-Wave quantum annealer and SA are superior to SDP in black-box optimization. On the other hand, we do not find any advantage of the D-Wave quantum annealer over simulated annealing. At least in our case, no effects of quantum fluctuations are found.


Introduction
Black-box optimization is a method to optimize functions that are complex, expensive to evaluate, or lack derivatives or explicit forms. Such functions appear in many problems in fields such as materials informatics, 1) machine learning, 2) and robotics. 3) A systematic way to perform black-box optimization is Bayesian optimization. 4) In this method, data points are first chosen at random to generate a training dataset for inferring the black-box objective function. A regression model is then constructed to predict the relation between the input variables and the values of the black-box objective function in the training dataset. Once the regression model is trained, an acquisition function is set up on its basis, which selects the next data point in the solution space. The optimal solution of the acquisition function is used to evaluate the black-box objective function and thereby obtain a new data point. Once this value is evaluated, the regression model is retrained with the enlarged dataset. These steps are performed iteratively to pursue the desired solution, namely the optimal point of the black-box objective function.
Bayesian optimization is applied mostly to black-box objective functions with continuous variables, because the optimization of the acquisition function is then relatively straightforward. It may be applied to black-box objective functions with discrete variables as well. However, a significant bottleneck appears in problems with discrete variables, where the resultant acquisition functions also contain discrete variables. It is generally hard to optimize acquisition functions with discrete variables; such optimization problems often belong to the NP-hard class, and solving them exactly can take an extremely long time with any known algorithm. In a previous study, Bayesian optimization of combinatorial structures (BOCS) 5) was proposed as a promising algorithm to evaluate the global minimum of black-box functions. In particular, a sparse prior was employed to perform the regression efficiently in the Bayesian inference. The acquisition function was assumed to take the form of a quadratic unconstrained binary optimization (QUBO) problem. Notably, a relaxation to semidefinite programming (SDP) was used in the optimization phase, which can attain approximate solutions in a reasonable amount of time.
Recently, D-Wave Systems Inc. developed a device 6) that physically implements quantum annealing (QA). 7) QA is a meta-heuristic for obtaining the ground states of Ising spin glasses, which belong to the class of QUBO problems, and the device is now commercially available. Because various combinatorial optimization problems can be formulated as Ising models, 8) the D-Wave device has been used to solve a multitude of practical problems in the real world. [9][10][11] The device uses niobium rings as quantum bits (qubits) with programmable local fields and programmable mutual inductance between pairs of qubits, so that it can solve QUBO problems. Solving a QUBO problem is equivalent to finding a ground state of an Ising spin glass, because binary variables can be rewritten as spin variables. The first stage of QA is initialized in the trivial ground state of a driver Hamiltonian. The quantum effect involved in the driver Hamiltonian is gradually turned off, so that at the end only the classical Hamiltonian with a nontrivial ground state remains. One of the standard choices of the driver Hamiltonian consists only of the x components of the Pauli matrices, called the transverse field. When the transverse field changes sufficiently slowly, the quantum adiabatic theorem ensures that we find the nontrivial ground state at the end of QA. [12][13][14] Numerous reports 15,16) have shown that QA outperforms simulated annealing (SA), 17) which utilizes thermal fluctuations to solve combinatorial optimization problems. In the context of machine learning, where various optimization problems are solved during training, QA has been reported to add a different kind of value to the output solution, namely generalization performance, in several studies. [18][19][20] A previous study on black-box optimization using the D-Wave device employed the factorization machine, 21) which is used in recommendation systems and can be formulated as a QUBO problem. That study focused on metamaterial design and evaluated the figure-of-merit in a metamaterial simulation. In the present study, we test the D-Wave quantum annealer in black-box optimization by use of BOCS. In particular, the D-Wave quantum annealer does not necessarily output the ground state at the end of the procedure, partly because the connectivity realized in the D-Wave 2000Q is a sparse structure called the chimera graph, in which a single qubit possesses only about six connections. To embed a desired graph expressing the structure of the problem on the chimera graph, redundant qubits forming chain structures are used to enhance the connectivity. We use a heuristic tool called minorminer 22) to embed the complete graph into the chimera graph. Since qubits in the same chain must have the same up or down direction of their magnetic moments, the interactions between them are set to be ferromagnetic. However, qubits in the same chain often end up with misaligned magnetic moments, which leaves the solution undetermined. We resolve these broken chains by a majority vote over the directions. This is one of the reasons why the performance of QA on the D-Wave 2000Q is unreliable. To achieve better performance of QA on the D-Wave 2000Q, various techniques have been proposed. [23][24][25] In addition, several techniques that avoid a large number of interactions between variables have been proposed. 26,27) The performance of the D-Wave quantum annealer is also affected by the freezing effect, which appears because quantum fluctuations become insufficient to drive the binary variables at the last stage of QA. 28)
In addition, thermal effects influence the dynamics of the spin variables nontrivially, as do quantum fluctuations. 29) Thus, the output generally deviates from the ground state, especially for hard optimization problems. Therefore, several protocols employ a non-adiabatic counterpart beyond the standard protocol of QA, [30][31][32][33] or exploit the thermal effect. 34)
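For reference, the equivalence between the QUBO and Ising formulations mentioned above is the standard change of variables between binary and spin variables (a textbook identity, not specific to the present setup):

$$\sigma_i = 2x_i - 1 \in \{-1, +1\}, \qquad \sum_{i<j} J_{ij}\,\sigma_i \sigma_j + \sum_i h_i\,\sigma_i = \mathbf{x}^{\top} Q\,\mathbf{x} + \mathrm{const},$$

where the matrix Q and the constant are determined by substituting σ_i = 2x_i − 1 and collecting terms.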

In BOCS, the fully connected Ising model is set as the acquisition function. In general, optimizing this model is a hard problem, so the solution returned by the D-Wave quantum annealer is not necessarily the ground state. The deviation from the ground state is expected to affect the performance of BOCS. We investigate the effect of the quantum device by comparing the performance of BOCS when the acquisition function is optimized by SA. It is worth noting that SA does not always yield the ground state of the acquisition function either, depending on the schedule by which the control parameter, the temperature, is decreased; likewise, in the protocol of QA, the transverse field is tuned to control the quantum fluctuation. The comparison therefore roughly demonstrates the difference between thermal and quantum fluctuations. In the present study, we mainly focus on how the performance of BOCS depends on the solver used in the optimization phase of the acquisition function. We compare the results of BOCS by SDP, which yields an approximate minimizer of the acquisition function and was proposed as the solver in the optimization phase in the original paper on BOCS, 5) to those by SA and QA. The original BOCS literature also investigated the performance when SA is employed in the optimization phase. The black-box objective functions used there were not necessarily hard to solve: they were random spin systems, including an SK model whose interactions decay with the distance measured in the indices of the spin variables, essentially a one-dimensional Ising spin glass with long-range interactions. In the original paper, the superiority of BOCS by SDP over BOCS by SA was reported.
However, solving only the problems that appeared in the previous study is not enough to assess the performance of BOCS. In the present paper, we change the problem setting to a harder one, taking the sparse SK model as the black-box objective function. In this sense, our setting is completely different from that of the original study. In addition, we utilize the D-Wave quantum annealer, and not only SA on a classical computer, to perform the optimization in BOCS, because the quantum annealer is also a candidate solver for BOCS.
In the procedure of BOCS, we have no information on the interaction strengths of the black-box objective function. BOCS iteratively finds lower-energy states of the acquisition function as candidates for the optimal solution of the black-box objective function, while the coefficients in the acquisition function are updated.
In order to investigate the performance of BOCS from a different perspective, we consider the case in which the form of the black-box objective function is known. Then we may perform regression to infer only the coefficients and thereby reveal the objective function. The inferred coefficients lead to a good approximation of the black-box objective function. We then optimize the resulting approximate function and attain a good estimator of the minimizer of the objective function. In a previous study, a regression method was proposed that uses only pairs of a spin configuration and the corresponding energy value. 35,36) The method works particularly well for Ising spin glasses with sparse interactions. An analytical study using a sophisticated replica method and a numerical verification revealed a relationship between the number of zero-valued coefficients and the number of data needed to reconstruct all the coefficients. If the interaction matrix is sparse, fewer data are needed than the number of coefficients. In BOCS, similarly to this regression method, we utilize the sparse prior distribution to infer the coefficients of the surrogate model, but the form of the objective function is unknown. In the present study, we set the SK model as the black-box objective function and assume that the surrogate model takes the same form as the black-box objective function. In other words, in the present case we effectively know the form of the black-box objective function.
The comparison of BOCS to regression followed by optimization clarifies the performance in inferring the sparse interactions of the black-box objective function. We measure the efficiency of BOCS for the sparse SK model in terms of the necessary number of data points, which corresponds to the number of iterations in BOCS, to find the optimal solution of the black-box objective function. Although BOCS was proposed as a black-box optimization technique for objective functions with sparse interactions, the previous study left unclear how the performance of BOCS depends on the sparseness. To address this question, we vary the sparseness of the SK models and focus on the number of data needed before finding the ground state, by investigating the success rates in finding the minimum.
The remainder of this paper is organized as follows: we explain the BOCS method in Sect. 2, we discuss the numerical experiments in Sect. 3, we compare the performances of BOCS and regression with a sparse prior in Sect. 4, and we summarize in Sect. 5.

Method
We assume that the input x is a vector of binary variables, its ith element is denoted by x_i, and N is the dimension of x. Each x provides an observation y containing a finite error σ. Our goal is to find the x that minimizes a black-box function. Since we cannot determine an explicit form of the black-box function, we employ a surrogate model and train it using the values of the black-box objective function. Any objective function of binary variables can in principle be expanded as a polynomial in the elements of x; however, the number of coefficients then grows exponentially with N, and the corresponding huge amount of data cannot be collected in practical situations. We may thus cut the polynomial off at a finite order. When the surrogate model contains terms of higher than quadratic order, we cannot optimize it efficiently in general, whereas the approximate surrogate model up to second-order terms can be solved efficiently by SDP or the D-Wave quantum annealer in the optimization phase. We thus set a quadratic model as the surrogate model:

$$f_{\alpha}(\mathbf{x}) = \alpha_0 + \mathbf{x}^{\top} Q_{\alpha}\, \mathbf{x}, \tag{1}$$

where α_0 is a real value and Q_α is an upper triangular matrix. Since x_i² = x_i for binary variables, the diagonal of Q_α carries the linear terms.
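As an illustration, the quadratic surrogate of Eq. (1) can be evaluated in a few lines; the snippet below is our own minimal sketch (assuming NumPy), not part of the BOCS reference implementation.

```python
import numpy as np

def surrogate(x, alpha0, Q_alpha):
    """Evaluate f_alpha(x) = alpha0 + x^T Q_alpha x for binary x.

    Q_alpha is upper triangular; since x_i**2 == x_i for binary inputs,
    its diagonal carries the linear coefficients.
    """
    x = np.asarray(x, dtype=float)
    return alpha0 + x @ Q_alpha @ x
```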
In this algorithm, we compute a posterior distribution over the model parameters by following the framework of Bayesian inference. By sampling from the posterior distribution, we construct an acquisition function that indicates the next data point to choose from the solution space. We find the minimizer of this acquisition function by using one of the optimization solvers, and we then evaluate the black-box function at this x to obtain a new data point. We retrain the model with the dataset including the new value of the black-box objective function. This algorithm iteratively searches for the global minimum by updating the data points. Figure 1 shows a schematic drawing of the BOCS algorithm.
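Schematically, one pass of the loop described above can be sketched as follows; the helper names sample_alpha and solve_qubo are placeholders for the posterior sampler and the optimization solver (SDP, SA, or the quantum annealer), not names from the original implementation.

```python
import numpy as np

def bocs(blackbox, X_init, y_init, n_iter, sample_alpha, solve_qubo):
    """Schematic BOCS loop with a Thompson-style acquisition."""
    X, y = list(X_init), list(y_init)
    for _ in range(n_iter):
        # Draw surrogate coefficients (alpha0, Q_alpha) from the posterior
        alpha0, Q_alpha = sample_alpha(X, y)
        # The sampled quadratic model itself acts as the acquisition function
        x_next = solve_qubo(Q_alpha)
        # One (expensive) evaluation of the black-box objective
        X.append(x_next)
        y.append(blackbox(x_next))
    best = int(np.argmin(y))
    return X[best], y[best]
```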

We assume that only a few data points are attained, because the evaluation of the objective function is expensive. We therefore consider sparse Bayesian linear regression, taking into account the observation noise σ. From observations of several data points { x^(i), y_i }_{i=1,2,...,D}, we compute a posterior distribution over α by Bayes' rule as

$$p(\boldsymbol{\alpha} \mid X, \mathbf{y}) \propto p(\mathbf{y} \mid X, \boldsymbol{\alpha})\, p(\boldsymbol{\alpha}),$$

where the matrix X ∈ R^{D×p} is constructed by stacking the feature vectors [1, x_1, ..., x_N, x_1 x_2, ..., x_{N−1} x_N] of the inputs. In addition, p = 1 + N + N(N − 1)/2, and D represents the number of data points. Then, we set a likelihood function and a prior distribution over the parameter α. The likelihood function is given by a Gaussian distribution with variance σ² as

$$p(\mathbf{y} \mid X, \boldsymbol{\alpha}) = \mathcal{N}(\mathbf{y};\, X\boldsymbol{\alpha},\, \sigma^2 I_D).$$

Since the number of elements of α is O(N²), an amount of data of O(N²) is also needed to estimate the regression parameters; otherwise, we obtain high-variance estimators.
To avoid obtaining uncertain parameters when the data are scarce or the input dimension is large, we set a prior distribution. We use the horseshoe distribution as the prior, which allows us to omit hyperparameter tuning when performing BOCS. This prior distribution is capable of efficiently inferring sparse parameters in the model even if the number of data points is small: 37)

$$\alpha_k \mid \lambda_k, \tau, \sigma \sim \mathcal{N}(0, \lambda_k^2 \tau^2 \sigma^2), \qquad \lambda_k \sim C^{+}(0, 1), \qquad \tau \sim C^{+}(0, 1),$$

where C⁺(0, 1) is the standard half-Cauchy distribution. This formulation, however, does not allow efficient sampling directly. Following Makalic and Schmidt, 38) the half-Cauchy distribution can be expressed through inverse-Gamma distributions by introducing auxiliary parameters. The inverse-Gamma distribution is a conjugate prior for the normal distribution. The modified formulation is in closed form, and we can sample the parameters from this distribution with complexity O(p³). We use a faster algorithm 39) (O(D² p)) that is exactly equivalent to the formulation with auxiliary parameters.
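For concreteness, one Gibbs sweep of the auxiliary-variable scheme can be sketched as below. This is our own condensed reading of the Makalic-Schmidt sampler (the O(p³) variant with a naive matrix inversion); the variable names are ours, and an InvGamma(a, b) draw is taken as b / Gamma(a, 1).

```python
import numpy as np

def horseshoe_gibbs_sweep(X, y, sigma2, lam2, tau2, nu, xi, rng):
    """One Gibbs sweep for y = X alpha + noise under a horseshoe prior,
    using inverse-gamma auxiliary variables nu (local) and xi (global)."""
    n, p = X.shape
    # alpha | rest ~ N(A^{-1} X^T y, sigma2 A^{-1}), A = X^T X + diag(1/(lam2 tau2))
    A = X.T @ X + np.diag(1.0 / (lam2 * tau2))
    A_inv = np.linalg.inv(A)
    alpha = rng.multivariate_normal(A_inv @ X.T @ y, sigma2 * A_inv)
    # sigma2 | rest ~ InvGamma((n + p)/2, (residual + penalty)/2)
    r = y - X @ alpha
    b_sigma = 0.5 * (r @ r + np.sum(alpha**2 / (lam2 * tau2)))
    sigma2 = b_sigma / rng.gamma((n + p) / 2.0, 1.0)
    # Local shrinkage scales and their auxiliaries
    lam2 = (1.0 / nu + alpha**2 / (2.0 * tau2 * sigma2)) / rng.gamma(1.0, 1.0, size=p)
    nu = (1.0 + 1.0 / lam2) / rng.gamma(1.0, 1.0, size=p)
    # Global shrinkage scale and its auxiliary
    b_tau = 1.0 / xi + np.sum(alpha**2 / lam2) / (2.0 * sigma2)
    tau2 = b_tau / rng.gamma((p + 1) / 2.0, 1.0)
    xi = (1.0 + 1.0 / tau2) / rng.gamma(1.0, 1.0)
    return alpha, sigma2, lam2, tau2, nu, xi
```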
As tested in the previous study, one may instead compute α_MLE by maximum likelihood estimation, or perform sparse estimation with a Laplace prior. The superiority of the sparse prior can be expected from the sharpness of the shape of the prior distribution around the origin. In this respect, the horseshoe prior has a better property for inferring sparse parameters than the Laplace distribution.

Optimization Solver
Once Q_α is fixed, we optimize the acquisition function, Eq. (1), with one of the following solvers.

Semidefinite programming

We briefly describe the application of SDP to discrete optimization problems. In general, consider the following quadratically constrained problem over y ∈ R^n:

$$\min_{\mathbf{y} \in \mathbb{R}^n} \ \mathbf{y}^{\top} C \mathbf{y} + d \quad \text{subject to} \quad \mathbf{y}^{\top} D_k \mathbf{y} = b_k, \quad k = 1, 2, \ldots, K.$$

Here y_i y_j can be regarded as the (i, j) element of a Gram matrix, whose eigenvalues are non-negative. This problem can thus be transformed into the SDP

$$\min_{X} \ \mathrm{Tr}(CX) + d \quad \text{subject to} \quad \mathrm{Tr}(D_k X) = b_k, \quad k = 1, 2, \ldots, K, \quad X \succeq 0,$$

where X ⪰ 0 denotes that X is a positive semidefinite matrix, which has non-negative eigenvalues. The resultant minimization problem is a convex optimization problem, so we readily attain its optimal solution. The minimum of this convex problem bounds that of the original problem from below, and its solution can be rounded to an approximate solution of the original problem. We use this property of SDP to attain an approximate solution of our acquisition function. The optimization of the acquisition function reads

$$\min_{\mathbf{x} \in \{0,1\}^N} \ \alpha_0 + \mathbf{x}^{\top} Q_{\alpha}\, \mathbf{x}.$$

We then replace each binary variable x_i with σ_i = 2x_i − 1, so that the minimization problem is written in terms of spin variables σ_i ∈ {−1, 1}. Finally, we relax each spin variable σ_i to a vector y_i on the (N + 1)-dimensional unit sphere, which puts the problem in the SDP form above.
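A minimal sketch of this relaxation using the cvxpy modeling library is shown below; the library choice and the random-hyperplane rounding step are our illustration, not necessarily the procedure used in the original BOCS code.

```python
import numpy as np
import cvxpy as cp

def sdp_round(C, rng):
    """Relax min_{sigma in {-1,1}^n} sigma^T C sigma to an SDP with a
    unit diagonal, then round by projecting onto a random hyperplane."""
    n = C.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    problem = cp.Problem(cp.Minimize(cp.trace(C @ X)),
                         [X >> 0, cp.diag(X) == 1])
    problem.solve()
    # Factor X = L L^T (a small jitter keeps the Cholesky stable)
    L = np.linalg.cholesky(X.value + 1e-8 * np.eye(n))
    sigma = np.sign(L @ rng.standard_normal(n))
    return ((sigma + 1) / 2).astype(int)  # map spins back to binary x
```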

D-Wave 2000Q

We also employ the D-Wave 2000Q quantum annealer as a solver. The connectivity realized in the D-Wave 2000Q is a sparse structure called the chimera graph. We use a heuristic tool called minorminer 22) to embed the complete graph into the chimera graph, in which redundant qubits are formed into chain structures to realize a larger graph connectivity. While the magnetic moments of the redundant qubits in a chain should all take the same direction, they often take different directions. To obtain a solution from these redundant qubits, the direction is decided by a majority vote. This is also a reason why the performance of QA on the D-Wave 2000Q degrades on dense graph problems. Another postprocessing method fixes broken chains by minimizing the energy; this technique often finds a lower-energy state than the majority vote. In the present study, we choose the majority vote so as to benchmark the default setting of the D-Wave 2000Q without any further improvements.
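For reference, a minimal use of minorminer to embed a complete problem graph on the Chimera topology looks as follows (a sketch assuming the networkx, dwave-networkx, and minorminer packages; the graph sizes are illustrative).

```python
import networkx as nx
import dwave_networkx as dnx
import minorminer

source = nx.complete_graph(20)    # logical problem graph (N = 20)
target = dnx.chimera_graph(16)    # Chimera C16, as in the D-Wave 2000Q
embedding = minorminer.find_embedding(source.edges, target.edges)
# Each logical variable is mapped to a chain of physical qubits
print({v: len(chain) for v, chain in embedding.items()})
```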
Although there are a few reasons why the output from the D-Wave 2000Q deviates from the ground state, the device can often yield the ground state for various types of optimization problems described as QUBO with a small number of variables.
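Submitting a QUBO through the Ocean SDK follows the pattern below (a sketch; it assumes configured solver access, and EmbeddingComposite resolves broken chains by majority vote by default).

```python
from dwave.system import DWaveSampler, EmbeddingComposite

sampler = EmbeddingComposite(DWaveSampler())
# Toy QUBO: diagonal entries are linear terms, off-diagonal are couplers
Q = {(0, 0): -1.0, (1, 1): -1.0, (0, 1): 2.0}
result = sampler.sample_qubo(Q, num_reads=1000)
print(result.first.sample, result.first.energy)
```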

Numerical Experiment
We perform numerical experiments employing the SK model as the black-box objective function. Finding the ground state of the SK model is in general an NP-hard problem, depending on the parameters. The model is described by spins σ ∈ {−1, 1}^N and interactions J_ij as

$$E(\boldsymbol{\sigma}) = \frac{1}{N} \sum_{i<j} J_{ij}\, \sigma_i \sigma_j.$$

The interaction coefficients are randomly selected as

$$J_{ij} \sim \begin{cases} \mathcal{N}(0, 1) & \text{with probability } \rho, \\ 0 & \text{with probability } 1 - \rho. \end{cases}$$

In the standard definition of the SK model, the coefficient 1/N would be 1/√N. Here, however, we introduce the parameter ρ ∈ (0, 1] to control the sparseness and set the coefficient to 1/N. If ρ is close to zero, the number of zero elements in the matrix J_ij increases, and BOCS is expected to work well because it implements the sparse prior. The performance of BOCS should be discussed in two ways. The first is inference.
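Our reading of the instance generation can be sketched as follows; the function names are ours.

```python
import numpy as np

def sparse_sk(N, rho, rng):
    """Upper-triangular couplings: J_ij ~ N(0, 1) with probability rho,
    and J_ij = 0 otherwise."""
    J = np.triu(rng.standard_normal((N, N)), k=1)
    mask = np.triu(rng.random((N, N)) < rho, k=1)
    return J * mask

def sk_energy(sigma, J):
    # E(sigma) = (1/N) * sum_{i<j} J_ij sigma_i sigma_j
    N = len(sigma)
    return (sigma @ J @ sigma) / N
```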
The value of ρ affects the performance of inference in BOCS; this is independent of the optimization solvers. The other aspect is optimization. The sparse connectivity of the Ising spin glass set as the black-box objective function affects the performance of the optimization solvers. In this study, we use SDP to quickly generate approximate solutions of the acquisition function. SA and QA on the D-Wave quantum annealer solve the acquisition function directly but do not always reach the ground state. Notice that we run SA for a fixed number of steps during the optimization phase; similarly, QA on the D-Wave quantum annealer runs only for a fixed amount of time.
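A bare-bones single-spin-flip SA for the QUBO acquisition function, with a fixed number of sweeps and a linear inverse-temperature schedule, might look like this (our sketch; the actual schedule and step counts used in the experiments are not specified here).

```python
import numpy as np

def sa_qubo(Q, n_sweeps=200, beta_min=0.1, beta_max=5.0, seed=0):
    """Single-spin-flip simulated annealing for min_x x^T Q x, x in {0,1}^n."""
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    x = rng.integers(0, 2, size=n)
    for beta in np.linspace(beta_min, beta_max, n_sweeps):
        for i in rng.permutation(n):
            dx = 1 - 2 * x[i]                      # proposed flip of bit i
            s = (Q[i, :] + Q[:, i]) @ x - 2 * Q[i, i] * x[i]
            dE = dx * (Q[i, i] + s)                # energy change of the flip
            if dE <= 0 or rng.random() < np.exp(-beta * dE):
                x[i] += dx                          # accept (Metropolis rule)
    return x
```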
In this study, we treat SK models with N = 20 spins and 10 initial data points. The comparison between the results obtained by SA and by the D-Wave quantum annealer focuses on the difference between thermal and quantum fluctuations. However, we did not find any difference between thermal and quantum fluctuations in our experimental setup.
We hereafter focus on the comparison between the results by SDP and by SA. The difference between the two cases lies mainly in the deviation from the tentative ground state of the acquisition function. We simply assume that BOCS by SDP falls into a local minimum of the acquisition function at each step, which explains why BOCS by SDP shows worse performance. In BOCS, the balance between exploration and exploitation in the Bayesian inference is important. Because the sampling technique in BOCS increases the exploratory property, inspired by Thompson sampling, BOCS by SDP sometimes approaches the optimal solution. However, SA (and D-Wave) outperforms SDP in our problem setting, owing to its better ability to attain lower-energy states of the acquisition function. In our problem setting, we employ the sparse SK model, so the acquisition function also becomes a similar spin-glass problem during the BOCS process. This is one of the reasons why SDP shows worse performance than SA (and D-Wave).
The difference between SDP and SA (and D-Wave) becomes small as the value of ρ increases. This implies that the exploratory space becomes narrower as ρ takes higher values. In other words, the solution space splits into states around many deep local minima, and thus the exploratory space cannot be sufficiently broadened even by using thermal or quantum fluctuations. The difficulty of solving such hard problems appears in the performance of BOCS.
In order to investigate the dependence of the performance of BOCS on ρ, we plot the success probability of finding the global minimum in Figure 3. Regardless of the optimization solver used, the number of iteration steps before finding the minimum value becomes large as ρ increases. The number of iteration steps is the sum of the number of data points used in BOCS and the number of duplicated queries. When we instead count the number of data points, as shown in Figure 4, we find the same dependence on ρ as in Figure 3.

Comparison with Regression
If the form of the black-box objective function is known a priori while the coefficients of its quadratic form and some parameters are unknown, one may infer only the coefficients from the regression data attained. This means that once the required number of data points has been collected at random, one can reconstruct the coefficients, and the minimum solution can then be obtained with an appropriate optimization solver. To reconstruct the coefficients, we solve the linear equations

$$\mathbf{y} = S \tilde{J},$$

where J̃ is the coefficient vector whose elements are the J_ij, S is the spin-data matrix whose ith row is the spin-data vector S^(i) = [σ_1^(i) σ_2^(i), σ_1^(i) σ_3^(i), ..., σ_{N−1}^(i) σ_N^(i)], and y is the vector of the observed energies.

If SA (and D-Wave) is used as the optimization solver for a complicated problem, BOCS is more likely to find the minimum value with a relatively small number of data points. Under certain conditions, BOCS occasionally yields better performance than the typical reconstruction limit that was revealed in the previous study. We propose a few reasons that explain this observation. The first is that BOCS focuses only on finding the ground state, whereas the regression does not directly derive the ground state.
The typical reconstruction limit concerns the performance in inferring the correct coefficients of the black-box objective function. Once we find the correct coefficients, we can obtain the exact ground state of the black-box objective function. One might still find the ground state from approximate values of the coefficients when the number of data points is insufficient. However, below the typical reconstruction limit, the inferred coefficients change drastically compared to the correct ones, because the system undergoes a phase transition. Therefore, we cannot optimistically expect that the ground state of the black-box objective function is attained. Another reason is the difference between the sparse priors. In the previous study, the L1 norm was used for inference of the sparse coefficients. In addition, the hyperparameter of the Laplace distribution was, in some sense, optimized in their analysis. In contrast, BOCS utilizes the horseshoe distribution, which may infer the sparse coefficients more efficiently.
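For comparison, the L1-based reconstruction used in the previous study can be sketched with a standard Lasso solver; the feature construction and the regularization strength below are our illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def reconstruct_J(spins, energies, alpha=1e-3):
    """Infer the couplings J_ij (i < j) from (configuration, energy) pairs."""
    N = len(spins[0])
    # Row k holds all pair products sigma_i * sigma_j of configuration k
    S = np.array([[s[i] * s[j] for i in range(N) for j in range(i + 1, N)]
                  for s in spins])
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=100000)
    model.fit(S, energies)
    return model.coef_  # flattened estimate of the coefficient vector J_tilde
```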
Notice that the results seem to go beyond the theoretical reconstruction limit when ρ > D/M. This would be a finite-size effect. We are not arguing that our results suggest any advantage of BOCS beyond the theoretical reconstruction limit in the region where ρ takes a relatively large value. That being said, the possibility remains that BOCS, by making use of the horseshoe prior, reaches the theoretical reconstruction limit. We will investigate this further in a future study.

Summary and Future Directions
Black-box optimization aims at reducing the value of objective functions that are expensive to evaluate, and has broad applications in fields such as machine learning and robotics. In the present study, we tested BOCS by setting the SK model as the black-box objective function, and evaluated its minimum value. In the optimization phase of BOCS, the results obtained by SA and by the D-Wave quantum annealer showed better performance than those of SDP for the SK model. This is possibly due to the degree of deviation from the ground state of the tentative acquisition function.
We also compared the number of data points required to find the minimum value of the black-box objective function by BOCS and by regression with the L1 norm. BOCS found the minimum with fewer data points than regression with the L1 norm, even when the black-box objective function has a dense structure. Whereas the L1 norm is used in the regression to infer the sparse parameters, we employed the horseshoe prior in BOCS. The difference between the priors possibly explains this result, and it will be investigated in a future study. We emphasize that our problem setting is slightly different from regression under the assumption of the form of the black-box objective function: we search for the minimum without knowledge of the details of the structure of the cost function. In addition, our results suggest that our approach can find just the minimum more efficiently than regression with the L1 norm can find the parameters expressing the objective function.
In this study, we set the SK model as the black-box objective function, which is of the same form as the acquisition function. In other words, the black-box objective function can be expressed by the acquisition function in principle. The performance of BOCS has not been sufficiently investigated when the black-box objective function is not of the same form as the acquisition function; we will investigate this further as well. In addition, notice that, although we chose the SK model for the benchmark test, the application of BOCS is not restricted to objective functions with two-body interactions.
BOCS is in principle applicable beyond two-body interactions. The degradation of the performance of BOCS caused by a mismatch between the objective function and the surrogate model will be the next scope along the same line as the present study.
One of the reasons to employ the D-Wave quantum annealer as the optimization solver is to utilize quantum fluctuations. In order to investigate quantum fluctuations, we may use quantum Monte-Carlo simulations. In addition, we may investigate the nontrivial effects of quantum fluctuations known for non-stoquastic Hamiltonians. [41][42][43][44][45] However, longer times are then required for each optimization. We may therefore consider the approximate message-passing algorithm, depending on the form of the acquisition function. 46) In the present study, however, we found no significant difference between thermal and quantum fluctuations in obtaining approximate solutions of the acquisition function. Thus we may also employ other Ising solvers such as the CMOS annealer, 47) the Fujitsu Digital Annealer, 48) the Toshiba simulated bifurcation algorithm, 49) and FPGAs that perform quantum Monte-Carlo simulation efficiently. 50) In addition, the current D-Wave 2000Q performs hybrid computation on up to 20,000 variables on a complete graph. By using these solvers in BOCS, we can address more complicated optimization problems with a larger number of variables, beyond the investigations in the present study.

D-Wave devices return near-optimal solutions within a few microseconds, and hence are suited to repeatedly optimizing acquisition functions in the BOCS algorithm. One of the topics for future work will be finding a surrogate model suited for the distribution generated from D-Wave devices.
We conclude that a difference between thermal and quantum fluctuations is not observed in our results. However, the samples from the D-Wave quantum annealer are in fact generated under the influence of quantum fluctuations. Therefore, more suitable applications of the quantum device in the framework of Bayesian inference should be considered. This is another future research problem.