Analyzing the Effectiveness of Quantum Annealing with Meta-Learning

The field of Quantum Computing has gathered significant popularity in recent years and a large number of papers have studied its effectiveness in tackling many tasks. We focus in particular on Quantum Annealing (QA), a meta-heuristic solver for Quadratic Unconstrained Binary Optimization (QUBO) problems. It is known that the effectiveness of QA is dependent on the task itself, as is the case for classical solvers, but there is not yet a clear understanding of which are the characteristics of a problem that makes it difficult to solve with QA. In this work, we propose a new methodology to study the effectiveness of QA based on meta-learning models. To do so, we first build a dataset composed of more than five thousand instances of ten different optimization problems. We define a set of more than a hundred features to describe their characteristics, and solve them with both QA and three classical solvers. We publish this dataset online for future research. Then, we train multiple meta-models to predict whether QA would solve that instance effectively and use them to probe which are the features with the strongest impact on the effectiveness of QA. Our results indicate that it is possible to accurately predict the effectiveness of QA, validating our methodology. Furthermore, we observe that the distribution of the problem coefficients representing the bias and coupling terms is very informative to identify the probability of finding good solutions, while the density of these coefficients alone is not enough. The methodology we propose allows to open new research directions to further our understanding of the effectiveness of QA, by probing specific dimensions or by developing new QUBO formulations that are better suited for the particular nature of QA. Furthermore, the proposed methodology is flexible and can be extended or used to study other quantum or classical solvers.


Introduction
In recent years the field of Quantum Computing has gathered significant popularity, thanks to remarkable advancements that led to the development of several quantum computers of different architectures and technologies that can be used to tackle numerous problems.Although quantum computers are still limited both by their relatively small size and by the noise that limits the precision of the computation, the field is rapidly moving forward.Among the existing Quantum Computing paradigms, Quantum Annealing (QA) is a meta-heuristic that can be used to solve Quadratic Unconstrained Binary Optimization (QUBO) problems, a family of NP-hard optimization problems.The key idea of QA is to represent a QUBO problem as an energy minimization problem of a real and configurable quantum device.To do so, the problem variables are mapped onto physical quantum bits, or qubits.The quantum device is steered towards a state of minimal energy, called ground state, with a controlled evolution.The ground state corresponds to the optimal solution of the original QUBO problem.The devices that implement the QA process are called Quantum Annealers.
An important issue is that the quality of the solutions found by QA is limited by multiple factors.First of all, Quantum Annealers are physical devices which have a limited number of qubits and connections between them.This limits the size of the problems that they can tackle and requires to process the QUBO problem adapting it to the physical structure of the Quantum Annealer.A second important aspect is that the quality of the solutions found by QA depends on the behaviour of the underlying physical quantum system, which is very difficult to study.It is known that some problems appear to be more difficult to solve with QA [20][21][22], but understanding why is not a trivial task and still an open research question.
Most of the previous studies on QA compare its performance, in terms of required time for computation with respect to other heuristic solvers, rather than on the quality of the solutions it finds, i.e., its effectiveness.There are two ways in which one can study the effectiveness of QA, one is by analytically describing the underlying quantum behaviour and the other is to perform empirical experiments.A theoretical analysis has been performed for very small QUBO instances [23], which however are too simple to assess the effectiveness of QA when compared to other classical solvers.Furthermore, analytically analysing such a quantum system becomes rapidly very expensive and is generally impossible for problems of interesting size.On the other hand, the existing empirical studies on the effectiveness of QA have explored much larger problems but focus mainly on specific tasks such as feature selection [4], clustering [7,8] and classification [2,5,6], and therefore lack generality.
To the best of our knowledge, there is no published research which has investigated extensively how the characteristics of the problem impact the effectiveness of QA.For this reason, in this study we propose a novel empirical methodology for the analysis of the effectiveness of QA, based on the study of the characteristics of QUBO problems with a meta-learning approach.The general idea consists in generating many QUBO instances, defining a set of features which can describe them and train meta-models to predict whether QA would solve that problem or not.Our key contributions are as follows: • The design of an experimental methodology which can be applied to study the effectiveness of QA.This methodology can be used also for other quantum algorithms, such as QAOA [24] or VQE [25]; • The selection of ten classes of optimization problems, each one with specific characteristics, from which we generate approximately five thousand QUBO instances; • The design and the generation of a meta-learning dataset, which contains for each of the five thousand instances a selection of a hundred features based on probability theory, statistics and graph theory.We show that using them it is possible to effectively predict whether QA would solve a problem instance effectively or not.
We share the meta-learning dataset online for further research; • The analysis of the features of a QUBO problem with the strongest impact on the effectiveness of QA; 2 Background

QUBO and Ising Models
In order to use Quantum Annealing (QA) to tackle optimization problems, these should be represented with one of two equivalent formulations called QUBO and Ising, suitable for NP-Complete and some NP-Hard optimization problems [26,27].While the two are equivalent, the QUBO formulation is closer to traditional Operations Research, the Ising formulation is instead closer to Physics.The objective function in the QUBO model is given by Equation 1, where x ∈ {0, 1} n is a column vector representing the assignment of the binary variables x 1 , x 2 , ..., x n , n is the number of problem variables, y the cost, and Q ∈ R n×n is a real square matrix, either symmetric or upper triangular.
We will refer to combinatorial optimization problems written in the QUBO formulation as QUBO problems.Note that the QUBO formulation does not allow for hard constraints.An optimization problem with constraints can be transformed into a QUBO problem by introducing a quadratic penalty term multiplied by a penalty coefficient p.The idea is that the hard constraints are transformed in soft constraints, such that if they are violated a positive penalty p is added to the cost function making the cost of that variable assignment worse.Note that by using soft constraints we do not have the guarantee that the optimal solution will satisfy the constraints, which may happen frequently if the penalty coefficient p has a value that is too low.In general, a quadratic binary optimization problem with equality constraints formulated as Ax − d = 0, where d ∈ R m and A ∈ R m×n , can be transformed into the following QUBO problem: If the quadratic binary optimization problem also has inequality constraints, those need to be transformed first into equality constraints using binary slack variables.For example, if we have the following constraint: we can transform it into an equality constraint by introducing the binary slack variables x 4 and x 5 : x 1 + 2x 2 + 4x 3 + x 4 + 2x 5 = 3 There exist no general rule to choose the best number of slack variables, so multiple strategies can be followed.
A second useful formulation is the Ising model, which was developed to describe an energy minimization problem for a system of particles [26,27].The objective function of the Ising model is given by Equation 3, where s ∈ {−1, 1} n is the column vector representing the assignment of the n problem variables s 1 , s 2 , ..., s n also called spin variables, J ∈ R n×n is the coupling matrix that describes the quadratic terms of the objective function and has zero diagonal, b ∈ R n is the bias vector, which contains the linear terms of the objective function.The constant term c ∈ R is called offset.
A QUBO problem can be transformed into an Ising problem through a linear mapping of the variables.In particular, a binary variable x i is transformed into a spin variable s i according to the following conversion 1 :

Quantum Annealing and Quantum Annealers
Quantum Annealing (QA) is a meta-heuristic solver for QUBO problems.It is based on the Adiabatic Quantum Computation (AQC) paradigm, with some relaxations [20,[28][29][30][31].The idea is to represent the optimization problem as an energy minimization one, and then use a configurable device that exhibits the needed quantum behaviour to minimize it.Such a device, the Quantum Annealer, is composed by multiple qubits connected between each other.QA works based on a time evolution of the quantum system.The initial state of the system is a default one, easy to prepare, so that the qubits are in a state of minimal energy, i.e., the ground state.Then, the physical system with the difference that in our case a binary 0 is mapped onto spin 1 and binary 1 is mapped onto spin −1.
is evolved slowly over a short amount of time by introducing a dependency on the Ising coefficients of the problem one wishes to solve.This means, for example, slowly changing the magnetic fields the qubits are subject to.At the end of the evolution, the physical system will depend only on the problem and, if the evolution was careful enough, it will still be in the ground state.Since the state of minimal energy is also the solution of the optimization problem, measuring the state of the qubits will yield the values that the problem variables should have.
The evolution of the system in QA occurs in a noisy environment and is subject to quantum fluctuations, i.e., quantum tunneling, which helps it explore the solution space.The noise of the system and the duration of the evolution influence the results of QA, if the evolution is too fast the system will likely escape its ground state and find a worse solution, while if the evolution is too slow noise may build up and push the system again out of the ground state.Due to its stochastic nature, QA acts as a device sampling low-cost solutions in a similar way as other classical solvers do, such as Simulated Annealing.For this reason, QA is repeated multiple times in order to obtain samples of the final state of the quantum system.
The physical devices that implement QA are called Quantum Annealers.Currently, D-Wave Systems Inc. is the company that provides the Quantum Annealers with the largest number of qubits2 .For example, the D-Wave Advantage has more than 5000 qubits with a topology called Pegasus, where each qubit is connected to other 15 ones.
Solving a QUBO problem with a Quantum Annealer requires the following steps: 1. Formulate the problem as a QUBO or an Ising problem: the coefficients that are needed to configure the Quantum Annealers are those of the Ising formulation, as such the problem needs to be in this form.If the problem has a simpler formulation as a QUBO, the transformation is straightforward.Note that some problems can be formulated as QUBO or Ising easily, while others require more expensive processing.

Embed the problem on the topology of the device: since the Quantum
Annealer is a physical object, we must fit the problem we want to solve on it, accounting for the limited number of qubits and of the connections between them.This procedure is called minor-embedding [32] and maps each problem variable to one or more qubits.If multiple qubits are needed to represent a single problem variable, that is called a qubit chain.If the problem has a large number of quadratic terms, a substantial number of qubits may be needed to create all the physical connections.Figure 1 shows an example of how a simple problem can be mapped on a Quantum Annealer.Minor-embedding is an NP-Hard problem but polynomialtime heuristic algorithms are available [33][34][35] 3 .3. Evolution of the system and sampling of the solutions: once the minorembedding is done, the problem is transferred to the Quantum Annealer.First, the device is programmed with the problem coefficients, then we can perform a sequence of multiple evolutions to obtain the desired number of samples n s .Each sample requires three steps: (i) the evolution is run for the desired duration, called annealing time t a , (ii) the final state of the system is measured, which requires a read-out time t r dependent on the number of qubits used, and (iii) the device pauses shortly for cooling.
More formally, the energy of a system can be modeled with an Hamiltonian, H ∈ R 2 n ×2 n , and the evolution that occurs in QA is described by the time-dependent Hamiltonian H(t) that models the transition from the initial default Hamiltonian H i 4 and the Hamiltonian describing the problem H p : The coefficient A(t) decreases as the evolution progresses, while B(t) increases introducing the dependency on the characteristics of the problem, but their exact values depend on the hardware.At the beginning of the evolution B(t) is zero, while at the end A(t) is zero.Note that this is just a description of the underlying physical system and there is no need to compute this representation to use QA.
In the ideal Adiabatic Quantum Computing setting, it is possible to compute the exact annealing time needed to ensure the system remains in the ground state and finds the global optimum, this result dates back from a century ago [36].This optimal annealing time is inversely proportional to the smallest difference between the two smallest eigenvalues λ 1 (t), λ 2 (t) of H(t).Such difference is called minimum gap.Although this result may be useful to understand its behaviour, it is not applicable to QA because it is subject to noise.Furthermore, computing the eigenvalues of H(t) is prohibitive for all but the smallest problems.
To exemplify how this representation works, assume to have an Ising problem of n variables with coupling J and bias b, H p is a 2 n × 2 n matrix computed as follows: The matrix σ z is the Z-Pauli operator σ z acting on qubit i: with ⊗ being the tensor product and I the identity matrix.A useful property of H p is that it is a diagonal matrix that contains all the cost values for all possible variable assignments of the problem.Since it is diagonal, these values are also its eigenvalues and the corresponding eigenvectors encode the variable assignment that has that cost.The minimum eigenvalue of H p corresponds to the minimal cost and the corresponding eigenvector to the optimal variable assignment.
As an example, consider the following QUBO problem, which is minimized when The equivalent Ising formulation is: For this small instance we can compute H p easily.The matrices σ (1) z and σ (2) z are: H p is then equal to: The smallest eigenvalue of H p is λ min = − 1 2 and has a multiplicity of two corresponding to the first and last eigenvalues.Indeed, both x 0 = 0, x 1 = 0 and x 0 = 1, x 1 = 1 are optimal solutions of the problem.If we sum λ min with the offset of the Ising problem, 1 2 , we obtain 0, that is the same value of the QUBO cost function when x 1 = x 2 .

Studies on the Effectiveness of QA
Most of the previous studies on QA focus on its performance, by measuring the time required to solve a problem and comparing it to that of classical solvers.To the best of our knowledge there is no consensus on whether QA provides a general and consistent speedup compared to other traditional solvers for QUBO problems [20,31,37], while a recent paper claims substantial speedup for a quantum simulation task [38].Although some papers may claim a speedup, this is often based on measurements that only account for part of the process.Indeed, one should consider the time required by all phases: (i) formulating the optimization problem as QUBO or Ising, (ii) embedding the problem on the QA, (iii) sampling the solutions on the device and (iv) postprocessing the results if needed (for example by checking if the constraints are satisfied).Frequently, the efficiency of QA is measured by only accounting for the usage of the quantum device itself (programming time and the repeated annealing and read-out) while ignoring the time needed for minor embedding and for creating the QUBO formulation.This gives an incomplete picture of the technology that does not account for two significant bottlenecks.For example, it may be that in a certain situation QA is faster than other traditional methods in solving a specific QUBO problem, but that may not be the case anymore if one includes the minor embedding phase.Furthermore, if it is very computationally expensive to formulate the problem as QUBO, it may be more efficient to use other traditional methods that do not need a QUBO formulation at all.
When comparing the quality of the solutions found by QA and classical solvers, i.e., their effectiveness, the published literature usually focuses on problems related to specific fields or even to very specific instances of those problems.Due to this, there is still a limited understanding of how would QA compare in a more general setting.For example, the effectiveness of QA has been analyzed for feature selection [4], classification [2,5,6] and clustering [7,8], which are typical machine learning tasks.In the field of chemistry, QA has been applied and analyzed to find the equilibria of polymer mixtures [12], to find similarities between molecules [13] and to find their ground state [14].The effectiveness of QA in solving problems related to logistics has been analyzed too, for example in solving the Nurse Scheduling Problem [16] and in optimizing the assignments of the gates at the airport [19].
Previous research also studied the effectiveness of QA from a theoretical perspective by representing analytically the evolution of the time-dependent Hamiltonian H(t) (see Equation 4) and computing the probability of escaping the ground state [23].This approach is however limited by the fact that the size of the Hamiltonian grows exponentially on the number of QUBO problem variables n, and the analytical analysis of the Hamiltonian becomes rapidly impractical for all but the smallest problems.An alternative way is to adopt an empirical approach, by using the outcome of multiple experiments to probe the underlying physical system [39].The idea is to allow the evolution to progress up to a certain intermediate stage and then drastically accelerating it (i.e., a quench, according to D-Wave terminology), observing how the effectiveness changes based on when was the evolution accelerated.While this approach allows to tackle large problem instances, applying the acceleration at different stages of the evolution requires to repeat the experiment a large number of times and therefore this approach too is very resource intensive.
To overcome the limitations of the methods adopted in the literature, this paper proposes a new empirical approach to study how the quality of the solutions found by QA is impacted by the characteristics of the problem.To achieve this, we first collect a dataset of problem instances belonging to 10 selected problem classes and solve them using both QA and three classical solvers.Then, for each problem instance, we compute a set of features describing various characteristics, from the distribution of the bias coefficients of its QUBO formulation to the topology of the graph that describes the instance once it has been embedded on the Quantum Annealer.Using this dataset we train a machine learning classifier to identify whether QA was able to find a good solution for that instance and, finally, use it to assess which are the most important problem features.

Meta-Learning Dataset Generation
In this section, we present the methodology used to generate our meta-learning dataset, on which we train the meta-models to predict the effectiveness of QA.We publish this dataset online for future research.First, we describe how we select the ten classes of problems we want to solve with QA and the strategies we use to generate the five thousand instances.Then, we describe how we evaluate the effectiveness of QA, in terms of closeness to the optimal solution of the problem and by comparing QA with other classical methods (Simulated Annealing, Tabu Search and Steepest Descent).Third, we describe the representations of the QUBO problem we used to compute the approximately one hundred features used to train the meta-models.Finally, we describe how we solve the instances with QA and with the classical solvers, with a particular focus on the choice of the optimal hyperparameters of the solvers.

Selection of Problems and Instances
We identify a selection of ten different optimization problems that exhibit different characteristics: some have constraints, others do not; some have linear terms, others do not; some have a large number of quadratic terms while others do not, etc.The details on their formulations are reported in Appendix A and the details on the generation of the instances are in Appendix B.
The first group contains five classes of optimization problems defined over a graph: Max-Cut, Minimum Vertex Cover, Maximum Independent Set, Max-Clique and Community Detection.They were selected for the following reasons.Both the Max-Cut and Community Detection problems have a straightforward QUBO formulation that does not require penalties to represent constraints.The Max-Cut, Maximum Independent Set and Minimum Vertex Cover problems share the same quadratic terms in their QUBO matrix, but not the diagonal (i.e., the linear terms or bias).The Max-Clique problem is formulated as a Maximum Independent Set problem but defined on the complement graph.The Community Detection problem has a very dense QUBO matrix as there are quadratic terms between all variables and is a relevant problem in Machine Learning [40].Since these problems are formulated on a graph, we apply them on four different graph topologies: Erdös-Renyi, Cyclic, Star and 2d-grid.Note that in order to have a diversified set of instances we introduce small random perturbations to each topology, consisting in few edge insertions and deletions.The number of insertions and deletions depends on the number of nodes of the graph.More details are reported in Appendix B.1.
The second group of five optimization problems contains: Number Partitioning, Quadratic Knapsack, Set Packing, Feature Selection and 4 × 4-Sudoku.These are a more heterogeneous set than the previous graph-based problems ad so require ad-hoc strategies to generate their instances which we detail in Appendix B.2. Similarly to the Max-Cut and Community Detection problems, the Number Partitioning problem has a straightforward QUBO formulation with no penalty terms to represent constraints.Similarly to the Community Detection problem, the Feature Selection problem has a dense QUBO matrix with quadratic terms between all variables.Finally, the Quadratic Knapsack, Set Packing and 4×4-Sudoku problems are all Constraint Satisfaction Problems, each with different types of constraints.In particular, the Quadratic Knapsack problem has inequality constraints that need to be converted in equality constraints using slack variables.
We generate multiple instances of all the problem classes we selected.Concerning the size of the problem instances, measured in the number of problem variables, there are two constraints to take into account.First, the D-Wave Quantum Annealer that we use has more than 5000 qubits but, due to their limited connectivity, it is generally possible to tackle problem instances up to between 100-200 variables depending on the structure of the QUBO problem.This is due to the minor-embedding phase.Second, we want the instances to be representative of problems that are not trivial and with a Hamiltonian that could not be analyzed analytically.In order to provide a more complete picture, we are also interested to assess the impact of the distribution of the solution space of the problem.This, formally, corresponds to the set of eigenvalues and eigenvectors of the Hamiltonian of the problem H p (see Section 2.2).Unfortunately, it is impractical to compute them for instances of more than 32 problem variables, which may be too small and easy to allow a comparison on the effectiveness of different solvers.For this reasons we decided to create two separate sets of instances: • One set of large instances, with 5114 instances of between 69 to 99 variables, the upper range of what can be tackled with the QA; • One set of small instances, with 246 instances of between 27 to 32 variables.With this set of instances we can do a more complete analysis which includes also the distribution of the solution space.
The number of problem instances for each optimization problem is summarized in Table 1.Notice that the instances of the 4 × 4-Sudoku problem are included only in the small instances set, since the largest possible instance is unique and it has at most 64 variables.All the generated instances are satisfiable and, when needed, the penalty term coefficient p used in the QUBO formulation is optimized with a Bayesian Search5 [41,42], in order to maximise the number of feasible solutions for Simulated Annealing.

Evaluating the Effectiveness of a Solver
In this section, we describe how to evaluate the effectiveness of a solver and, in particular, of QA.Both QA and the traditional solvers we compare it to are stochastic and are executed multiple times to obtain a set of variable assignments that aim to minimize the cost function, which we call a set of samples.A sample is represented by an assignment of the decision variables x and by the related cost value y.
We solve all instances with QA, Simulated Annealing (SA), Tabu Search (TS) and Steepest Descent (SD).Our definition of how much a solver is effective is based on whether it finds samples that meet some quality constraints.While for the small instance set it is possible to compute the global optimum, for the large ones it is not feasible to do so and therefore we define the effectiveness in relative terms with respect to the other solvers.
In particular, we evaluate the effectiveness of QA on the large instances set by comparing its samples with those of the traditional heuristic solvers.We define the samples associated to the best cost value for a solver S as y S min .An instance I is QAover-all if the best solution found by QA is at least as good as the best one found by SA, SD, TS combined.More formally, if y QA min ≤ min {y SA min , y T S min , y SD min }.Comparing QA with a pool of multiple solvers results in a stricter evaluation of its effectiveness, but the condition that QA has to be at least as good as all the other solvers combined may be too strict.For this reason, we also compare QA with each individual solver.An instance I is QA-over-S if the best solution found by QA is at least as good as the one found by solver S, hence y QA min ≤ y S min .For the small instances we can perform a deeper analysis of the effectiveness because we can explore the full solution space and find the global optimum.This is in practice done by computing the Hamiltonian of the problem, H p , which is a diagonal matrix enumerating the eigenvalues λ, sometimes called energy, of all the variable assignments.The eigenvalue is equivalent to the cost function y but does not include possible constant offsets c from the Ising formulation, therefore y = λ + c.The variable assignment x associated to an eigenvalue λ can be computed starting from the corresponding eigenvector of H p .The global optimum of an instance is the assignment x min corresponding to the minimal eigenvalue of H p , λ min .We will refer to the maximum eigenvalue as λ max .
We define a sample with energy λ as ǫ-Optimal if the following condition holds: The ǫ-Optimality condition describes how close is the eigenvalue of a sample to the solution of the instance.The coefficient ǫ ∈ [0, 1] allows to restrict the interval under which λ is considered close enough to the optimal eigenvalue λ min .Notice that if ǫ = 0 only the global optimum of the instance meets the constraint in Equation 8.
We also define a sample x as Hamming-Optimal (h-Optimal) if it differs from any solution x min in at most one decision variable.This corresponds to check the Hamming distance between a sample and a solution of the instance:

Meta-Learning Features
In this section, we introduce the features we define to describe a problem instance.We rely on a selection of metrics used in statistics and probability theory, such as the Gini coefficient [43], the Herfindahl-Hirschman index [44] and the Shannon entropy [45], as well as metrics used in graph theory, such as the spectral gap, the radius a graph, its diameter and its connectivity.In total we compute 107 features, which we describe in detail in Appendices C and D. The features we compute can be grouped in multiple domains of analysis.Overall we identify seven domains, among which we describe the three most relevant ones: For the LogIsing and EmbIsing domains we compute several features on different mathematical objects, such as the coupling matrix J, the Laplacian matrix of the corresponding graph and the bias vector.We call such objects components.
We can also identify sets of features that refer to the same mathematical object, but are computed on different domains.For example, both LogIsing and EmbIsing domains include features computed on the bias.We refer to them as component sets and allow us to perform an analysis of the importance of those mathematical objects that is orthogonal to that of the domains.We identified the following component sets: Coupling, Bias, Laplacian, Structural Adjacency (StructAdj), Structural Laplacian (StructLap), where StructAdj and StructLap gather features related to the binarized versions of the coupling and Laplacian matrices.

Hyperparameter Optimization of the Solvers
Since the goal of this study is to compare the effectiveness of different solvers, it is essential to ensure that each solver is using the best hyperparameters.Indeed, it is well-known in many fields that comparing methods that are not consistently optimized leads to inconsistent results that cannot be used to draw reliable conclusions [46,47].The same applies in our case.
We optimize the hyperparameters of each solver (QA, SA, TS, SD) on the instance with the largest minor-embedding on D-Wave Quantum Annealer for each optimization problem class.The goal is to identify the hyperparameters that will lead the solver to find the variable assignment with the lowest cost y.Once the optimal hyperparameters have been found, they are used to solve all instances of the corresponding problem class.We optimize separately the hyperparameters used for the large and small instances sets.To optimize the hyperparameters of the classical solvers we use the standard QUBO formulation while for QA we use the embedded QUBO formulation: we followed this strategy because the embedded QUBO formulation is required only for QA.In this way, we have a fair comparison between different solvers since, for each one of them, we take into account only the necessary steps to solve a QUBO instance.

Optimal Hyperparameters of Quantum Annealing
QA has several hyperparameters that can be optimized 6 , some of which refer to the evolution process as a whole while others allow to fine-tune it at the level of each individual qubit.The access to the D-Wave Quantum Annealers is limited and for such a large set of instances we have devised a methodology to optimize the hyperparameters we believe are the most important: the annealing time t a and the number of samples n s .In order to perform an efficient optimization within the available resources, we define a fixed computational budget T for each instance.Using the default annealing time, 20µs, and drawing 100 samples requires, in the worst case, at most 37 ms.In our experiments we allocated T = 70 ms and T = 300 ms per each problem instance.
The optimization is performed by iterating over 10 values for t a , approximately equidistant from each other, between 5 µs and 200 µs.Given t a , the number of samples n s is computed as the maximum value allowed within the computational budget T , according to Equation 10.For T = 70 ms, n s is between 145 and 537 while, for T = 300 ms, n s is between 766 and 2826.
The term t p ≃ 15 ms is the time needed to program the instances on the Quantum Annealer, t r is the read-out time, needed to read the results of the annealing process, and ∆ ≃ 20 µs is the delay applied after each read-out operation.The read-out time t r is unknown a-priori because it depends on the size of the embedded problem.Based on empirical observation we use t r = 75 µs for small instances and t r = 150 µs for large instances.We choose t a and the related n s which provide the sample with the lowest energy.If for multiple pairs (t a , n s ) QA finds samples with the lowest energy, we choose the pair with the smallest t a .
Since the results we obtained when using both computational budgets are very similar, we report those for T = 70 ms.The selected hyperparameters are reported in Table 2. Notice that the annealing time t a for the large instances is often smaller than for the small instances.This highlight that, for large instances, a larger t a does not improve the effectiveness of QA, at least in the range of values we considered and for the number of samples it allows to draw.Such result suggests that QA may require an additional optimization, for example, of the annealing schedule, which is not straightforward and it goes beyond the scope of this study.

Optimal Hyperparameters of the Classical Solvers
The hyperparameters of Simulated Annealing (SA), Tabu Search (TS) and Steepest Descent (SD) are optimized with the following procedure.For the optimization of these methods we do not use a fixed computational budget because the technology is fundamentally different and, due to the various stages required by QA, it is not trivial to define such a comparison in a way that is fair.First, we fix the number of samples to n s = 200 which is a value comparable to that used for QA.For half of the the large instances QA uses more samples than the classical solvers, while for the remaining half the opposite is true.We optimize the hyperparameters with a Bayesian Search of 100 iterations [41,42].The results are available in Appendix E. For TS, we optimize the number of restarts of the algorithm and the initialization strategy.For SD, there are no hyperparameters to optimize, except for the number of samples, which we have already set.For what concerns SA, we optimized the number of sweeps7 , the schedule8 and the initial state generator.We noticed however that hyperparameters we found for SA produced worse results compared to the default ones in our following analysis, which may be due to the sensitivity of SA to some of them.For this reason, we retain the default hyperparameters of 1000 sweeps, a geometric beta schedule and a random initial state generator.

Meta-Model Training and Optimization
In this study we aim to identify which are the characteristics of a problem that impact the effectiveness of QA.We do this by first training a classification model on the dataset we have created in order to predict whether QA would solve that instance well or not based on its features.Since the classifier is trained to predict the outcome on another experiment, it is called a meta-model.Once the meta-model is trained, we can use it to probe how important are the various features.
We train the meta-models with Random Forest, AdaBoost, XGBoost and Logistic Regression, using as input data either a specific domain (e.g., LogIsing, EmbIsing, SolSpace) or a specific component set (e.g., Bias, Coupling, Laplacian), which are described in Section 3.3.The target labels are described in Section 3.2 (i.e., Optimal, ǫ-Optimal, h-Optimal) and they are binary, according to whether the solver meets that effectiveness condition or not.
The first step is to train the meta-model and optimize its hyperparameters to ensure it is effective in predicting the label.In order to measure the effectiveness of the meta-models, we have to account for the significant class imbalance of the labels towards the negative class, i.e., instances that are not solved well by QA (see Section 5).We use Balanced Accuracy (BA) to evaluate the meta-models because it is robust to class imbalance.Given the true positives as T P , the true negatives as T N , the number of positive labels in the data as P and the number of negative labels as N , the Balanced Accuracy BA is computed as: BA = 1 2 The training and optimization of the meta-models is performed with 5-fold Nested Cross-Validation.First, we create a 5-folds testing split with a training fold and a testing one which we will use to train and evaluate the meta-model.In order to find the optimal hyperparameters for the meta model, we split each training fold with a further 5-folds split, the optimization split.This results in 5 optimization splits for each of the 5 training folds of the testing split and is aimed at preventing the overfitting of the meta-models.The splits are all stratified with respect to the problem class of the instances, to ensure every split has an equal distribution of the problem classes.All meta-models are trained on the same data splits and we perform different splits for the large and small instances.The hyperparameters of the meta-models are optimized according to a Bayesian Search [41,42] exploring 50 configurations, we select those that provide the best Balanced Accuracy on the optimization split.
Once the meta-models have been optimized, we use them to assess which problem characteristic, i.e., feature, is most important.We use Permutation Feature Importance (PFI), which evaluates how the accuracy of a model drops when the values of a certain feature are shuffled.The idea is that the more important a feature is the larger will be the drop when the values of that features are shuffled.For each feature the process is repeated multiple times and the corresponding importance is given by the mean of the drop in accuracy observed.

Results and Analysis
In this section we provide the most relevant insights of our analysis regarding the effectiveness of QA.We have three goals: (i) determine hitch classes of problems are more difficult to solve with QA, (ii) understand whether it is possible to predict the effectiveness of QA based on the features we have identified; and (iii) discover the domains, the component sets and the features that impact the effectiveness of QA.To do so, we describe the results obtained by solving the instances with QA and with the other classical solvers.Then, we describe the results of the validation of the metamodels and of the Permutation Feature Importance performed on their features.We publish online a dataset with all the instances we generated, the features we computed and the samples obtained for each solver.9

Effectiveness of QA for Large Instances
In this section we discuss the effectiveness of QA compared to the other classical solvers (SA, TS and SD) on the large problem instances.Table 3: Comparison on the percentage of problem instances in which QA is at least as effective as a specific solver (QA-over-SA, QA-over-TS and QAover-SD) or as all of them combined (QA-over-all).
sample found by QA is at least as good as that found by a specific solver (QA-over-SA, QA-over-TS and QA-over-SD) or by all of them combined (QA-over-all).
As a general comment, we can observe that for less than half of the problem classes (4 out of 9) QA solves effectively more instances than at least one classical solver, while for most of the problem classes (7 out of 9) QA is more effective than at least one of the classical solvers for some particular instances.However, if we combine all classical solvers QA is more effective only in three problem classes but mostly to a limited extent.Only for Max-Cut QA shows a consistently high effectiveness.These results confirm that the effectiveness of QA depends on the problem class, as is the case for classical solvers, which is consistent to what observed in previous studies [20][21][22].If we compare QA and classical solvers throughout the problem classes, we can see that QA is very frequently more effective than SA, while it is more effective than TS or SD only on some specific problem classes.As a result, we conclude that comparing QA only with SA, without considering other solvers, is not the best practice to evaluate the effectiveness of QA.
Regarding the characteristics of the problem classes, a first observation we can make is that QA is more effective on problems that do not require penalties to represent constraints: Max-Cut, Community Detection and Number Partitioning.This suggests that the presence of constraints is a factor that makes a problem more difficult to solve with QA.The reason for this may be due to the type of quadratic terms introduced by the penalties which could open new research directions in whether one could use a different formulation for the same constraint that is more suitable for QA [48].Furthermore, remember that Max-Cut, Maximum Independent Set and Minimum Vertex Cover share the same Ising coupling matrix J, with the exception of a multiplicative factor, but have a different bias vector b.We can observe how Max-Cut is the only one among them that is solved effectively by QA, suggesting that the bias structure plays an important role as well.
Lastly, a high number of quadratic terms (i.e., a dense coupling matrix J) does not always negatively affect QA.In particular, both Community Detection and Number Partitioning have a dense coupling matrix but still 13% of Community Detection instances and the 30% of Number Partitioning instances are solved effectively with QA.

Effectiveness of QA for Small Instances
In this section we discuss the effectiveness of both QA and the other classical solvers (SA, TS and SD) on the small problem instances.For these instances we can compute the cost associated to all variable assignments and the global optimum.We do this by computing the Hamiltonian of the problem H p and use its eigenvalues (i.e., its diagonal).We also compute the maximum energy values needed to assess the ǫ-Optimality for the samples of QA.Table 4 compares the effectiveness of the solvers according to the labels defined in Section 3.2, i.e., if the solver finds the global optimum (Optimal), if the energy of the best sample is close to that of the global optimum (for ǫ-Optimal we use ǫ = 10 −5 ), if the variable assignment of the best sample has an Hamming Distance of at most 1 with any of the global optimum solutions (h-Optimal).
Consistently with what observed for the large instances, QA is less effective than the classical solvers on all the effectiveness conditions.If we compare 10 −5 -Optimal and h-Optimal, we can see that QA finds more h-Optimal samples than 10 −5 -Optimal samples, as opposed to the other solvers.This indicates that QA finds more easily samples which are close to the optimal ones in terms of Hamming distance rather than energy.Notice that in this experiment SD is the most effective solver, this may be related to the small size of the instances which may make them relatively easy to solve with simple strategies.
As done for the large instances, we compare the effectiveness of the solvers on the problem classes.Table 5 shows the fraction of problem instances in which the solver finds the global optimum (i.e., Optimal).Overall, as opposed to what we observed for the large instances, on the small ones QA is never more effective than the classical solvers.QA seems to be more effective for problems that are defined over graphs (Max-Cut, Maximum Clique, Community Detection) compared to the ones that are not.Note however that on two graph problems, Minimum Vertex Cover and Maximum Independent Set, QA performs significantly behind the classical solvers.Based on the analysis of the large instances (see Section 5.1) we observed that the bias of the problem seems to play a role in affecting the effectiveness of QA.This observation is confirmed here as well since QA is much more effective in solving Max-Cut problems than in solving Maximum Independent Set and Minimum Vertex Cover ones.We can also observe that the effectiveness of QA is quite poor for Set Packing and Feature Selection, being significantly behind the classical solvers.Interestingly, on Max-Cut and 4×4-Sudoku problems almost all the solvers find the global optimum, while for the Quadratic Knapsack hardly any instance can be solved optimally at all, with the most effective solver being TS with 4% of the instances solved optimally.
As a second analysis we study the effectiveness of QA in sampling solutions with a variable assignment that is close to one of the optimal ones in terms of Hamming distance (i.e., h-Optimality).The results reported in Table 6 are consistent with the previous ones in which we assessed the ability of the solvers to find the global optimum, see Table 5.We should note QA exhibits much better effectiveness, when measured in this way, being able to sample solutions close to the optimal ones in the majority of cases.For example, the effectiveness on the Number Partitioning problem goes up from 33% to 96% while for Minimum Vertex Cover goes from 17% to 58%.The results also confirm that QA is quite effective for problems that do not require penalties to model constraints (Max-Cut, Community Detection and Number Partitioning).On the other hand, QA is still ineffective for the Feature Selection problem.Quadratic Knapsack remains very challenging for all the solvers.

Meta-Models and Feature Importance Analysis
The goal of our analysis is to identify the domains and the component sets whose features allow the meta-models to predict well the effectiveness of QA.We limit our analysis to the meta-models trained to predict whether QA will be at least as effective compared to all the classical solvers combined (target label QA-over-all) for the large instances.We also analyze the meta-models predicting whether QA will find the global optimum on the small instances (QA-Optimal).The full results are available in the online appendix.The first important question is whether it is possible to train meta-models able to predict the effectiveness of QA.The results in terms of Balanced Accuracy are shown in Figure 2 (target QA-over-all) and in Figure 3 (target QA-Optimal) for the two most effective classifiers.We can immediately see that for several domains or components of the large instances the Balanced Accuracy is approximately 85%, while for the small instances it is often exceeding 90 or even 95%.This, combined with the fact that we selected a heterogeneous set of problem classes and instances that are solved by QA with different degrees of success, allows us to conclude that it is indeed possible to build accurate meta-models to predict the effectiveness of QA.These meta-models can then be used for many purposes, among which, studying the behaviour of this technology.
We now analyse which are the domains or components that produce the best metamodels.Concerning the domains, the more informative ones are those related to the graph structure of an Ising problem (LogIsing and EmbIsing) which, based on the high accuracy of the meta-model, we conclude are very informative on the effectiveness of QA.Among the domains, the distribution of the values in the Q matrix of the QUBO problem (MatStruct) is less informative, this can be explained by the fact that the problem that is actually solved on the quantum device is represented as Ising and not as QUBO.
Secondly, if we consider the domains related to the distribution of the energies of an Ising problem which are only available for the small instances (SolSpace, NorMul, 25%-SolSpace, 25%-NorMul in Figure 3) it is possible to build meta-models which, again, predict well the effectiveness of QA with a Balanced Accuracy well above 90%.This result shows that the effectiveness of QA also depends on the distribution of the energies of the problem, i.e., on how the cost y of the QUBO problem is distributed.Notably, using features based on the solution space allows to achieve comparable Balanced Accuracy with other domains, indicating that both are equally very informative.This is a particularly good result because it is relatively easy to compute the features for the other domains once the problem has been formulated as QUBO.
If we look at the orthogonal grouping of the features, by component sets, we notice that with the Bias, Coupling the Laplacian component sets it is possible to train at least a meta-model with good Balanced Accuracy (higher than 80%).On the other hand, the Structural Adjacency (StructAdj) and Structural Laplacian (StructLap) are the least informative and, in particular for large instances, do not allow to build a meta-model better than random guess.Since both StructAdj and StructLap are computed on the binarized problem structure, they only account for how the problem variables are connected and not the coefficient values, this means that the structure of the problem alone is not informative at all.The bias and the coupling of an Ising problem, together with the Laplacian matrix related to the graph of the Ising problem, are the most informative on the effectiveness of QA.
In general, we can confirm that the characteristics of the problem are important to determine the effectiveness of QA but one must account for the actual coefficient values of the problem and, preferably, use features derived from the Ising formulation.This could open new research questions on whether one can change the formulation of a problem so that its coefficients have a different distribution that is more adequate for the QA.Furthermore, since the coefficients are a function of the problem class and the data, it may be possible to identify which types of graph topologies may be more or less difficult to tackle based on the distributions of the coefficients that they would produce.We now move to analyzing which specific features are the most important among the ones we identified.We limit our analysis on feature importance to the XGBoost and AdaBoost meta-models trained on two domains (LogIsing and EmbIsing) and on two component sets (Bias and Coupling).In the case of meta-models related to smallinstances, we include in the analysis also the domain SolSpace.The best five features of each of these domains and component sets are listed in Table 7 (target QA-over-all, large instances) and in Table 8 (target Optimal, small instances).

Domains Feature Importance
We consider in particular the domains LogIsing and EmbIsing.The majority of the most important features are related to the bias and to the coupling of the problem, which is consistent with our previous analysis.Some features are related to the distribution of the values of the bias and the coupling (Gini index, Shannon entropy, Herfindahl-Hirschman index), while other features are related to precise values of these mathematical objects (minimum value, maximum value).
In particular, notice that Bias gini index (related to the distribution of the bias), Bias condition number (related to the values of the bias) and Coupling max eigval (related to the eigenvalues of the coupling) are among the best features in the majority of the meta-models.We deduce that the distribution of the values and the values themselves of the bias are important to study the effectiveness of QA, together with the eigenvalues of the coupling.This is again an interesting observation because it would allow us to identify in advance whether a problem could be well-suited for QA.Notice also that the number of qubits needed to embed the problem on the Quantum Annealer (Graph Structure qubits) is important, for one meta-model, to predict the effectiveness of QA for the large instances, but not for the small instances.This difference may be linked to the fact that, for the small instances, the number of qubits required after the minor-embedding process is limited and therefore has a lower impact.
We analyze, for these two domains, the least important features too.The majority of them is related to the structural adjacency and to the structural Laplacian matrix.This confirms that the sole structure of a problem is not sufficient to determine the effectiveness of QA.Thus, we must consider also the coefficient values between the variables.
For what concerns the SolSpace domain (see small instances in Table 8), notice that both the meta-models have the same top three features, although in different order: such features are mostly related to the distribution of the eigenvalues of the problem (gini index and grouped hhi), which plays therefore a role in determining the effectiveness of QA.

Component Sets Feature Importance
If we consider the Bias component set, the majority of the most important features are related to the distribution of the values of the bias, both considered in the LogIsing domain and in the EmbIsing domain.In particular, observe that the Gini index and the condition number of the Bias, computed in both domains, are among the five  most important features, as they were for the meta-models trained on the domains LogIsing and EmbIsing.This corroborate our statement that the distribution and the values of the bias are important to determine the effectiveness of QA.
If we instead consider the Coupling component set, we observe that most of the important features we identify are related to the values and to the eigenvalues of the coupling.Features computed in both the LogIsing and EmbIsing domains are important and the most important features are related to the values of the eigenvalues of the coupling.Notice that the spectral gap and the Gini index of the coupling, computed both in the LogIsing and EmbIsing domains, are shared with almost all meta-models as some of the most important features.We conclude that the eigenvalues of the coupling and their distribution are important to analyze the effectiveness of QA.

Feature Correlation with Target Label
In the previous analysis we identified the features that are the most important for the meta-models.Some of these features, furthermore, are important also if computed in different domains and for different meta-models.These features are, for the LogIsing and EmbIsing domains: Bias gini index, Bias condition number, Coupling max eigval; while for the SolSpace domain: gini index, grouped hhi, and third quartile.We want now to give an intuition of which values of these features determine a low or an high effectiveness of QA.For each feature we identified, we compute the Spearman rank with the targets of the meta-models (QA-over-all and QA-Optimal).
For most of these features, the Spearman rank does not highlight a strong correlation with the value of the target (in general, the Spearman rank in absolute value is below 0.40).This suggests that the complexity of the underlying behaviour might require more powerful tools.Two exceptions are given by gini index and grouped hhi in domain SolSpace, which have respectively a Spearman rank of −0.596 and 0.537.This mean that as the Gini index of the eigenvalues increases, the Ising problem becomes less difficult to solve.On the other hand, high values of grouped hhi of the eigenvalues implies that the Ising problem is difficult to solve with QA.

Conclusions
In this paper, we have studied the effectiveness of QA with an empirical approach based on meta-learning models.
First, we select a pool of ten optimization problems which can be formulated as QUBO.Then, we generated more than five thousand instances, based on different problem sizes and structures.In particular, we created two sets, one containing large instances and another with small ones for which we can study also the properties of the whole solution space.
As a second step, we define a set of more than a hundred features to describe each problem instance.The features are heterogeneous, based on graph theory or on metrics largely used in statistics, probability theory and economics and account for the structure of the problem, its coefficients and its solution space.We gather all the features into a meta-learning datasets, which we share on GitHub for further research.
Third, we compare the effectiveness of QA and three classical solvers: Simulated Annealing (SA), Tabu Search (TS) and Steepest Descent (SD).We observe that QA is frequently less effective than the classical solvers, for both the large and small instances, except for specific problems.In particular, we have observed that QA solves more effectively problems with no constraints in their formulation (Max-Cut, Number Partitioning and Community Detection).
Lastly, we train different classification algorithms to predict whether QA will solve an instance effectively or not and show that it is possible to do so accurately.We then use the meta-models to probe the behaviour of QA.In particular, by analyzing the feature importance of the meta-models, we can observe how the distribution of the bias and the coupling of a problem play a key-role in determining whether QA will effective in solving it.
In conclusion, we successfully applied an empirical analysis of the effectiveness of QA based on meta-learning.Possible future directions include the analysis on how different distributions of the coupling and bias values relate to the effectiveness of Quantum Annealing.Such results could be correlated to specific kinds of problems.For example, problems characterized by graphs with a power-law distribution (e.g.problems involving social networks) may be more or less difficult to tackle than those characterized by regular graphs.This can be done, for example, by defining new features which describe how much the distribution of the bias and the coupling differs from another distribution, e. g. from a Gaussian or a uniform distribution.Thanks to its generality, the methodology can be easily extended to other heuristic solvers of QUBO problems, such as the Variational Quantum Eigensolver (VQE) [25] and Quantum Approximate Optimization Algorithm (QAOA) [24], providing a useful tool to further our understanding of how to use these quantum algorithms effectively.

Supplementary information.
The meta-learning dataset with all the problem instances, the graphs and features, as well the samples obtained with each solver can be accessed here: https://github.com/qcpolimi/QA-MetaLearning.

Declarations
This version of the article has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections.The Version of Record is available online at: Funding.We acknowledge the financial support from ICSC -"National Research Centre in High Performance Computing, Big Data and Quantum Computing", funded by European Union -NextGenerationEU.We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources and support.We also acknowledge the support and computational resources provided by E4 Computer Engineering S.p.A.

A.1.1 Max-Cut
Given the graph G = (V, E), a cut over graph G induced by the set of vertices S and V − S is the set of edges which connect vertices in S with vertices in V − S. The Max-Cut problem aims at finding the largest cut which can be induced on graph G [26].Consider, for each vertex i of graph G, a binary variable x i , such that Then, the Max-Cut problem is formulated as written in Equation A1

A.1.2 Maximum Independent Set
Given the graph G = (V, E), an independent set is a set S of vertices which are not adjacent to each other.The Maximum Independent Set problem aims at finding the largest independent set in G. Consider, for each vertex i of graph G, a binary variable x i , such that Then, the Maximum Independent Set problem can be formulated as a QUBO problem as it is written in Equation A2 [49].
The penalty term b outweigh the term a, to penalize the choice of having two connected vertices in S.

A.1.3 Minimum Vertex Cover
Given the graph G = (V, E), a Vertex Cover C is a set of vertices such that all the edges E are connected at least to a vertex in C. The Minimum Vertex Cover is the Vertex Cover with the smallest number of vertices [26].Consider, for each vertex i of graph G, a binary variable x i , such that The Minimum Vertex Cover can be formulated as a QUBO problem as written in Equation A3. min As for the Maximum Independent Set problem, also in the Minimum Vertex Cover problem the penalty term p must be chosen large enough, to penalize the selection of a set of vertices which is not a vertex cover of G.

A.1.4 Max-Clique
Given the graph G = (V, E), a clique C is a sub-graph of G such that C is fully connected.The Max-Clique problem aims at finding the clique C of G with the highest number of nodes.The QUBO formulation of the Max-Clique problem has the same QUBO formulation of the Maximum Independent Set, but it leverages the complement graph Ḡ of G [49].Please refer to the Section A.1.2.

A.1.5 Community Detection
Given the graph G = (V, E), the aim of the Community Detection problem is to partition G in two communities C 1 , C 2 of similar nodes [40,50].Remember that the graph G is represented by the adjacency matrix A, where the (i, j) entry is A ij = 1 if nodes i and j are connected by an edge, otherwise it is A ij .The degree vector d keeps track of how edges are incident to every node of the graph.The similarity between nodes is expressed by the modularity matrix B, which is computed according to the following formula: The binary variable x i is related to the node i ∈ V of graph G and it is defined as follows: The matrix Q is proportional to B. In particular, it is equal to: The negative sign is due to the fact that QUBO problems require be problems in order to be solved with a QA.The formulation of the problem is therefore given by: min

A.2.1 Number Partitioning
Given a set of real numbers Z = {z 1 , z 2 , ..., z n }, the Number Partition problem aims at finding two partitions Z 1 , Z 2 of Z in order to minimize the following expression [26]: The binary decision variable x i is defined as follows: The QUBO formulation of the Number Partitioning problem is then derived from the expression A4, and it can be written as follows:

Set Packing
a collection of n sets S 1 , S 2 , ..., S n , each one with a given capacity c 1 , c 2 , ..., c n , the Set Packing problem aims at finding a selection of sets providing the largest total capacity, while respecting m constraints for the selection of the sets.The binary decision variable x i is associated to the set S i , in particular: The Set Packing problem can be formulated as it follows: In order to map the constraints of the Set Packing problem into a quadratic penalty we leverage the following conversion rule [26]: where coefficients a ji and a jk are either 0 or 1.The objective function is therefore derived as a ji a jk

A.2.3 Quadratic Knapsack
Given a set of projects P 1 , P 2 , ..., P n such that for each pair (P i , P j ) it exists a joint revenue r ij = r ji , the Quadratic Knapsack problem aims at maximizing the total revenue of activating the projects under a budget The Quadratic Knapsack problem is formulated as it follows: To formulate the Quadratic Knapsack problem in the QUBO formulation, we introduce m binary slack variables t 1 , t 2 , ..., t m .Each slack variable has a budget coefficient c tj which has to be chosen accordingly to the budget b, e. g. by stating that m j c tj = b.With these type of contraints, we rewrite the problem using the formulation given in Equation 2, defined in Section 2.1.The QUBO formulation of the Quadratic Knapsack problem is therefore:

A.2.4 Sudoku
A Sudoku can be considered as a constraint satisfaction problem [51].In our study, we analyze only small instances of Sudoku problems, so we consider Sudoku with a 4×4 grid and some fixed assignments.We call such problems 4×4-Sudoku The binary variable we use is defined as There are four kinds of constraints in a Sudoku:

Block Constraints.
Two cells in the same block must have distinct numbers; (i,j)∈B x (i,j),k = 1 ∀k ∈ {1, 2, 3, 4}, There are totally 64 constraints in a 4 × 4-Sudoku problem.If some cells are already assigned, the number of constraints and the number of variables reduces.
In particular: -If cell (i, j) is assigned, there exist no variable related to this cell; -If cell (i, j) is not assigned and the number of values it can have is m, then there exist m variables related to cell (i, j) and each variable refer to a possible value of the cell; We can map all the constraints into a single square matrix A, which has at most 64 rows and 64 columns.Each column is related to a variable x (i,j),k , while each row refers to a constraint.Assume that x is a vector where each element is a variable x (i,j),k .If 1 is a vector of 64 elements, all equal to 1, the QUBO formulation of the 4 × 4-Sudoku problem is:

A.2.5 Feature Selection
Given a dataset D, the Feature Selection problem aims at finding the subset S ⊂ F = {f 1 , f 2 , ..., f n } of the best k features able to represent the dataset.This is done especially in machine learning, to reduce the number of features of a predictive model therefore its complexity.Assume to have n For each feature f i , we have the binary decision variable x i , defined as follows: We model the Feature Selection problem using Pearson correlation Corr(•, •) [4].For the quadratic terms, we compute the correlation between two different features f i , f j of D. For the linear terms, we compute the correlation between a feature f i and the target t of D. We can compute directly the QUBO matrix Q, where in its (i, j) element of matrix Q is equal to: Suppose we want to select k features.To do so, we consider the following QUBO formulation of the Feature Selection problem: Notice that, if less or more than k are selected, the penalty term is larger than 0 and the objective function value gets worse.

Appendix B Instance Generation Strategies
In this section, we explain the strategies we applied to generate the instances of the optimization problems we selected.The strategies varies whether the problem is defined over a graph or not.An instance of a problem is characterized by its structure, which is either the topology of a graph or a general title which refers to the strategy used to generate the constraints and the objective function, and the number of variables n.
In general, for all the problems we generate instances with a minimum of n min variables to a maximum of n max variables.For every optimization problem, except for the 4 × 4-Sudoku and for the Feature Selection problem, there exist n rep instances having a certain structure and with the same number of variables.Such instances are however different between each other, thanks to graph tweaking (explained in Section B.1) and to the random generation of the coefficient of the cost function and of the constraints.
For the small instances, we have chosen n min = 27, n max = 32 and n rep = 1.For the large instances, we have chosen n min = 69, n max = 99 and n rep = 5.

B.1 Graph Problems
The instance of a problem defined over a graph is totally determined by the topology of the graph.The strategy we applied to generate instances of graph problems is essentially divided in two distinct phases: the first is the generation of the graph G of n nodes, according to a certain topology; the second is the generation of an instance, for every graph problem, defined over G.
We have selected four different graph topologies to generate the graph G: the Star topology, the Cycle topology, the 2d-grid topology and the Erdös-Rényi topology.A key step in the generation of G is the random insertion and removal of a limited number of edges, resulting in the tweaking of graph G.In this way, we generate more graphs starting from a given topology and with n nodes, which however get tweaked in different ways.We ensure always that the tweaked graph is connected.In particular, the graph tweaking process happens according to the following rules: • At most n ins = n 6 edges are inserted; • At most n rem = n 8 edges are removed; • To perform the tweaking, we randomly modify the adjacency matrix A of the original graph.We start the tweaking of A from the first element of the first row, A 11 , and we proceed left-to-right, till the element in the last column and in the last row, A nn .• If A ij = 0 and less than n ins have been inserted, insert the edge (i, j) by setting A ij = 1 with probability p ins = 40%; • On the contrary, if A ij = 1 and less than n rem edges have been removed, remove the edge (i, j) by setting A ij = 0 with probability p rem = 30%; • If the tweaked graph is not connected, repeat the procedure; For the Minimum Vertex Cover and the Maximum Independent Set problems we have to set the value of the penalty term coefficient p.For the Minimum Vertex Cover, we have set p = n, that is to the number of nodes of the graph G.For the Maximum Independent Set, we have set p = 2n.In both cases, we have that an assignment which violates a constraint highly penalize the objective function and that it is not a solution of the instance.

B.2 No-Graph Problems
For what concerns the problems not defined over a graph, to generate an instance we have to generate the matrices related to the objective function and to the constraints.In particular, we want to analyze satisfiable instances.We discuss now the strategies we used to generate the instances of each problem we selected.

B.2.1 Number Partitioning
To generate the instances of the Number Partitioning problem, we have to define the set Z to partition.For simplicity, we generate only sets of integer numbers.We selected three probability distribution to generate the set Z, in addition to a fourth strategy where Z contains all the numbers between 1 and the number of variables n.The three probability distributions are: (i) an uniform distribution between 1 and 99; (ii) a geometric distribution with probability of success p = 0.02; (iii) a Poisson of mean value µ = 50.

B.2.2 Set Packing
An instance of the Set Packing problem is determined by the capacities of the n sets c 1 , c 2 , ..., c n , the coefficients a j,i of the constraints, and the coefficient of the penalty term p.
For what concerns the capacities c i , we sample highest possible capacity coefficient, called c max , from a uniform distribution between 10 and 39.This number represents the highest possible capacity coefficient, called c max .Then, we generate the capacities c 1 , c 2 , ..., c n of the sets by sampling them from to a uniform distribution between 1 and c max .
To generate the coefficients a j,i , we consider a matrix A where the element in the j-th row and in the columns is equal to a j,i .that A has always n columns.
have defined strategies to generate matrix A: • A is rectangular, with n − 1 rows and the elements a i,i and a i,i+1 are equal to 1, for i = 1, 2, ..., n − 1.All the other elements are set to zero.We say that this matrix has a step structure.We provide below, as an example, a matrix of this kind with n = 5: • A is rectangular, with a number of rows m randomly sampled from a uniform distribution between 2 and n 2 .All the columns of A have exactly one element equal to 1.We say that this matrix has disjoint rows structure.Below, we provide an example of a disjoint rows matrix, with n = 5 and m = 3: • A is a square matrix.The elements on the main diagonal are all equal to 1. Furthermore, for each row of A, a random off-diagonal element is set to 1 with probability p = 60%.We say that this matrix has an almost diagonal structure; • A is rectangular, with a number of rows m sampled from a uniform distribution between 10 and n 2 2 .The elements of A are randomly generated between 0 and 1, with equal probability.We say that this matrix has a random structure; Also the penalty term p must be chosen, in order to penalize the assignments which violate the constraints.We use 100 steps Bayesian optimization to choose p, by maximizing the percentage of feasible solutions found by Simulated Annealing in 30 executions.We generate the joint revenue values by generating a matrix.The general joint revenue value r ij is the element in the i-th row and in the j-th column of a revenue matrix R. By definition of the Quadratic Knapsack problem, R is symmetric.We generate R according to the following four strategies: • R is diagonal.The elements on the diagonal are sampled from a uniform distribution of integer numbers between 15 and 39.We say that the structure of R is diagonal; • R has the whole diagonal, with some other few random off-diagonal elements.The elements on the diagonal of R sampled from a uniform distribution of integers numbers between 15 and 39.Then, for every column of R, we set with probability 40% a random off-diagonal r ij with an integer number sampled from a uniform distribution between 1 and 24; when the element is set, we make the matrix symmetric by dividing r ij by 2 and setting r ji = r ij .In the case r ji element is set again, when considering column j, the previous value of r ji and r ij is overwritten.We say that R has an almost diagonal structure.An example of almost diagonal 4 × 4 matrix is: • R is generated randomly, every element is sampled from the uniform distribution of integer numbers between 15 and 39; then, all the elements below the diagonal are ignored and set to 0; finally, the off-diagonal elements are divided by 2 and the matrix is made symmetric, by setting r ij = r ji , for every i, j ∈ {1, 2, ..., n}.We say that R has a random structure.
The coefficients c 1 , c 2 , ..., c n which appear in the constraint are integer numbers sampled from a uniform distribution between 1 and 15.The budget b is set as φ * n i=1 c i , where φ ∈ R is sampled from a uniform distribution between 0.20 and 0.70.We have chosen to introduce four binary slack variables t 1 , t 2 , t 3 , t 4 to transform the inequality constraints into equality constraints.The slack variables are respectively associated to the budget coefficients c t1 , c t2 , c t3 , c t4 .We choose also the values of these coefficients as it follows:

B.2.4 Sudoku
To generate 4 × 4-Sudoku instances, we generated randomly 30 different solved 4 × 4-Sudoku games.We then iteratively removed the value of a cell, chosen randomly, of the 4 × 4-Sudoku game.Each time we remove a cell, we increase the number of variables of the related 4 × 4-Sudoku instance.We ensure that the generated instance has a number of variables between 27 and 32.Notice that it is not possible to build instances larger than 64 variables of the 4 × 4-Sudoku problem.For this reason, we generate only small instances for this kind of problem.

B.2.5 Feature Selection
We generated the instances of the Feature Selection problem starting from the following public datasets10 : waveform-5000, SPECTF, spambase, USPS, isolet, gisette, Bioresponse.An instance of the Feature Selection problem depends on the dataset D, on the number of features to select k and on the target variable t.
Assume we generate m instances with a number of variables n, with n bounded between n min and n max To select the dataset D, we use an iterative procedure.We start from the dataset waveform-5000 and we check if it has more than n = n min non-target features: if this condition is true, we select this dataset and we reduce its number of features to n, by deleting randomly chosen features; otherwise, we go to the the next dataset, chosen according to the order we used to list them, and we repeat this check.After the last dataset, Bioresponse, we repeat staring from waveform-5000.After that m features are selected, we continue the procedure by incrementing n.The procedure goes on until we generate m instances with n = n max variables.
Assume that the dataset D is the i-th dataset selected to generate an instance of n variables, in the iterative procedure, the number of features to select is computed as follows: For what concerns the target variable t, every dataset we selected has one or multiple target variables.Since this formulation can tackle only one target variable t, in case of multiple target variables we randomly select one of them and delete all the others.
For small instances, we considered n min = 27, n max = 32 and m = 4, for a total of 24 instances.The number of features k we select is bounded between 5 and 9.For large instances, we considered n min = 69, n max = 99 and m = 5, for a total of 155 instances.The number of features we select is bounded between 13 and 23.
If two firms F i , F j merge into a larger new firm F z , we have that S(F z ) = S(F i ) + S(F j ).The HHI computed on the new industry, which include F z , increases.

Shannon entropy
Given a random variable Y , distributed according to a certain distribution p Y (y).The Shannon entropy of Y measures the level of uncertainty in the distribution p Y (y).In the discrete and finite case, assuming that the possible values of Y are y 1 , y 2 , ..., y m , Shannon entropy is defined as: In general, Shannon entropy can be than 1.The maximum value Shannon entropy occurs when Y is distributed according to a uniform distribution, where all the values of Y are equally probable and none of them can be predicted more easily than the others.

Condition Number
Given a matrix M , the condition number CN (M ) measures how close is matrix M to be singular.If M is symmetric and real, the CN (M ) is computed according to the maximal and minimal eigenvalues of M , respectively λ max and λ min :

Radius of a Graph
Given a graph G(V, E), described by the adjacency matrix A. The radius of graph G is the largest eigenvalue of A in absolute value.

Diameter of a Graph
Given a graph G(V, E), the shortest path between two nodes i, j is the sequence of edges having minimal cost which connects i and j.The diameter of G is the length of the longest shortest path.If G is not connected, the diameter is not defined.In the case of negative weights on the edges that form a cycle, the diameter is not defined too.

Spectral Gap of a Graph
Given a graph G(V, E), described by the adjacency matrix A, the spectral gap of G the difference, in modulus, between the the two largest eigenvalues of A. Another definition we use is related to the Laplacian L of the graph.In this case, the spectral gap is the smallest non-zero eigenvalue of the Laplacian L related to the graph.

Connectivity of Graph
Given a graph G(V, E), described by a positive semi-definite Laplacian matrix L, the connectivity is equal to the second smallest eigenvalue of L.

( a )Fig. 1 :
Fig.1: Embedding of a simple problem with six variables on a portion of a D-Wave Quantum Annealer using the Chimera topology.Each node represents a qubit and each edge a physical connection between them.Nodes of the same colour indicates the chain of qubit used to represent a single problem variable.Note how, while the original problem had six variables, the embedded one requires 14 qubits.

Fig. 2 :
Fig.2: Bar plot showing the Balanced Accuracy of the meta-models which predict whether QA is at least as good of all the classical solvers combined (QA-over-all) on the large instances.The domains or component sets the model is trained on are listed on the x-axis.Domains and component sets are ordered according to the Balanced Accuracy of the best related meta-model, in descending order.The vertical black segments on the top of each bar represent the standard deviation of the meta-models.

Fig. 3 :
Fig.3: Bar plot showing the Balanced Accuracy of the meta-models which predict whether QA will find the global optimum (Optimal) on small problem instances.The domains or components the model is trained on are listed on the x-axis.Domains and component sets are ordered according to the Balanced Accuracy of the best related meta-model, in descending ordered.The vertical black segments on the top of each bar represent the standard deviation of the meta-models.

•
The on-diagonal elements of R are non-zero; also the elements immediately on the right and on the left of on-diagonal elements are non-zero; the on-diagonal elements are integer numbers sampled from a uniform distribution between 15 and 39; the off-diagonal elements immediately to the right of on-diagonal elements are sampled from the same distribution; then, the off-diagonal elements are divided by 2 and the matrix is made symmetric, by setting r ij = r ji , for every i, j ∈ {1, 2, ..., n}.We say that R has an enlarged diagonal structure.An example of a 4 × 4 enlarged diagonal is: steps Bayesian Search to choose p, by maximizing the percentage of feasible solutions by Simulated Annealing in 30 executions.The range of the values of p depends on the sum r all = n i=1 n j=1 |r ij |.In particular: 1 ≤ p ≤ r all

Table 1 :
Each row of the table gives the number of small and large instances related to each problem class.
This domain uses the eigenvalues, or energy values, of all possible variable assignments, which can be computed only for the small instances, and aims to describe how they are distributed.Other domains we identify are: (i) Normalized Multiplicity (NorMul), whose features are related to the multiplicity of the eigenvalues of H p ; (ii) Matrix Structure (Mat-Struct), which contains features related to the distribution of the values of the matrix Q of a QUBO problem; (iii) 25%-SolSpace and 25%-NorMul, which contain the same features of SolSpace and NorMul, but computed by considering only the 25% lowest eigenvalues of H p , i.e., the energies of the 25% best solutions.
• Logical Ising Graph (LogIsing): This domain uses the Ising formulation of a QUBO problem.It is represented as a graph having one node per problem variable, associated to the corresponding bias b, and an adjacency matrix that corresponds to the coupling matrix J; • Embedded Ising Graph (EmbIsing): This domain uses the Ising formulation of a QUBO problem obtained after its minor embedding on the QA.The target architecture is D-Wave Advantage with the Pegasus topology.Therefore, this formulation represents the actual problem solved by the Quantum Annealer, in which multiple qubits may be used to represent one problem variable.This formulation is represented as a graph in the same way as LogIsing; • Solution Space (SolSpace):

Table 2 :
Optimal Hyperparameters for QA for each problem class.
Table 3 reports the results on each problem class according to the labels we defined in Section 3.2, i.e., whether the best

Table 4 :
Fraction of the instances that are solved well according to a certain effectiveness condition (see Section 3.2).The most effective solver is highlighted in bold.

Table 5 :
Fraction of the instances in which the solvers are able to find the global optimum (i.e., Optimal).The most effective solver of each problem class is highlighted in bold.

Table 6 :
Fraction of the instances in which the solvers are able to find a variable assignment having a Hamming distance of at most 1 with respect to any optimal solution (i.e., h-Optimal).The most effective solver of each problem class is highlighted in bold.

Table 7 :
Best five features, ordered according to feature importance, of AdaBoost and XGBoost meta-models trained with LogIsing and EmbIsing domains and with Bias and Coupling component sets.The target of the meta-models is QA-over-all.

Table 8 :
Best five features, ordered according to feature importance, of AdaBoost and XGBoost meta-models trained on given domains and component sets.The target of the meta-models is QA-Optimal.
The range of values of p is determined by the sum c all = In the Quadratic Knapsack problem, to generate the instances we have to choose: (i) the values of the joint revenue r ij ; (ii) the values of the coefficients c 1 , c 2 , ..., c n of the constraint; (iii) the budget b; (iv) the number m of slack variables t i to introduce and their constraints coefficients c ti ; (v) the value of the penalty term coefficient p.

Table F3 :
Table containing, for each considered problems, the fraction of 10 −5 -optimally solved instances for all the solvers.The values in bold text refer to the most effective solvers for a particular problem