Estimating the Density of States of Frustrated Spin Systems

Estimating the density of states of systems with rugged free energy landscapes is a notoriously difficult task of the utmost importance in many areas of physics, ranging from spin glasses to biopolymers to quantum computing. Some of the standard approaches suffer from a spurious convergence of the estimates to metastable minima, and these cases are particularly hard to detect. Here, we introduce a sampling technique based on population annealing enhanced with a multi-histogram analysis and report on its performance for spin glasses. We demonstrate its ability to overcome the pitfalls of other entropic samplers, resulting in some cases in scaling advantages of several orders of magnitude that can lead to the uncovering of new physics. To this end, we devise several schemes that provide exact degeneracy counts for the tested instances.


I. INTRODUCTION
In statistical and condensed matter physics, the density of states (DOS) of a system describes the number of states at each energy level. The DOS, which is independent of temperature, represents a deep characterization of the system. In terms of thermodynamics, knowledge of the DOS allows one to calculate the partition function and hence all expectation values that can be derived from it, including the free and internal energies as well as the specific heat, as a function of temperature [1]. Alternatively, the DOS itself and closely related quantities are the center of interest in an analysis of the thermodynamics of phase transitions in the microcanonical ensemble [2]. DOS calculations can also determine the spacing between energy bands in semiconductors [3].
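As an illustration of the first point, once Ω(E) is known, the partition function Z(β) = Σ_E Ω(E) e^{−βE} and the canonical moments of the energy follow by direct summation. A minimal Python sketch (the function name and the toy DOS of a periodic chain of four Ising spins are our own illustrative choices):

```python
import math

def thermodynamics(dos, beta):
    """Given the DOS as a dict {E: Omega(E)}, return the partition function Z,
    the internal energy U and the specific heat C at inverse temperature beta."""
    Z = sum(g * math.exp(-beta * E) for E, g in dos.items())
    U = sum(E * g * math.exp(-beta * E) for E, g in dos.items()) / Z
    E2 = sum(E * E * g * math.exp(-beta * E) for E, g in dos.items()) / Z
    return Z, U, beta ** 2 * (E2 - U ** 2)   # C = beta^2 * Var(E)

# Exact DOS of a periodic chain of 4 Ising spins: Omega(-4)=2, Omega(0)=12, Omega(4)=2.
Z, U, C = thermodynamics({-4: 2, 0: 12, 4: 2}, beta=1.0)
assert abs(Z - (2 * math.exp(4) + 12 + 2 * math.exp(-4))) < 1e-9
```

Replacing the exact toy DOS by an estimated Ω̂(E) turns the output of any entropic sampler into temperature-dependent observables in a single pass.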
Whilst knowledge of the DOS, Ω(E), is extremely valuable, it cannot in general be easily acquired. Exact calculations are only possible in a few special cases such as Ising models on two-dimensional lattices [4,5]. In general, the problem is exponentially hard as the system size increases (in computational complexity terms it is a #P-hard problem) [6,7]. This difficulty notwithstanding, there exist a number of approximation techniques, mostly based on Monte Carlo methods, that allow one to estimate Ω(E). The most widely used approach of this type is the Wang-Landau (WL) algorithm [8,9] and its variants [10][11][12], which is based on the multicanonical method [13].
As stochastic approximation techniques, these approaches are affected by statistical errors as well as systematic deviations (bias). Statistical errors cannot easily be estimated from a single WL simulation alone and normally require statistics over independent runs. The most relevant form of bias in WL simulations is that of a false convergence, where the DOS estimate settles down on a deceptively smooth shape that is however not a faithful representation of the actual DOS [14]. Naturally, such problems are notoriously hard to detect if no independent information about the actual DOS is available.
Such difficulties apply in particular to systems with complex free-energy landscapes that are typically accompanied by frustration in the interactions such as in the protein-folding problem [15] and in the spin-glass systems that result from a combination of frustration and quenched disorder [16]. The latter may be viewed as prototypical classically-hard optimization problems, and they are so challenging that specialized hardware has been built to simulate them [17]. Recently, DOS estimation has become the focus of much attention in the context of a search for potential computational speedups from quantum versus classical devices in the specific form of adiabatic quantum computers [18]. The currently available commercial realization of this paradigm effectively samples low-lying energy configurations of spin-glass samples that are coded into the couplers connecting an array of superconducting flux qubits. The properties of the resulting samples and the question of whether such devices indeed provide superior performance as compared to classical algorithms for some problem classes have been the subject of much recent debate [19][20][21][22][23][24][25][26][27][28]. The questions of reliable Boltzmann sampling aided by quantum annealers for machine learning applications [22][23][24] as well as the advantages that quantum annealing devices potentially hold when tasked with fairly sampling the ground-state manifolds of spin glasses with multiple minimizing configurations [25][26][27] are now a topic of considerable interest.
Our goal in this work is threefold. i) We devise techniques for verifiably benchmarking algorithms for sampling the density of states, designed to overcome the pitfalls of misinterpreting false convergences of entropic samplers. ii) Employing the above techniques, we demonstrate the difficulties in applying traditional algorithms for sampling the density of states to spin-glass instances. iii) We introduce a population annealing algorithm for estimating the DOS that allows for the intrinsic control of statistical and systematic errors and demonstrate how it can outperform the standard approach by orders of magnitude.
arXiv:1808.04340v1 [cond-mat.stat-mech] 13 Aug 2018

II. VERIFIABLE BENCHMARKING OF ENTROPIC SAMPLERS
As mentioned above, approximation algorithms for the density of states are not always reliable in converging to the correct answer, especially for the frustrated systems considered here. While there are some general results regarding the convergence of suitably modified WL type algorithms [11], convergence times can become astronomically large and convergence hard to assess from intrinsic indicators [29]. As we shall see below, for disordered systems they can also fluctuate wildly between different realizations of the couplings. It is hence highly desirable for benchmarking purposes to have at hand sets of samples that are sufficiently challenging for the tested algorithms, but for which nevertheless the (exact) density of states is known from other considerations. In general, such samples are not readily available, but we present here two groups of examples that are extremely useful in this respect: locally planar lattices and samples with planted solutions.
For concreteness, we shall consider spin models of the Ising type, whose cost function (or Hamiltonian) is of the form

H = −Σ_{⟨i,j⟩} J_ij s_i s_j − Σ_i h_i s_i,   (1)

where ⟨i, j⟩ denotes the set of edges of the underlying graph. Here, s_i = ±1 are the Ising spin variables and the quenched parameters J_ij and h_i denote the exchange couplings and random fields, respectively. In the following, we will focus on the zero-field case h_i = 0. While the problems of computing the density of states (or partition function) and of finding a ground state for this problem in at least three dimensions are NP-hard [30], both problems can be solved in polynomial time on any set of graphs of a genus bounded by a constant, which includes planar and toroidal two-dimensional lattices [5]. For such cases there exist efficient algorithms to solve the above problems. For ground states, these include minimum-weight perfect matching [31][32][33], while for the partition function or density of states the usual approaches are based on the counting of dimer coverings, which can be achieved via an evaluation of Pfaffians [5,34,35]. Unfortunately, such techniques are restricted to locally planar graphs and so they do not apply, for example, to the Chimera graphs used in current implementations of quantum annealers, which have a genus that grows linearly in the number of sites. For such more general problems we propose here an approach that is based on generating problem instances for which the values and degeneracies of the ground and first excited states, Ω(E_0) and Ω(E_1), are exactly computable. Since the states of lowest energies are usually the most difficult to sample, the degeneracies of these two energy levels are the most difficult to ascertain, and a correct estimation of these serves as a good indicator of true convergence.
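A minimal sketch of evaluating this zero-field cost function in Python (the function name and the frustrated antiferromagnetic triangle are our own illustrative choices):

```python
def ising_energy(spins, edges, couplings):
    """Zero-field energy H = -sum_{(i,j)} J_ij s_i s_j of Eq. (1)."""
    return -sum(J * spins[i] * spins[j]
                for (i, j), J in zip(edges, couplings))

# A frustrated triangle: three antiferromagnetic bonds that cannot all be
# satisfied simultaneously, so the ground state leaves one bond violated.
edges = [(0, 1), (1, 2), (2, 0)]
couplings = [-1, -1, -1]
assert ising_energy([1, 1, 1], edges, couplings) == 3    # all three bonds violated
assert ising_energy([1, 1, -1], edges, couplings) == -1  # one of six ground states
```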
We create such samples by considering problem instances with planted solutions [19,36,37], an idea borrowed from constraint satisfaction (SAT) problems [38,39], where the planted solution represents a ground-state configuration of Eq. (1) that minimizes the energy and is known in advance. Following Ref. [19], the Hamiltonian of a planted-solution spin glass is a sum of terms, each of which consists of a small number of connected spins, namely, H = Σ_j H_j. Each term H_j is chosen such that one of its ground states is the planted solution. It follows then that the planted solution is also a ground state of the total Hamiltonian. This class of instances has two attractive properties: i) the ground-state energies of the generated problems are known in advance, and ii) the exact degeneracies of the ground and first excited states are computable [25,28]. These in turn allow us to check how close entropic samplers come to these exact values. The computation of Ω(E_0) and Ω(E_1) is based on the fact that our generated instances consist of a sum of terms, each of which has all minimizing configurations (of H) among its ground states. To enumerate all ground states, we implement a form of the 'bucket' algorithm [40] designed to eliminate variables one at a time to perform an exhaustive search efficiently (for a detailed description of the algorithm, see the Supplementary Information of Ref. [25]). By noting that the lowest-energy excited states are those configurations that violate precisely one clause, their degeneracy may also be calculated. We perform the same exhaustive search as above, but where now the configurations tested correspond to first excited states of one of the H_j (and are still minimizing for all H_i with i ≠ j). This gives the number of configurations which are ground states of every H_i with i ≠ j and first excited states of H_j; by performing this calculation for each of the H_j, we get the total number of first excited states of H.
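The bucket algorithm proper eliminates variables one at a time (see Ref. [25]); for instances small enough, a plain exhaustive enumeration already illustrates the quantities being computed, namely the two lowest levels and their degeneracies. A brute-force sketch on our own toy instance (not a planted Chimera sample):

```python
from itertools import product

def two_lowest_levels(n, edges, couplings):
    """Brute-force count of the two lowest energy levels and their
    degeneracies for a small zero-field Ising instance."""
    counts = {}
    for spins in product([-1, 1], repeat=n):
        E = -sum(J * spins[i] * spins[j] for (i, j), J in zip(edges, couplings))
        counts[E] = counts.get(E, 0) + 1
    E0, E1 = sorted(counts)[:2]
    return (E0, counts[E0]), (E1, counts[E1])

# Frustrated antiferromagnetic triangle: Omega(E0)=6 at E0=-1, Omega(E1)=2 at E1=3.
edges = [(0, 1), (1, 2), (2, 0)]
(e0, g0), (e1, g1) = two_lowest_levels(3, edges, [-1, -1, -1])
```

The exhaustive loop costs 2^n and is only feasible for toy sizes; the point of the bucket algorithm is to obtain the same two counts without enumerating all configurations.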
While this approach in principle works for arbitrary graphs, we focus here on Chimera lattices, i.e., two-dimensional arrays of unit cells of eight spins with a K_{4,4} bipartite connectivity [41], see for example Ref. [28]. Our choice is motivated by the attention these graphs have gained in recent years in the context of optimization as well as sampling via quantum annealing, as the quantum annealers currently commercially available feature qubits connected in a Chimera graph [42][43][44]. While the Chimera graph is two-dimensional in nature [45], it is also non-planar and as such gives rise to difficult spin-glass problems [46]. We generated 592 planted-solution instances of 501 spins each [47], following a technique described in detail in Ref. [19] wherein the clauses H_j are chosen to be 'frustrated loops' along the Chimera graph. For each sample we employed the bucket algorithm in order to obtain Ω(E_0) and Ω(E_1).
The combination of full exact density of states for samples on the square lattice and toroidal boundary conditions and of exact values for Ω(E 0 ) and Ω(E 1 ) for the Chimera samples allows us to carefully examine the reliability and performance of sampling schemes for estimating the density of states, avoiding the pitfalls provided by badly converged estimates of stochastic approximation schemes.

III. SAMPLING THE DENSITY OF STATES
The common approximation algorithms for the density of states are based on Markov chain Monte Carlo [8,13]. In the following, we will use the most popular of these, the Wang-Landau algorithm [8,9], in a variant dubbed the WL-1/t method [10], which in principle can be shown to converge to the correct answer given infinite run time [11], as a reference and competitor for the method introduced here, entropic population annealing.

A. Wang-Landau sampling
In Wang-Landau (WL) sampling as introduced in Refs. [8,9], a running estimate Ω̂(E) of the density of states (initialized as Ω̂(E) = 1 ∀E) is updated in a random walk through energy space by multiplying the estimate at the current energy E by a modification factor f (initially chosen to equal Euler's number e) at each step. A new configuration of energy E′ is proposed according to the chosen move scheme (for a spin model typically through single spin flips) and accepted with probability

p(E → E′) = min[1, Ω̂(E)/Ω̂(E′)].   (2)

If, after some time, the histogram Ĥ(E) of all possible energies is found to be "sufficiently flat" (typically interpreted as no histogram bin having less than 80% of the average number of entries [8], but see also Ref. [48] for a related discussion), the modification factor is reduced as f → √f, and the histogram Ĥ(E) is reset to an empty state. The algorithm stops if f is "sufficiently small", for example f = f_final = exp(10^−8). While the approach was invented as a variant of Markov chain Monte Carlo, the fact that the transition probabilities according to Eq. (2) change constantly means that (detailed or global) balance does not hold and it is more useful to think of the method as a "stochastic approximation algorithm" [11].
It is well known that the original scheme of Refs. [8,9] does not converge to the true DOS; rather, the error asymptotically saturates at a value determined by the protocol used for reducing f [49]. This shortcoming is remedied by choosing a different modification protocol for f, leading to a slower decay of f at late times. The so-called 1/t algorithm proposed in Ref. [10] uses two phases. In the first phase the standard WL algorithm is used, with the only difference that the energy histograms are considered to be sufficiently flat already if Ĥ(E) > 0 for all E. Once ln f falls below the moving threshold N_E/t, where t is the simulation time measured in spin-flip attempts and N_E is the number of energy levels, the simulation enters the second phase. There, the modification factor is adapted quasi-continuously according to ln f(t) = N_E/t and histogram flatness is no longer tested. The simulation stops once f(t) ≤ f_final.
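For orientation, the original flatness-based WL scheme (not the 1/t variant) can be sketched in a few dozen lines of Python; the function name and the toy system, a periodic ring of four Ising spins with exact degeneracies 2, 12, 2 at E = −4, 0, 4, are our own illustrative choices:

```python
import math
import random

def wang_landau(n, e_levels, energy_fn, ln_f_final=1e-6, flat=0.8, seed=1):
    """Original flatness-based Wang-Landau sketch: a random walk in energy
    space, updating ln g(E) by ln f and halving ln f whenever the energy
    histogram is sufficiently flat."""
    rng = random.Random(seed)
    ln_g = {E: 0.0 for E in e_levels}    # running estimate of ln Omega(E)
    hist = {E: 0 for E in e_levels}
    spins = [rng.choice([-1, 1]) for _ in range(n)]
    E = energy_fn(spins)
    ln_f = 1.0                           # initial modification factor f = e
    while ln_f > ln_f_final:
        for _ in range(10000):
            i = rng.randrange(n)
            spins[i] = -spins[i]         # propose a single spin flip
            E_new = energy_fn(spins)
            # accept with probability min(1, g(E)/g(E_new)), cf. Eq. (2)
            if ln_g[E_new] <= ln_g[E] or rng.random() < math.exp(ln_g[E] - ln_g[E_new]):
                E = E_new
            else:
                spins[i] = -spins[i]     # reject: undo the flip
            ln_g[E] += ln_f
            hist[E] += 1
        if min(hist.values()) > flat * sum(hist.values()) / len(hist):
            ln_f /= 2.0                  # f -> sqrt(f)
            hist = {E: 0 for E in e_levels}
    return ln_g

# Periodic ring of 4 Ising spins; exact ratio Omega(0)/Omega(-4) = 12/2 = 6.
ring = lambda s: -sum(s[i] * s[(i + 1) % 4] for i in range(4))
ln_g = wang_landau(4, [-4, 0, 4], ring)
ratio = math.exp(ln_g[0] - ln_g[-4])   # should come out close to 6
```

The 1/t variant differs only in the bookkeeping: phase 1 accepts any histogram with all levels visited as "flat", and phase 2 replaces the halving of ln f by the schedule ln f(t) = N_E/t.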
While no saturation of error occurs in the 1/t algorithm [10,14], it is still necessary to know the permissible energy levels (including the ground state) beforehand to judge histogram flatness, which is a major drawback of the method for disordered systems. For large systems and problems with complex free-energy landscapes, it is usually necessary to divide the total energy range into several windows for which the algorithm is run separately to achieve convergence on realistic time scales for interesting system sizes [9]. The right choice of window sizes in such schemes is a difficult problem especially for disordered and frustrated systems [50]. A number of further generalizations of the method have been proposed, for instance a combination with parallel tempering [12] which uses progressively smaller windows at lower energies, but here again there is no general algorithm for determining the appropriate window sizes automatically.

B. Entropic population annealing
The new algorithm introduced here, which we call entropic population annealing (EPA), is not based on Markov chains but on the sequential Monte Carlo method. Population annealing (PA) was first studied in Refs. [51,52] and more recently developed further in Refs. [53][54][55][56][57][58][59]. It is based on the initialization of a population of replicas drawn from the equilibrium distribution at high temperatures, which is then subsequently cooled to lower and lower temperatures. During this process, a combination of population control and spin flips is used to ensure that the ensemble remains in equilibrium. The simulation entails the following steps [53,56]:
1. Set up an equilibrium ensemble of R_0 = R independent copies (replicas) of the system at inverse temperature β_0.
2. Take a step to inverse temperature β_i > β_{i−1} by resampling the configurations j = 1, ..., R_{i−1} with their relative Boltzmann weight τ̂_i(E_j), leading to a new population of R_i replicas (in general R_i ≠ R_{i−1}).
3. Update each replica by θ rounds of an MCMC algorithm at inverse temperature β_i.
4. Go to step 2 unless the inverse target temperature β_f has been reached.
During resampling, the expected number of copies is

τ̂_i(E_j) = exp[−(β_i − β_{i−1}) E_j] / Q_i,   (3)

with a normalizing factor

Q_i = (1/R_{i−1}) Σ_{j=1}^{R_{i−1}} exp[−(β_i − β_{i−1}) E_j].   (4)

The actual number of copies is taken to be the integer part ⌊τ̂_i(E_j)⌋ plus an additional copy added with a probability corresponding to the fractional part, τ̂_i(E_j) − ⌊τ̂_i(E_j)⌋. While initially constant steps in the inverse temperature β_i were used [53], it turns out that a better, parameter-free method consists of choosing β_i to ensure a certain overlap of the energy distributions between the two temperatures [56]. This overlap can be computed from the resampling factors, and β_i is adapted using a bisection search so as to ensure an overlap α* of energy histograms. The method is not very sensitive to the precise value of α*, and we choose α* = 0.86 in the runs below. While the algorithm described above is just population annealing [53] improved by adaptive temperature steps [56,58], the possibility of sampling the entropy arises from a combination of the method with multi-histogram techniques [60]. An estimator of the free energy follows directly from the resampling factors,

−β_i F̂(β_i) = ln Z_{β_0} + Σ_{k=1}^{i} ln Q_k,   (5)

where Z_{β_0} is the partition function at the initial temperature β_0. In the following, we always choose β_0 = 0, such that simply Z_{β_0} = 2^N, where N is the number of spins. We can then estimate the density of states by combining the histograms at all temperature steps, and a variance-optimized estimator is given by [56,60]

Ω̂(E) = Σ_{i=1}^{N_β} Ĥ_{β_i}(E) / Σ_{i=1}^{N_β} R_i exp[β_i F̂(β_i) − β_i E].   (6)

Here, N_β is the total number of temperatures, and the energy histogram Ĥ_{β_i}(E) at inverse temperature β_i is normalized such that Σ_E Ĥ_{β_i}(E) = R_i. The approach is naturally suited for (moderately or massively) parallel calculations as the R replicas are simulated independently of each other and the only interaction occurs during resampling. An efficient GPU implementation was discussed in Ref. [56]. Importantly for our application, EPA does not require any prior knowledge of the range of realized energies.
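Steps 1-4 above, together with the multi-histogram combination of the temperature-wise energy histograms, can be condensed into a short toy implementation. The sketch below uses a fixed inverse-temperature schedule instead of the adaptive one, a periodic ring of four Ising spins, and our own variable names; it is meant to show the bookkeeping, not to be an efficient implementation:

```python
import math
import random
from collections import Counter

random.seed(3)
N, R0, THETA = 4, 2000, 5
# Fixed inverse-temperature schedule (a hand-picked list replaces the
# adaptive overlap-based choice for brevity).
BETAS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.3, 1.6, 2.0]

energy = lambda s: -sum(s[i] * s[(i + 1) % N] for i in range(N))

# Step 1: equilibrium ensemble at beta_0 = 0, i.e. uniformly random replicas.
pop = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(R0)]
ln_Z = math.log(2 ** N)          # Z at beta_0 = 0 is known exactly
hists, sizes, ln_Zs = [], [], []

for b_prev, b in zip(BETAS, BETAS[1:]):
    # Step 2: resample with relative Boltzmann weights exp(-(b - b_prev) E_j).
    w = [math.exp(-(b - b_prev) * energy(s)) for s in pop]
    Q = sum(w) / len(pop)        # normalizing factor of the resampling step
    ln_Z += math.log(Q)          # free-energy estimator: ln Z accumulates ln Q
    new_pop = []
    for s, wj in zip(pop, w):
        tau = wj / Q             # expected number of copies of this replica
        n_copies = int(tau) + (random.random() < tau - int(tau))
        new_pop.extend([s[:] for _ in range(n_copies)])
    pop = new_pop
    # Step 3: THETA sweeps of Metropolis updates at inverse temperature b.
    for s in pop:
        for _ in range(THETA * N):
            i = random.randrange(N)
            dE = 2 * s[i] * (s[(i - 1) % N] + s[(i + 1) % N])
            if dE <= 0 or random.random() < math.exp(-b * dE):
                s[i] = -s[i]
    hists.append(Counter(energy(s) for s in pop))
    sizes.append(len(pop))
    ln_Zs.append(ln_Z)

# Multi-histogram estimator: Omega(E) ~ sum_i H_i(E) / sum_i R_i e^{-b_i E}/Z_i.
def dos_estimate(E):
    num = sum(h[E] for h in hists)
    den = sum(R * math.exp(-b * E - lnz)
              for b, R, lnz in zip(BETAS[1:], sizes, ln_Zs))
    return num / den

est = dos_estimate(-4)   # should come out close to the exact degeneracy, 2
```

Note that the free-energy accumulation costs nothing extra: the ln Q terms are by-products of the resampling weights, which is what makes the DOS estimate essentially free on top of a standard PA run.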
A detailed analysis of systematic and statistical errors of PA can be found in Ref. [61]. As we shall see below, EPA performs significantly better at estimating Ω(E) for hard spin-glass samples than the WL-1/t algorithm.

IV. RESULTS
In order to test the efficiency of EPA against the WL-1/t algorithm in an objective way that is unaffected by problems of false convergence, we applied the two algorithms to the planted solutions on Chimera graphs as well as to the ±J model on the square lattice with periodic boundaries, for both of which we have exact results. As a baseline for the comparison, we applied both methods to the case of the Ising ferromagnet on a square lattice, for which extensive exact results are available. There, we find similar performance of the two methods, cf. the discussion in App. A.

A. Ising spin glasses on Chimera graphs
We then considered the Chimera spin-glass instances with planted solutions and N = 501 spins. To compare the two algorithms on an equal footing, one could simply allow both the same runtime. This measure, however, is implementation and platform specific. Since both algorithms spend most of their time flipping spins, we instead compare them for simulations employing the same number, 2 × 10^12, of spin-flip attempts. For the (serial) code used for WL-1/t this corresponded to a wall-clock time of 37 h (on an Intel Xeon 2.4 GHz CPU), while for EPA we used a massively parallel GPU program [56] that took approximately 1.5 h (on an Nvidia Tesla K40 GPU) per realization.
As mentioned above, for WL it is required to know the allowed energy levels to decide about the flatness of histograms. This knowledge was here acquired by a pre-run of the WL type employing 2 × 10^11 spin-flip attempts and with a fixed modification factor ln f = 1 to explore the energy landscape (the corresponding run-time is included in the time estimate given above). This knowledge is not required for the EPA runs. Without any restriction on the range of allowed energies, WL-1/t did not complete phase 1 for the vast majority of samples. We therefore reduced the energy range to [E_0, E_0 + 1200] in dimensionless units, which corresponds to about a third of the average range between the ground state and E = 0. The simulation was started in a random configuration within that energy range. With that restriction, 594 out of 625 samples completed phase 1 within 2 × 10^12 flip attempts. No range restriction was required for EPA, and we used a population of size R = 3,992,000 with θ = 10 rounds of spin flips per resampling step and a histogram overlap α* = 0.86, resulting in typically 100 temperature steps [62].
We note that both EPA and WL-1/t intrinsically estimate entropy differences, i.e., ratios of degeneracies for neighboring energy values, and the absolute scale is only achieved through an additional normalization such as that given by Z_{β_0} in Eq. (5). It is therefore reasonable to study their performance in estimating the ratio of degeneracies of first excited and ground states, r_{10} = Ω(E_1)/Ω(E_0). In Fig. 1 we show the relative deviations of the ratios r_{10} from the exact values known through the planting, as estimated from WL-1/t and EPA. WL-1/t found the correct ground-state energy for 622 of the 625 samples. For some samples the relative deviations are so large that they exceed the scale of the plot of Fig. 1, some by many orders of magnitude [63]. These samples are shown at the boundary of the box and in a different color. It is clear that for most samples the deviations are substantially smaller for EPA than for WL-1/t. In total, EPA outperforms WL-1/t in 89% of the instances. The error of WL-1/t is larger than 10% for 39% of the samples, and it is difficult to distinguish between the accurate and inaccurate WL results. In contrast, the EPA results are accurate to within 7% for all of the 625 samples.

B. Ising spin glasses on toroidal graphs
For planar or otherwise two-dimensional lattices of a fixed genus, a counting of dimer coverings and the corresponding evaluation of Pfaffians can be used to determine the full density of states in polynomial time [5,34,35]. We study toroidal graphs, i.e., L × L patches of the square lattice with periodic boundary conditions, using the implementation proposed in Ref. [5], which has an asymptotic run-time scaling of O(L^5). Using this approach, we studied 1000 samples with a standard ±J coupling distribution and 32 × 32 spins. Since we have the full DOS, a more challenging benchmark for the sampling techniques involves a comparison of the results for all energy levels. We use the simulation algorithms based on Wang-Landau sampling and population annealing to study the total deviation ∆ of the simulation results from the exact density of states, where

∆ = (1/N_E) Σ_E |ln Ω̂(E) − ln Ω(E)| / ln Ω(E),   (7)

with the sum running over all N_E energy levels. While for PA an absolute normalization of Ω̂(E) [64] results from the free-energy estimator, Eq. (5), in combination with Eq. (6), WL-1/t as described above only yields the DOS up to an overall factor. To fix the latter we use the fact that the sum of Ω(E) over all energy levels must equal the total number 2^N of states. Note that different ways of normalizing WL-1/t lead to quite different fluctuations of the final DOS estimates, and the normalization via the total number of states used here leads to the best results, see the discussion in App. B. Since EPA only samples states with energies E ≤ 0, we restricted the energy range for WL-1/t to E ≤ 50 to ensure a fair comparison [65]. For WL-1/t we used 2 × 10^12 spin-flip attempts in the main run. Just as for the Chimera samples, a pre-run was required to determine the range of possible energies, for which an additional 2 × 10^11 updates were used. For the EPA algorithm, we used a population size R = 2,340,000, and performed θ = 19 rounds of spin flips between two resampling steps.
The required histogram overlap of α* = 0.55 resulted in N_β = 44 temperature steps for most disorder realizations. The total number of spin flips in these EPA runs is hence also approximately 2 × 10^12.
The resulting values for the average relative deviation of the level entropies according to Eq. (7) are shown in the scatter plot of Fig. 2. We considered 1000 samples in total. For WL-1/t, 950 of these found all energies of the spectrum, but only 911 also completed phase 1 of the algorithm. For the EPA simulations no such problems occurred, and it is clear that EPA performs substantially better than WL-1/t, the deviation ∆ being smaller for EPA for 914 of the 1000 samples, in some cases by several orders of magnitude.

C. Entropic sampling of problems of varying hardness
Having ascertained that the EPA method yields significantly better approximations to the DOS of some hard problems in a given number of steps than the WL approach, it is interesting to analyze more closely the actual distributions of performances of these algorithms over the space of disorder realizations. It is already clear from Figs. 1 and 2 that the quality of approximation is much more heterogeneous for WL-1/t with large fluctuations in the accuracy than for EPA with relatively little scatter in the accuracy for different disorder realizations. To study this effect quantitatively, we used a set of disorder samples classified according to their thermal hardness. A well established measure of such hardness is the exponential autocorrelation or relaxation time in parallel tempering simulations [66,67]. As very long simulations are required to determine these time scales accurately, a number of proxy quantities such as the so-called "tunneling time" are frequently used in practical applications [68,69]. Here, we rely on a method developed in the context of spin-glass simulations that analyses the dynamics of the random walk of replicas in temperature space [70] and extracts the corresponding relaxation times τ .
To benchmark the EPA algorithm against WL-1/t, we generated about 10^6 random instances on an N = 512-spin Chimera graph (of which only 476 spins were used) and measured the relaxation time τ of each instance with parallel tempering [71]. Next, we grouped together instances with similar classical hardness, i.e., similar relaxation times τ, with 10^k ≤ τ ≤ 3 × 10^k for k = 3, 4, 5, 6 and 7. For each such 'generation' of τ, we randomly picked 100 representative instances for the benchmarking of the algorithms (only 14 instances with k = 7 were found). We then performed WL-1/t simulations with a total of 10^12 spin-flip attempts for all samples, evenly split between one simulation each restricted to the energy windows (−900, −500) and (−550, 50), respectively (the ground-state energy for these samples is roughly E_0 ≈ −800). A pre-run of 2 × 10^11 spin-flip attempts was again used to discover the range of possible energies for each sample. The main runs in the first energy window were started in the lowest-energy state found in the pre-run. The main runs in the second energy window were started in a random configuration within the energy range. All runs completed the first phase of the simulation here, owing to the use of two energy windows. The two DOS segments obtained by WL-1/t were sewn together by matching the estimates at a point in the intersection of the two windows. For the PA runs we used R = 2.1 × 10^6, θ = 10, α* = 0.86, corresponding to N_β ≈ 100 temperature steps and 10^12 spin-flip attempts. The DOS estimates from both methods are only considered for E ≤ 0 and Ω(E) = Ω(−E) is used for E > 0. The resulting DOS estimate is normalized using the known total number of states, 2^N. We quantify the success of the two algorithms in estimating the DOS of these samples of varying hardness by determining the level of fluctuation in the estimates of ln Ω(E) between independent runs, both for the WL-1/t and EPA methods.
A large standard deviation σ between runs implies that the algorithm often struggles to converge. Standard deviations σ[ln Ω̂(E)] are estimated from 200 independent runs for each method and disorder realization. In Fig. 3 we show these standard deviations, averaged over all energy levels, for the samples of the different hardness classes k = 3, 4, .... It is clear that EPA is much less affected by sample hardness than WL-1/t: while the fluctuations grow substantially with τ for WL, there is virtually no dependence of precision on hardness for EPA. For the easiest samples, WL-1/t has smaller fluctuations between runs, but for the higher hardness classes the situation is reversed. Note that this quantity only covers the effect of statistical errors, whereas the data in Figs. 1 and 2 considered the total deviation from the exact results, which also includes bias effects. Note also that the sample-to-sample fluctuations, represented in the error bars of the data points in Fig. 3, are orders of magnitude larger for WL-1/t than for EPA. We find that WL-1/t and EPA have rather different behavior in sampling the DOS in different energy ranges, with WL-1/t being more focused on higher energies; for details see App. C.
While in the present demonstration we used relaxation times from parallel tempering for classifying sample hardness, it is worthwhile to note that EPA can itself provide a hardness measure and thereby differentiate easy and hard samples. A few such quantities have been previously proposed for population annealing [54]. We consider here in particular the (temperature-dependent) mean-square family size ρ_t, defined as [54]

ρ_t = lim_{R→∞} R Σ_i n_i²,   (8)

where n_i is the fraction of the current population that descends from the i-th member of the initial population at β_0 = 0, while R corresponds to the initial population size. The quantity R/ρ_t can be understood as an effective population size, corresponding to the number of statistically independent replicas, such that R/ρ_t = R corresponds to a perfectly uncorrelated population, while R/ρ_t → 0 for the strongest correlations. These two limits hence represent the easiest and hardest samples, for which one would expect τ → 0 and τ → ∞, respectively, for parallel tempering. A related quantity that also takes the decorrelating effect of spin flips into account is the effective population size R_eff defined in Ref. [72]. In Fig. 4 we show a scatter plot of ρ_t for the 100 samples of each hardness class k < 7 and the 14 samples for k = 7, respectively. The disorder average of ρ_t at β = 3 is found to be 49, 135, 420, 663 and 840 for k = 3, 4, 5, 6, and 7, respectively, indicating that while for the main part of the distribution the hardnesses in EPA and parallel tempering are strongly correlated, in the tails of the distribution the hardness in EPA increases more gently than that found in parallel tempering. As was demonstrated elsewhere, these intrinsic hardness measures can be used to make population annealing simulations adaptive to the sample hardness [58,61]. We note that the planted samples of Sec. IV A have an average ρ_t of ≈ 2000 (see Appendix D), indicating that planted samples of this type are much harder than random ones.
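The mean-square family size introduced above can be computed from a simple ancestry record, one integer label per surviving replica identifying its ancestor in the initial population. A minimal sketch (the data layout and the function name are our own illustrative choices):

```python
from collections import Counter

def mean_square_family_size(ancestors, R0):
    """rho_t = R0 * sum_i n_i^2, where n_i is the fraction of the current
    population descended from initial replica i and R0 is the initial size."""
    R_cur = len(ancestors)
    families = Counter(ancestors)   # family sizes in the current population
    return R0 * sum((c / R_cur) ** 2 for c in families.values())

# Perfectly uncorrelated population: every replica its own family,
# so rho_t = 1 and the effective population size R0/rho_t equals R0.
assert abs(mean_square_family_size(list(range(1000)), 1000) - 1.0) < 1e-9
# Fully collapsed population: a single family, rho_t = R0, R0/rho_t = 1.
assert abs(mean_square_family_size([0] * 1000, 1000) - 1000.0) < 1e-9
```

In a real PA run the ancestor labels are simply copied along during resampling, so this diagnostic adds no measurable cost.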

V. SUMMARY AND DISCUSSION
We have investigated the performance of sampling methods for estimating the density of states of systems with complex free-energy landscapes, focusing on samples of spin glasses. We proposed a novel sampling technique based on sequential Monte Carlo applied to a large population of copies and demonstrated that it outperforms the most widely used entropic sampler, the Wang-Landau algorithm, in the vast majority of cases, in some situations providing improvements by several orders of magnitude, thereby allowing for the reliable study of spin-glass degeneracies for larger problems than previously possible.
A notorious problem with benchmarking algorithms for estimating the density of states lies in safely assessing convergence. Here we solve this issue by either considering planar samples where Pfaffian methods can be used to determine the density of states exactly for systems with more than 1000 spins, or by studying samples with planted solutions on Chimera graphs for which the exact degeneracies of the ground and first excited states can be calculated using a 'bucket' algorithm. Both classes provide hard optimization problems, implying a very nontrivial benchmark.
One of the advantages of the approach based on population annealing is that it does not require any prior knowledge about the energy spectrum, which in contrast needs to be acquired in an additional pre-run for the Wang-Landau method. The notorious problem of premature and false convergence that plagues the latter approach is much less of an issue for the newly introduced technique, where a re-distribution of weights can occur at all stages of the algorithm. The main advantage of the approach, however, lies in its ideal suitability for massively parallel calculations: given sufficient parallel resources, the accuracy of the approximation can be arbitrarily improved at constant wall-clock time by increasing the size of the population.
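To make the weight-redistribution mechanism concrete, the following minimal Python sketch shows a single annealing step of a generic population annealing scheme: replicas are resampled with probabilities proportional to their Boltzmann reweighting factors exp(−Δβ E). This is an illustrative toy under our own naming, not the EPA implementation used here, and it omits the spin-flip sweeps and multi-histogram analysis that follow each resampling in practice.

```python
import numpy as np

def anneal_step(energies: np.ndarray, dbeta: float, rng) -> np.ndarray:
    """One population-annealing resampling step: replicas are duplicated
    or culled with probabilities proportional to exp(-dbeta * E), which
    redistributes statistical weight toward low-energy states."""
    w = np.exp(-dbeta * (energies - energies.min()))  # shift for stability
    p = w / w.sum()
    R = len(energies)
    # Multinomial resampling: indices of surviving/duplicated replicas.
    return rng.choice(R, size=R, p=p)

# Toy example: 5 replicas with given energies, after lowering temperature
# low-energy replicas tend to be over-represented in the new population.
rng = np.random.default_rng(1)
E = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
idx = anneal_step(E, dbeta=0.5, rng=rng)
print(E[idx])
```

Because every resampling step reweights the entire population, a poor weight assignment made early on can still be corrected later, which is the property alluded to above.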
The type of problem set we study, namely spin glasses, is very relevant as current experimental quantum annealers attempt to solve precisely this type of problem. With questions still lingering about which distribution these devices sample from [28], it is important to have an accurate tool to estimate the density of states (for instance, to understand thermalization [28]). For this problem we specifically consider instances with vastly different hardnesses, confirming that the accuracy of the proposed technique degrades significantly less for harder samples than that of previous approaches.
Our results are very promising as we could clearly show that i) for a range of problems with complex energy landscapes existing sampling methods for the density of states are difficult to set up and do not converge very reliably, especially for hard samples; ii) a dependable method for the benchmarking of entropic samplers on spin glasses is now available, which we hope will drive research forward in finding even better algorithms; and iii) the entropic population annealing (EPA) algorithm devised here allows for the reliable sampling of large-scale frustrated systems. We therefore trust that the algorithm will become a useful tool for density-of-states calculations in condensed matter physics, quantum computing, and other areas of research.

The case of the Ising ferromagnet on the square lattice, corresponding to the Hamiltonian (1) with J_ij = 1 for all nearest-neighbor pairs (i, j), has served as a standard benchmark for entropic samplers since they were first considered [8]. In this case, the DOS can be exactly computed using methods that are somewhat simpler than those of Ref. [5], see Ref. [4]. Fig. 5 shows the relative deviation of the ln Ω(E) obtained via EPA from the exact DOS for system size L = 64. Similar plots can be found for the WL method in Refs. [8,9] and subsequently in many papers on improved methods. The parameters used for the EPA run are R = 100 000, θ = 10, and α* = 0.86, corresponding to N_β ≈ 230 temperature steps and 9.4 × 10^11 spin flips. It is apparent that a high accuracy is achieved across the whole energy range.
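For reference, the relative deviation plotted in such benchmarks can be computed as in the short sketch below; we assume here the common convention ε(E) = [ln Ω_est(E) − ln Ω_exact(E)] / ln Ω_exact(E), and the numbers are toy values rather than actual simulation data.

```python
import numpy as np

def relative_entropy_deviation(ln_est: np.ndarray,
                               ln_exact: np.ndarray) -> np.ndarray:
    """Relative deviation of estimated level entropies ln Omega(E)
    from the exact ones, level by level."""
    return (ln_est - ln_exact) / ln_exact

# Toy exact entropies for three levels and a 1% overestimate at each.
ln_exact = np.array([np.log(2.0), np.log(8.0), np.log(20.0)])
ln_est = ln_exact * 1.01
print(relative_entropy_deviation(ln_est, ln_exact))  # each entry ≈ 0.01
```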

ACKNOWLEDGMENTS
In Table I we compare the average deviation ∆ according to Eq. (7) for different runs of the WL-1/t and EPA algorithms. The WL-1/t runs obtained the full DOS, while the EPA simulations obtained the degeneracies of energy levels with E ≤ 0, and we exploited the symmetry Ω(−E) = Ω(E). For the same number of spin flips, the accuracy of WL-1/t is approximately the same as that of EPA for 5 < θ < 10. Introducing an automatic adaptation of θ for EPA results in an additional, approximately threefold reduction of ∆ [61].

Appendix B: Normalization of the density-of-states estimates
While EPA provides the DOS with its absolute normalization, this is not natively the case for WL. For a fair comparison, the different possible normalizations should hence be considered. Two main normalization schemes are known: one fixes the value of Ω(E*) at a specific energy, for example at the ground state E* = E_0 or at E* = 0. Alternatively, one can use the fact that the total number of states is 2^N, i.e., Σ_E Ω(E) = 2^N.
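Both schemes amount to a constant shift of ln Ω(E). A minimal Python sketch of the two (with our own function names, and a log-sum-exp to keep the total-count normalization overflow-safe for large entropies):

```python
import numpy as np

def normalize_total(lnOmega: np.ndarray, N: int) -> np.ndarray:
    """Shift ln Omega(E) so that sum_E Omega(E) = 2**N, using a
    log-sum-exp to avoid overflow for large level entropies."""
    m = lnOmega.max()
    ln_total = m + np.log(np.sum(np.exp(lnOmega - m)))
    return lnOmega - ln_total + N * np.log(2.0)

def normalize_at(lnOmega: np.ndarray, idx: int,
                 ln_target: float) -> np.ndarray:
    """Shift ln Omega so that its value at a reference energy (index idx)
    equals a known value, e.g. ln 2 for a doubly degenerate ground state."""
    return lnOmega - lnOmega[idx] + ln_target

# Toy relative DOS with an arbitrary offset, e.g. from a WL run, N = 4.
lnO = np.array([0.7, 2.1, 3.0, 2.1, 0.7])
print(np.exp(normalize_total(lnO, N=4)).sum())          # -> 16.0 (= 2**4)
print(np.exp(normalize_at(lnO, 0, np.log(2.0)))[0])     # -> 2.0
```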
In Fig. 6 we show the effect of different normalizations applied to the density-of-states estimate resulting from an EPA simulation (top panel) and a WL-1/t run (bottom panel) for a single realization (different realizations provide comparable relative performances). It is apparent that different normalizations lead to rather different statistical fluctuations for different energies, and that EPA and WL-1/t behave quite differently in this respect. Clearly, fixing the DOS at a specific E* leads to zero fluctuations at this point. Averaged over all energies, however, it is found that the normalization by the total number of states leads to the lowest fluctuations, and for EPA these are very similar to those found from the intrinsic normalization of EPA. It is also possible to consider the entropy differences ln Ω(E) − ln Ω(E + 2) etc., which are intrinsically normalization independent and also feature relatively low levels of fluctuation. Two of the curves are almost identical in the bottom panel of Fig. 6, because the DOS has its largest value at E = 0; hence the degeneracy is most accurately estimated by WL-1/t at E = 0.

Appendix C: Precision in different energy ranges
A closer comparison of WL-1/t and EPA is achieved by comparing the precision achieved with the same computational effort, resolved by energies. To this end, we studied the behavior of σ(ln Ω(E) − ln Ω(E + 2)), the fluctuation of the estimated differences in level entropies. Simulations were performed with both methods on the same set of samples, using the parameters described in the main text that ensure that the same number of spin flips is performed. Figure 7 shows the result for one sample of each hardness class k = 3, 4, 5, 6, and 7. One can see that EPA results in a relatively small σ for low energies, especially for hard samples. At the same time, WL results in smaller values of σ(ln Ω(E) − ln Ω(E + 2)) for high energies. This seems plausible, as WL spends much more time at the higher energies with much larger entropies, whereas in EPA the population is continuously moved from high to low energies.
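The fluctuation measure used here can be estimated from a set of independent runs; since per-run differences of adjacent level entropies are taken first, any constant normalization offset of an individual run cancels. A small Python sketch with invented toy data:

```python
import numpy as np

def level_entropy_fluctuations(lnOmega_runs: np.ndarray) -> np.ndarray:
    """Given an array of shape (n_runs, n_levels) of ln Omega estimates
    from independent runs, return the standard deviation over runs of
    the normalization-independent differences between adjacent levels."""
    diffs = lnOmega_runs[:, :-1] - lnOmega_runs[:, 1:]
    return diffs.std(axis=0, ddof=1)

# Toy data: 3 runs, 4 energy levels; the runs differ only by a constant
# per-run offset, which cancels in the differences.
runs = np.array([[1.0, 2.0, 4.0, 7.0],
                 [2.0, 3.0, 5.0, 8.0],
                 [0.0, 1.0, 3.0, 6.0]])
print(level_entropy_fluctuations(runs))  # -> [0. 0. 0.]
```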

Appendix D: Hardness of the planted ensemble
It is interesting to consider some of the hardness measures for the EPA method also for the other ensembles discussed here. Figure 8 shows the mean-square family size ρ_t at the lowest temperature in EPA for the 625 planted samples on Chimera graphs (see Sec. IV A). The average value is ρ_t ≈ 2000, so the planted samples of this type are much harder than the random ones (see Sec. IV C).

FIG. 7. Standard deviations of the differences ln Ω(E) − ln Ω(E + 2) of level entropies as sampled by the WL-1/t and EPA approaches for a single sample from each hardness class, k = 3 (a), k = 4 (b), k = 5 (c), k = 6 (d), and k = 7 (e). WL-1/t was performed with 4.8 × 10^11 spin-flip attempts for all samples, restricting the walk to energies E ≤ E_max, where E_max = −500 (the ground-state energy for these samples is roughly E_0 ≈ −800). A pre-run of 2 × 10^11 spin-flip attempts was performed to discover the range of possible energies for each sample; the main runs were started in the lowest-energy state found in the pre-run. All runs completed the first phase of the simulation. For the EPA runs we used R = 10^6, θ = 10, and α* = 0.86, corresponding to N_β ≈ 100 temperature steps and 4.8 × 10^11 spin-flip attempts.