Message Passing Variational Autoregressive Network for Solving Intractable Ising Models

Many deep neural networks have been used to solve Ising models, including autoregressive neural networks, convolutional neural networks, recurrent neural networks, and graph neural networks. Learning the probability distribution of energy configurations or finding the ground states of disordered, fully connected Ising models is essential for statistical mechanics and NP-hard problems. Despite tremendous efforts, a neural network architecture able to solve these fully connected, extremely intractable problems on larger systems with high accuracy is still lacking. Here we propose a variational autoregressive architecture with a message passing mechanism, which can effectively utilize the interactions between spin variables. The new network, trained under an annealing framework, outperforms existing methods in solving several prototypical Ising spin Hamiltonians, especially for larger spin systems at low temperatures. The advantages also come from the great mitigation of mode collapse during the training of deep neural networks. Given the extreme difficulty of these problems, our method extends the current computational limits of unsupervised neural networks for solving combinatorial optimization problems.


I. INTRODUCTION
Many deep neural networks have been used to solve Ising models [1,2], including autoregressive neural networks [3-8], convolutional neural networks [9], recurrent neural networks [4,10], and graph neural networks [11-17]. The autoregressive neural networks model the distribution of high-dimensional vectors of discrete variables to learn the target Boltzmann distribution [18-21] and allow for direct sampling from the networks. However, recent works question the sampling ability of autoregressive models in highly frustrated systems, with the challenge resulting from mode collapse [22,23]. The convolutional neural networks [9] respect the lattice structure of the 2D Ising model and achieve good performance [3], but cannot solve models defined on non-lattice structures. Variational classical annealing (VCA) [4] uses autoregressive models with recurrent neural networks (RNN) [10] and outperforms traditional simulated annealing (SA) [24] in finding ground-state solutions of Ising problems. This advantage comes from the fact that an RNN can capture long-range correlations between spin variables by establishing connections between RNN cells in the same layer. Those cells must be computed sequentially, however, which makes VCA very inefficient to compute. Thus, for a particularly difficult class of fully connected Ising models, the Wishart planted ensemble (WPE) [25], Ref. [4] only solves problem instances with up to 32 spin variables. Since a Hamiltonian with the Ising form can be directly viewed as a graph, it is intuitive to use graph neural networks (GNN) [26-28] to solve it. A GNN-based method [16], which first encodes problems in Ising form [29] and then relaxes the discrete variables into continuous ones, has been employed for combinatorial optimization problems with system sizes up to millions, but some researchers argue that this method does not perform as well as classical heuristic algorithms [30,31]. In fact, the maximum cut and maximum independent set problem instances with millions of variables used in Ref. [16] are defined on very sparse graphs and are not hard to solve [31]. Also, a naive combination of graph convolutional networks (GCN) [28] and variational autoregressive networks (VAN) [3] (which we denote 'GCon-VAN' in this work) has been tried, but performs poorly on statistical mechanics problems defined on sparse graphs [8]. Reinforcement learning has also been used to find the ground state of Ising models [32,33]. In addition, recently developed Ising machines have been used to find the ground state of Ising models and have shown impressive performance [34], especially those based on physics-inspired algorithms, such as the simulated coherent Ising machine (SimCIM) [35,36] and simulated bifurcation (SB) [37,38].
Exploration of new methods to tackle Ising problems of larger scale and denser connectivity is of great interest. For example, finding the ground states of Ising spin glasses on two-dimensional lattices can be done exactly in polynomial time, while in three or higher dimensions it is a non-deterministic polynomial-time (NP) hard problem [39]. Ising models correspond to problems defined on graphs, such as the maximum independent set problem, whose difficulty in finding the ground state may depend on the node degree exceeding a certain value [40,41]. Designing neural networks to solve Ising models on denser graphs would lead to the development of powerful optimization tools and further shed light on the computational boundaries of deep-learning-assisted algorithms.
Due to the correspondence between Ising models and graph problems, existing Ising-solving neural network methods can be described within the message passing neural network (MPNN) framework [42]. MPNN can be used to abstract their commonalities and identify the most crucial implementation details, which helps in designing more complex and powerful network architectures. Therefore, we reformulate existing VAN-based network architectures within this framework and then explore more variants, designing new network architectures with carefully designed message passing (MP) mechanisms to better address intractable Ising models.
Here we propose a variational autoregressive architecture with a message passing mechanism and dub it the message passing variational autoregressive network (MPVAN). It can effectively utilize the interactions between spin variables, including both whether couplings exist and their values, while previous methods only considered the former, i.e., the correlations.
FIG. 1. The residual energy histogram on the WPE with system size N = 60 and difficulty parameter α = 0.2, which makes the problem instances hard to solve. The residual energy is defined as the difference between the energy of the configurations sampled directly from the network after training and the energy of the ground state. Each method contains 9 × 10^6 configurations obtained from 30 instances, each for 30 runs.
We show the residual energy histogram on the Wishart planted ensemble (WPE) [25], which has a rough energy landscape, in Fig. 1. Compared to VAN [3], VCA [4], and GCon-VAN [8], the configurations sampled from MPVAN are concentrated in regions with lower energy. Therefore, MPVAN has a higher probability of obtaining low-energy configurations, which is beneficial for finding the ground state and is also what combinatorial optimization is concerned with.
Numerical experiments show that our method outperforms existing methods in solving two classes of disordered, fully connected Ising models with extremely rough energy landscapes, the WPE [25] and the Sherrington-Kirkpatrick (SK) model [43], including more accurately estimating the Boltzmann distribution and calculating a lower free energy at low temperatures. The advantages also come from greatly delaying the emergence of mode collapse during the training of deep neural networks. Moreover, as the system size or the connectivity of the graph increases, MPVAN gains greater advantages over existing methods in giving a lower upper bound to the energy of the ground state. Compared to short-range Ising models such as the Edwards-Anderson model [44], the infinite-range interaction models we consider (the SK model and the WPE) are more challenging, since there exist many loops of different lengths, which leads to more complicated frustration. Given the extreme difficulty of these problems, our method extends the current computational limits of unsupervised neural networks [45] for solving intractable Ising models and combinatorial optimization problems [46].
The paper is structured as follows. In Sec. II, we provide a detailed description of the message passing variational autoregressive network and give a theoretical analysis. In Sec. III, we conduct experiments to benchmark our method against existing methods for solving intractable Ising models. We conclude and discuss in Sec. IV.

II. MESSAGE PASSING VARIATIONAL AUTOREGRESSIVE NETWORK
The message passing variational autoregressive network (MPVAN) is designed to solve Ising models with the Hamiltonian

H = -∑_{⟨i,j⟩} J_ij s_i s_j,    (1)

where {s_i}_{i=1}^N ∈ {±1}^N are N spin variables and ⟨i, j⟩ denotes that there is a non-zero coupling J_ij between s_i and s_j.
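As a concrete reference for this notation, the following minimal sketch (assuming the couplings are stored as a dense NumPy array J with J[i, j] = J_ij and zero diagonal; variable names are illustrative) evaluates the Hamiltonian of Eq. (1) for a batch of configurations.

```python
import numpy as np

def ising_energy(J, s):
    """Energy H(s) = -sum_{<i,j>} J_ij s_i s_j for configurations s in {-1, +1}^N.

    J : (N, N) symmetric coupling matrix with zero diagonal.
    s : (B, N) batch of spin configurations.
    """
    # The einsum sums J_ij s_i s_j over all ordered pairs (i, j),
    # so divide by 2 to count each coupling once.
    return -0.5 * np.einsum('bi,ij,bj->b', s, J, s)

# Example: a random symmetric instance and a batch of random configurations.
rng = np.random.default_rng(0)
N = 8
J = rng.normal(size=(N, N)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
s = rng.choice([-1.0, 1.0], size=(4, N))
print(ising_energy(J, s))
```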
MPVAN is composed of an autoregressive message passing mechanism and a variational autoregressive network architecture; its network architecture is shown in Fig. 2(a). The input to MPVAN is a configuration s = {s_i}_{i=1}^N in a predetermined order of spins, and the i-th component of the output, ŝ_i, is the conditional probability of s_i taking +1 given the values of the spins in front of it, s_<i, i.e., ŝ_i = q_θ(s_i = +1 | s_<i).
As in VAN [3], the variational distribution of MPVAN is decomposed into a product of conditional probabilities,

q_θ(s) = ∏_{i=1}^N q_θ(s_i | s_1, s_2, ..., s_{i-1}),    (2)

where q_θ(s) represents the variational joint probability and q_θ(s_i | s_1, s_2, ..., s_{i-1}) denotes the variational conditional probability, both of which are parametrized by trainable parameters θ.
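To make the factorization of Eq. (2) concrete, the sketch below (a hypothetical helper, not the authors' implementation) converts the per-spin network outputs ŝ_i = q_θ(s_i = +1 | s_<i) into the joint log-probability ln q_θ(s) for a batch of configurations; drawing samples then proceeds spin by spin, feeding the partial configuration and sampling s_i ~ Bernoulli(ŝ_i).

```python
import torch

def log_joint_prob(s_hat, s):
    """log q_theta(s) = sum_i log q_theta(s_i | s_<i).

    s_hat : (B, N) conditional probabilities q_theta(s_i = +1 | s_<i).
    s     : (B, N) spin configurations in {-1, +1}.
    """
    # Probability the network assigns to the observed value of each spin.
    p_i = torch.where(s > 0, s_hat, 1.0 - s_hat)
    return torch.log(p_i).sum(dim=1)
```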

A. MPVAN Layer
An MPVAN is constructed by stacking multiple message passing variational autoregressive network layers. A single MPVAN layer is composed of an autoregressive message passing process and the nonlinear functions with trainable parameters defined in VAN [3].
The input to the MPVAN layer is a set of node features, h = {⃗h_1, ⃗h_2, ..., ⃗h_N}, ⃗h_i ∈ (0, 1)^F, where F is the number of training samples; for brevity, we set F = 1 and denote ⃗h_i simply as h_i. The layer produces a new set of node features,

h^o_i = σ( (W ⟨h⟩_MP + b)_i ),

where the sigmoid activation function σ(x) = 1/(1 + e^{-x}) ranges in (0, 1) and thus h^o_i ∈ (0, 1). Here ⟨h⟩_MP = {⟨h_1⟩_MP, ⟨h_2⟩_MP, ..., ⟨h_N⟩_MP}, ⟨h_i⟩_MP ∈ R, denotes the node features updated from h by message passing. The W and b are layer-specific trainable parameters, and W is a triangular matrix to ensure the autoregressive property [3,19-21].
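A minimal sketch of such a layer is given below, assuming the message-passed features ⟨h⟩_MP are combined through a masked (triangular) linear map followed by a sigmoid, as described above; the class name and the exact masking convention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskedAutoregressiveLayer(nn.Module):
    """One layer: h_out_i = sigmoid( (W <h>_MP + b)_i ), with a triangular W
    so that h_out_i depends only on features of spins j <= i.
    As in MADE/VAN-style networks, the first layer would typically use a
    strictly lower-triangular mask (exclusive=True) so that output i depends
    only on spins j < i."""

    def __init__(self, n_spins, exclusive=False):
        super().__init__()
        self.linear = nn.Linear(n_spins, n_spins)
        diag = -1 if exclusive else 0
        self.register_buffer("mask", torch.tril(torch.ones(n_spins, n_spins), diagonal=diag))

    def forward(self, h_mp):
        # h_mp : (B, N) message-passed node features <h>_MP.
        return torch.sigmoid(
            nn.functional.linear(h_mp, self.linear.weight * self.mask, self.linear.bias)
        )
```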
To show how to obtain ⟨h⟩_MP, we first review the message passing mechanism defined in the MPNN framework [42]. We describe the message passing operations on the current layer with node features h_i and edge features J_ij. The message passing phase specifies how to obtain the neighboring messages m_i and how to update the node features h_i,

m_i = ∑_{j ∈ N_a(i)} M(h_i, h_j, J_ij),
⟨h_i⟩_MP = U(h_i, m_i),

where h_j is the node feature of node j and

N_a(i) = { j | j < i, J_ij ≠ 0 }

denotes the neighbors located before node i. The restriction to N_a(i) preserves the autoregressive property, which differs from general message passing mechanisms [42] and graph neural networks [28,47,48]. The message aggregation function M(h_i, h_j, J_ij) and the node feature update function U(h_i, m_i) differ across message passing mechanisms. Existing VAN-based methods can thus be reformulated as combinations of VAN and different MP mechanisms; we then explore their variants and propose our method.
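The following sketch illustrates the generic autoregressive message passing scheme just described, with the message function M and the update function U left as arguments; it is a schematic reference implementation, not the authors' code.

```python
import numpy as np

def autoregressive_message_passing(h, J, message_fn, update_fn):
    """Generic autoregressive MP: node i only receives messages from
    neighbors j < i with J_ij != 0, so the autoregressive order is preserved.

    h : (N,) node features, J : (N, N) couplings.
    message_fn(h_i, h_j, J_ij) -> contribution of neighbor j to m_i.
    update_fn(h_i, m_i)        -> updated feature <h_i>_MP.
    """
    N = len(h)
    h_mp = np.empty(N)
    for i in range(N):
        neighbors = [j for j in range(i) if J[i, j] != 0.0]  # N_a(i)
        m_i = sum(message_fn(h[i], h[j], J[i, j]) for j in neighbors)
        h_mp[i] = update_fn(h[i], m_i)
    return h_mp
```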
In VAN [3], there is no message passing process from neighboring nodes. Therefore, the node features h_i are updated according to the identity

⟨h_i⟩_MP = h_i,

which is the first MP mechanism in Fig. 2(b). Another successful variational autoregressive network approach is variational classical annealing (VCA) [4], which uses an RNN architecture to take into account the correlations between hidden units in the same layer. The network structure of VCA is a special RNN architecture designed according to the topology of the model to be solved, so it takes the features of neighboring nodes into account through trainable parameters. Therefore, it is difficult to represent VCA as MPVAN with a special MP mechanism, since MP is free of trainable parameters.
In GCon-VAN [8], a combination of GCN [28] and VAN, the ⟨h_i⟩_MP are obtained by

⟨h_i⟩_MP = (1 / (deg(i) + 1)) ( h_i + ∑_{j ∈ N_a(i)} A_ij h_j ),    (7)

where A is the adjacency matrix of the graph and deg(i) is the degree of node i. The h_i is thus updated based on the connectivity of the graph, which is the second MP mechanism in Fig. 2(b). However, GCon-VAN performs poorly on sparse graphs when calculating physical quantities such as correlations and the free energy [8], and it performs even worse on dense graphs in our trials. This may be because GCon-VAN only considers connectivity and ignores the weights of the neighboring node features. Also, judging from the results of VAN in Fig. 1, the node feature h_i itself should be highlighted rather than carrying the small weight 1/(deg(i)+1) of Eq. (7). Therefore, we explore more variants and propose three MP mechanisms, whose neighboring messages m_i are given by Eqs. (8a)-(8c); in all three, the ⟨h_i⟩_MP are obtained from Eq. (10). For the neighboring message m_i in Eq. (8a), we weight the neighboring node features instead of passing them with equal weight as in Eq. (7), and in Eq. (10) we increase the weight of h_i for all the mechanisms we design. However, Eq. (8a) ignores the influence of the signs of the couplings {J_ij}.
Thus, we propose the MP mechanism composed of Eq. (8b) and Eq. (10), which is the third MP mechanism in Fig. 2(b). It uses the values of the couplings {J_ij} in the Hamiltonian to weight the neighboring node features. Since it is based on the graph defined by the target Hamiltonian, we dub it the 'graph message passing mechanism' (Graph MP).
Previous methods and the above two MP mechanisms we designed do not make full use of the interactions between spin variables in the Hamiltonian of Eq. (1), nor of the known values of s_<i, when updating h_i. Intuitively, it may be helpful to take those interactions and the values of s_<i into account in the message passing process.
Therefore, we propose the Hamiltonians MP mechanism composed of Eq. (8c) and Eq. (10), which is the fourth MP mechanism in Fig. 2(b) and also the mechanism used in MPVAN. Here s_j = ±1 is the known value of the neighboring spin j, and h'_j ∈ (0, 1) represents the probability of spin j taking the value s_j.
Based on the above definitions, it is reasonable to use h'_j rather than h_j in message passing. To illustrate, suppose that for the neighboring spin j, s_j = -1 and h_j = 0.2. Then, if we repeatedly sampled spin j, we would obtain s_j = -1 with probability 0.8. This means that the neighboring spin j should have a greater impact on h_i when it takes -1 than when it takes +1, but h_j = 0.2 does not reflect the importance of s_j = -1 as well as h'_j = 0.8 does. On the other hand, suppose that for the neighboring spin j, s_j = -1 and h_j = 0.8. We would obtain s_j = -1 with probability only 0.2 if we sampled spin j again. In this case, using h'_j instead of h_j correctly reflects the small importance of s_j = -1. We therefore use h'_j to reflect the effect of spin j taking the value s_j in message passing.
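The sketch below (illustrative names; only the neighboring message is shown, since the exact combination rule of Eq. (10) is not reproduced here) computes the Hamiltonians MP message for spin i from the couplings J_ij, the sampled values s_j, and the reweighted features h'_j defined above.

```python
def hamiltonians_mp_message(J, s, h, i):
    """Neighboring message for spin i under the Hamiltonians MP mechanism.

    J : (N, N) couplings, s : (N,) sampled spins in {-1, +1},
    h : (N,) conditional probabilities of each spin taking +1.
    Only neighbors j < i with J_ij != 0 contribute (autoregressive property).
    """
    m_i = 0.0
    for j in range(i):
        if J[i][j] != 0.0:
            # h'_j: probability that spin j takes its sampled value s_j.
            h_prime = h[j] if s[j] == +1 else 1.0 - h[j]
            m_i += J[i][j] * s[j] * h_prime
    return m_i
```

The updated feature ⟨h_i⟩_MP of Eq. (10) then combines h_i with this message m_i.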
Taking the graph with 4 nodes and 3 edges in Fig. 2(b) as an example, the four message passing mechanisms above yield the ⟨h_3⟩_MP illustrated there. We also consider more MP mechanisms and compare their performance in Appendix I, where Hamiltonians MP always performs best. In addition, since an arbitrary message passing variational autoregressive network is constructed by stacking MPVAN layers, we also discuss the effect of the number of layers on the performance. MPVAN exhibits characteristics similar to those of GNN, i.e., there exists an optimal number of layers, as discussed in Appendix II.

B. Training MPVAN
We now describe how to train MPVAN. In line with the variational approach employed in VAN, the variational free energy

F_q = ∑_s q_θ(s) [ E(s) + (1/β) ln q_θ(s) ]    (12)

is used as the loss function, where β = 1/T is the inverse temperature and E(s) is the Hamiltonian H of Eq. (1) evaluated at a given configuration s. The configuration s follows the Boltzmann distribution p(s) = e^{-βE(s)}/Z, where Z = ∑_s e^{-βE(s)}. Since the KL divergence between the variational distribution q_θ and the Boltzmann distribution p,

D_KL(q_θ || p) = ∑_s q_θ(s) ln( q_θ(s)/p(s) ) = β (F_q - F),

is always non-negative, F_q is an upper bound to the free energy F = -(1/β) ln Z. The gradient of F_q with respect to the parameters θ is

∇_θ F_q = E_{s∼q_θ(s)} [ ( E(s) + (1/β) ln q_θ(s) ) ∇_θ ln q_θ(s) ].    (13)

With the computed gradients ∇_θ F_q, we iteratively adjust the parameters of the networks until F_q stops decreasing.
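A minimal sketch of one gradient step on F_q is given below, assuming a PyTorch model exposing hypothetical sample and log_prob methods; subtracting the batch mean of the per-sample "reward" as a variance-reduction baseline is an additional assumption, not something specified above.

```python
import torch

def free_energy_gradient_step(model, optimizer, J, beta, batch_size):
    """One gradient step on the variational free energy F_q (Eqs. 12-13).

    J : (N, N) torch tensor of couplings. `model.sample(batch_size)` is
    assumed to return configurations s in {-1, +1}^N and `model.log_prob(s)`
    to return log q_theta(s); both are hypothetical interfaces.
    """
    with torch.no_grad():
        s = model.sample(batch_size)                       # s ~ q_theta
        energy = -0.5 * torch.einsum('bi,ij,bj->b', s, J, s)
    log_q = model.log_prob(s)
    with torch.no_grad():
        # Per-sample term E(s) + (1/beta) ln q_theta(s), with the batch mean
        # subtracted as a baseline to reduce gradient variance (assumption).
        reward = energy + log_q / beta
        reward = reward - reward.mean()
    loss = (reward * log_q).mean()                         # score-function estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return (energy + log_q / beta).mean().item()           # estimate of F_q
```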
MPVAN is trained under an annealing framework, i.e., starting from an initial temperature T_initial and gradually decreasing the temperature over N_annealing annealing steps until the final temperature T_final is reached. At each annealing step, we decrease the temperature and then apply N_training gradient-descent steps to update the network parameters, thereby minimizing the variational free energy F_q. As with VAN [3], the network is trained on the data it produces itself. After training, we can sample directly from the network to calculate the upper bound to the free energy and other physical quantities such as the entropy and correlations.
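The annealed training loop can then be sketched as follows, reusing the free_energy_gradient_step helper above; the linear temperature schedule is an assumption made purely for illustration.

```python
import numpy as np

def annealed_training(model, optimizer, J, T_initial, T_final,
                      n_annealing, n_training, batch_size):
    """Annealed variational training: lower T, then take gradient steps at each T.

    A linear schedule from T_initial to T_final is assumed for illustration.
    """
    for T in np.linspace(T_initial, T_final, n_annealing):
        beta = 1.0 / T
        for _ in range(n_training):
            fq = free_energy_gradient_step(model, optimizer, J, beta, batch_size)
        print(f"T = {T:.3f}, F_q estimate = {fq:.4f}")
    return model
```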

C. Theoretical Analysis for MPVAN
Compared to VAN [3], MPVAN has an additional Hamiltonians message passing process. In this section, we provide a theoretical analysis of the advantage of the Hamiltonians message passing mechanism in MPVAN, summarized in Corollary 1 below.
The goal of MPVAN is to accurately estimate the Boltzmann distribution, i.e., to assign high probability to configurations with low energy and low probability to configurations with high energy. Specifically, MPVAN is trained by minimizing the variational free energy F_q, which is composed of E_{s∼q_θ(s)}[E(s)] and E_{s∼q_θ(s)}[ln q_θ(s)].
Corollary 1. The Hamiltonians message passing process makes E_{s∼q_θ(s)}[E(s)] and E_{s∼q_θ(s)}[ln q_θ(s)] smaller, and therefore the variational free energy F_q smaller, compared to no message passing.

Proof. We discuss separately how the message passing process makes E_{s∼q_θ(s)}[E(s)] and E_{s∼q_θ(s)}[ln q_θ(s)] smaller.
Step 1. When training MPVAN, it is impossible to exhaust all configurations to calculate the variational free energy F_q, so we estimate it by the empirical mean over the training samples,

F_q ≈ (1/N_s) ∑_{k=1}^{N_s} [ E(s_k) + (1/β) ln q_θ(s_k) ],

where s_k is the k-th training sample and N_s is the number of training samples. Consider the Hamiltonians message passing process (composed of Eq. (8c) and Eq. (10)) for updating h_i as an example of how message passing makes E_{s∼q_θ(s)}[E(s)] smaller. Since the message passing process maintains the autoregressive property, making E(s) smaller for any configuration is equivalent to making the local Hamiltonian, defined as H_local = -∑_{j<i} J_ij s_i s_j, smaller when the values of the neighboring spins s_<i are given.
Since H_local = -s_i ∑_{j<i} J_ij s_j with ∑_{j<i} J_ij s_j known, we are concerned with how the message passing affects the probability of s_i taking +1 or -1. According to Eq. (10), the value of ∑_{j<i} J_ij s_j h'_j plays the decisive role here, and we discuss two cases.
Case 1: if ∑_{j<i} J_ij s_j h'_j > 0, then according to Eq. (10) we have ⟨h_i⟩_MP ≥ h_i, i.e.,

Pr[⟨s_i⟩_MP = +1] ≥ Pr[s_i = +1],

where ⟨s_i⟩_MP ∼ Bernoulli(⟨h_i⟩_MP) and s_i ∼ Bernoulli(h_i), Bernoulli(p) denotes sampling from the Bernoulli distribution that outputs +1 with probability p, and Pr[···] denotes the probability of the corresponding event. Thus, we have

E_{⟨s_i⟩_MP}[H'_local] ≤ E_{s_i}[H'_local],    (18)

where H'_local = -s_i ∑_{j<i} J_ij s_j h'_j.

Case 2: if ∑_{j<i} J_ij s_j h'_j < 0, then according to Eq. (10) we have ⟨h_i⟩_MP ≤ h_i, i.e.,

Pr[⟨s_i⟩_MP = +1] ≤ Pr[s_i = +1],

and we again obtain Eq. (18). Therefore, regardless of the sign of ∑_{j<i} J_ij s_j h'_j, the message passing process always makes H'_local smaller in expectation compared with no message passing. The difference between H_local and H'_local is that the latter contains h'_j, which encapsulates information about spin j beyond the current configuration value s_j: it indicates how much influence the message corresponding to the neighboring node feature h_j has on ⟨h_i⟩_MP and, combined with the coupling J_ij, performs a weighted message passing. Thus, although the two are not identical, H_local can be predicted through H'_local, and we conclude that the Hamiltonians message passing mechanism makes the local Hamiltonian smaller.
Step 2. Similar to E_{s∼q_θ(s)}[E(s)], we have

E_{s∼q_θ(s)}[ln q_θ(s)] ≈ (1/N_s) ∑_{k=1}^{N_s} ln q_θ(s_k).

The second derivative of y(x) = ln(x) is y''(x) = -1/x^2, so y(x) is concave, which implies (y(a) + y(b))/2 < y((a+b)/2) for a ≠ b. From the analysis of Step 1, the message passing process increases the probability of configurations with low energy and reduces the probability of configurations with high energy. Therefore, message passing makes ∑_{k=1}^{N_s} ln q_θ(s_k) smaller compared to no message passing.

In summary, combining the analyses of Step 1 and Step 2, the message passing process makes the variational free energy F_q smaller compared to no message passing.

III. NUMERICAL EXPERIMENTS
As described in Sec. II, MPVAN includes existing neural network approaches as special cases, corresponding to different message passing mechanisms. We conduct experiments in Appendix I to compare MPVAN under multiple message passing mechanisms, where MPVAN with Hamiltonians MP performs best. Below we therefore compare MPVAN with Hamiltonians MP to existing methods, and unless otherwise specified, MPVAN refers to MPVAN with Hamiltonians MP in the following. We experiment on two classes of fully connected and intractable models, the WPE and the SK model.
The Wishart planted ensemble (WPE) [25] is a class of fully connected Ising models with an adjustable difficulty parameter α and planted solutions, which makes it an ideal benchmark for evaluating heuristic algorithms. The Hamiltonian of the WPE is

H = -∑_{⟨i,j⟩} J_ij s_i s_j,    (23)

where the coupling matrix {J_ij} is symmetric and follows a copula distribution. More details about the WPE can be found in Ref. [25].

FIG. 4. The residual energy per site of MPVAN and benchmark methods as a function of system size N. (a) On the WPE, ϵ_res/N is averaged over 30 instances with system size N and α = 0.2, each for 30 runs. When N ≥ 50, the problem instances cannot be solved due to rough energy landscapes. (b) On the SK model, the residual energy per site is averaged over 30 instances, each for 10 runs. Since the energy of the ground state cannot be determined, we use the lowest energy found across MPVAN, VAN, SA, and PT in its place. Due to computational limitations, we exclude VCA from the comparison when N > 100, as its speed is about N/10 times slower than that of MPVAN when trained with the same hyperparameters. More details regarding the computational speed of MPVAN and the other methods can be found in Appendix V.

FIG. 5. On the variants of the SK model, the residual energy per site of MPVAN and benchmark methods as a function of the average node degree in graphs with N = 200, averaged over 30 randomly generated instances, each for 10 runs.
First, we discuss an important issue, mode collapse, which occurs when the target probability distribution has multiple peaks but the network learns only a few of them. It severely affects the sampling ability of autoregressive neural networks [23]. Entropy is commonly used in physics to measure the degree of disorder in a system: the greater the entropy, the more disordered the system. Therefore, we can use the magnitude of the entropy to detect whether mode collapse occurs for the variational distribution. In our experiments, we investigate how the negative entropy, a part of the variational free energy F_q in Eq. (12), changes during training. It is defined as

-S = ∑_s q_θ(s) ln q_θ(s),

where S is the entropy; equivalently, the smaller the negative entropy, the more disordered the system. As shown in Fig. 3, the -S of MPVAN shows an increasing-decreasing-increasing trend, while the -S of the other methods increases monotonically and converges quickly. When the training step ≥ 500, mode collapse occurs for the other methods, whereas for MPVAN it occurs only when the training step ≥ 2000. Therefore, MPVAN greatly delays the emergence of mode collapse. In addition, we consider the impact of learning rates on the emergence of mode collapse in Appendix VI, where mode collapse always occurs later in MPVAN than in VAN.
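For reference, the negative entropy reported here can be estimated by Monte Carlo from the network's own samples; the sketch below reuses the hypothetical sample/log_prob interface introduced earlier.

```python
import torch

def negative_entropy_estimate(model, batch_size=10000):
    """Monte Carlo estimate of -S = E_{s~q_theta}[ln q_theta(s)].

    A large, rapidly growing value during training signals that the
    variational distribution is collapsing onto a few modes.
    """
    with torch.no_grad():
        s = model.sample(batch_size)
        return model.log_prob(s).mean().item()
```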
Next, we benchmark MPVAN against existing methods for calculating the upper bound to the energy of the ground state, i.e., for finding a configuration s that minimizes the Hamiltonian in Eq. (23), which is also the core concern of combinatorial optimization. To facilitate a quantitative comparison, we employ the residual energy, defined as

ϵ_res = [ ⟨ H_min - E_G ⟩_ava ]_ava,

where H_min represents the minimum value of the Hamiltonian over the 10^6 configurations sampled directly from the network after training, and E_G is the energy of the ground state. Here ⟨...⟩_ava denotes the average over 30 independent runs for the same instance, and [...]_ava denotes the average over 30 instances. In the figures showing the residual energy, such as Fig. 4 and Fig. 5, the solid lines indicate the average value of the residual energy, and the color bands indicate the area between the maximum and minimum values of the residual energy over the 30 independent runs of the corresponding algorithm; both solid lines and color bands are averaged over 30 instances.

As depicted in Fig. 4(a), the residual energy obtained by our method is consistently lower than that of VAN, VCA, and SA across all system sizes when averaging over 30 instances, each for 30 runs. Even compared with state-of-the-art parallel tempering (PT) [49,50], MPVAN exhibits slightly better performance in terms of average residual energy and significantly better performance in terms of minimum residual energy. For WPE instances with system size N = 50, MPVAN can still find the ground state with non-negligible probability, but the other methods cannot. As the system size increases, MPVAN gains greater advantages over existing methods in giving a lower residual energy. Since the number of interactions between spin variables is |{J_ij ≠ 0}| = N(N-1)/2 for fully connected systems, larger systems have many more interactions between spin variables, and the advantages indicate that MPVAN, which takes these interactions into account, performs better in rougher energy landscapes. We always average over 30 instances to reflect the general properties of the models; the differences between instances can be seen in Appendix I. Also, each method is run independently 30 times on the same instances to weaken the influence of randomness in heuristic training, as discussed in Appendix III.
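As a concrete illustration of the residual-energy metric ϵ_res for a single run on a single instance, the sketch below reuses the hypothetical sampling interface and assumes the ground-state energy E_G is known, as it is for planted WPE instances.

```python
import torch

def residual_energy(model, J, E_ground, n_samples=10**6, chunk=10**4):
    """Residual energy of one trained model on one instance:
    the minimum energy over n_samples drawn configurations, minus E_G."""
    h_min = float("inf")
    with torch.no_grad():
        for _ in range(n_samples // chunk):
            s = model.sample(chunk)
            energies = -0.5 * torch.einsum('bi,ij,bj->b', s, J, s)
            h_min = min(h_min, energies.min().item())
    return h_min - E_ground
```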
Since these methods are trained in different ways, we keep the total number of samples used in training and in the final sampling after training the same for all methods. The samples of MPVAN consist of two parts, training samples and final sampling samples; with N_annealing annealing steps and N_training training steps per annealing step, we denote the total number of samples used by MPVAN as N_MPsam. The training samples for VAN and VCA are the same as those for MPVAN, with some fine-tuning of parameters. Denoting by N_inlop the number of inner loops of SA at each temperature, the total number of samples used by SA is denoted N_SAsam; denoting by N_chain the number of chains of PT and by N_rf the number of random flips in each chain, the number of samples used by one PT replica is denoted N_PT1sam. Therefore, corresponding to one run of MPVAN, SA runs ⌈N_MPsam/N_SAsam⌉ times independently and PT runs ⌈N_MPsam/N_PT1sam⌉ replicas, and each then outputs its respective best result to benchmark against MPVAN. It is important to note that the performance of each algorithm depends not only on the number of samples used in training, but also on the training parameters, such as the annealing schedule and the initial and final temperatures.
We have fine-tuned the training parameters of each algorithm to maximize its performance. We also experiment on the Sherrington-Kirkpatrick (SK) model [43], which is one of the most famous fully connected spin glass models and has significant relevance to combinatorial optimization and machine learning applications [51,52]. Its Hamiltonian also takes the form of Eq. (23), where {J_ij} form a symmetric matrix whose entries are drawn from a Gaussian distribution with variance 1/N.
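A minimal sketch for generating one SK instance with the stated statistics is given below; details beyond the symmetric Gaussian couplings with variance 1/N are illustrative choices.

```python
import numpy as np

def sk_instance(N, seed=None):
    """Fully connected SK couplings: symmetric, zero diagonal,
    J_ij drawn from a Gaussian with mean 0 and variance 1/N."""
    rng = np.random.default_rng(seed)
    J = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))
    J = np.triu(J, k=1)          # keep one independent draw per pair
    return J + J.T               # symmetrize; the diagonal stays zero
```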
As illustrated in Fig. 4(b), our method provides significantly lower residual energy than VAN, SA, and PT across all system sizes when averaging over 30 instances, each for 10 runs. Notably, as the system size increases, the advantages of our method over VAN, SA, and PT become even more pronounced, consistent with the trend observed in the WPE experiments. We also show the approximation ratio results on the WPE and the SK model in Appendix IV, which is of interest to researchers in combinatorial optimization.
Inspired by the correlation between node degree and the difficulty of finding the ground state in maximum independent set problems [40,41], in addition to experimenting on fully connected models we also consider models with different connectivity, i.e., different node degrees in the underlying graphs. Since the SK model has been widely studied, it is natural to design new models based on it. We generate models with different connectivity by deleting some couplings of the SK model and call them variants of the SK model (a sketch of such an instance generator is given after this paragraph). At each degree, we randomly generate 30 instances, and the couplings {J_ij} form a symmetric matrix with entries drawn from a Gaussian distribution with variance 1/N. As shown in Fig. 5, our method gives a lower residual energy than VAN and SA at all degrees. Moreover, as the degree increases, the advantages of our method over VAN and SA become even more pronounced. The denser the graph, the larger the number of interactions between spin variables. These advantages show that our method, which takes these interactions into account, is able to give lower upper bounds to the energy of the ground state.
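The sketch below generates such a diluted instance; the rule of keeping each coupling independently with probability avg_degree/(N-1) is an assumption, since the text only states that some couplings of the SK model are deleted.

```python
import numpy as np

def diluted_sk_instance(N, avg_degree, seed=None):
    """Variant of the SK model: SK couplings in which each edge is kept
    independently with probability avg_degree / (N - 1), giving nodes the
    target average degree (an illustrative dilution rule)."""
    rng = np.random.default_rng(seed)
    J = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))
    J = np.triu(J, k=1) + np.triu(J, k=1).T          # symmetric SK couplings
    keep = np.triu(rng.random((N, N)) < avg_degree / (N - 1), k=1)
    keep = keep | keep.T                              # symmetric edge mask
    return np.where(keep, J, 0.0)
```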
In the following, we focus on estimating the Boltzmann distribution and calculating the free energy during annealing. As a proof of concept, we use the WPE with a small system size of N = 20, where all 2^N configurations can be enumerated and the exact Boltzmann distribution and exact free energy F can be calculated within an acceptable time. We set α = 0.05, which makes it difficult to find the ground state due to strong low-energy degeneracy.
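For such small systems, the exact reference quantities can be obtained by brute-force enumeration; the sketch below (reusing the hypothetical log_prob interface) computes the exact Boltzmann probabilities, the exact free energy F = -(1/β) ln Z, and D_KL(q_θ||p).

```python
import itertools
import numpy as np
import torch

def exact_reference(J, beta):
    """Exact Boltzmann probabilities and free energy by enumerating all 2^N
    configurations (feasible only for small N, e.g. N = 20)."""
    N = J.shape[0]
    configs = np.array(list(itertools.product([-1.0, 1.0], repeat=N)))
    energies = -0.5 * np.einsum('bi,ij,bj->b', configs, J, configs)
    log_w = -beta * energies
    log_Z = np.logaddexp.reduce(log_w)
    p = np.exp(log_w - log_Z)          # exact Boltzmann distribution
    F = -log_Z / beta                  # exact free energy
    return configs, p, F

def kl_to_boltzmann(model, configs, p):
    """D_KL(q_theta || p) = sum_s q_theta(s) ln(q_theta(s) / p(s))."""
    with torch.no_grad():
        log_q = model.log_prob(torch.as_tensor(configs, dtype=torch.float32)).numpy()
    q = np.exp(log_q)
    return float(np.sum(q * (log_q - np.log(p))))
```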
As shown in Fig. 6, when the temperature is high, i.e., when β is small, the D_KL(q_θ||p) and the relative errors of F_q with respect to the exact free energy F are particularly small for MPVAN, VAN, and VCA alike; it is therefore necessary to lower the temperature to distinguish them. As the temperature decreases, the probability of the configurations with low (high) energy in the Boltzmann distribution increases (decreases), making it more difficult for neural networks to estimate the Boltzmann distribution. However, we find that the D_KL(q_θ||p) obtained by our method is much smaller than that of VAN and VCA, which indicates that the variational distribution q_θ(s) parametrized by our method is closer to the Boltzmann distribution. Similarly, our method gives a better estimate of the free energy than VAN and VCA. These results illustrate that our method, by taking the interactions between spin variables into account through message passing, is more accurate in estimating the relevant physical quantities.

IV. CONCLUSION AND DISCUSSIONS
In summary, we propose a variational autoregressive architecture with a message passing mechanism, which can effectively utilize the interactions between spin variables, to solve intractable Ising models. Numerical experiments show that our method outperforms existing methods, including VAN, VCA, SA, and even PT, in solving two prototypical Ising spin Hamiltonians, the WPE and the SK model, including more accurately estimating the Boltzmann distribution and calculating a lower free energy at low temperatures. The advantages also stem from the great mitigation of mode collapse during the training of deep neural networks. Moreover, as the system size or the connectivity of the graph increases, MPVAN gains greater advantages over existing methods in giving a lower upper bound to the energy of the ground state.
Formally, MPVAN and GNN are similar. We notice that some researchers have recently argued that graph neural networks, specifically the method of Ref. [16], do not perform as well as classical heuristic algorithms on combinatorial optimization problems [30,31]. Our work, however, draws the opposite conclusion. We argue that when the problems have rough energy landscapes and the ground state is hard to find (e.g., the WPE), our method performs significantly better than traditional heuristic algorithms such as SA and even slightly better than state-of-the-art PT. Our method is based on variational autoregressive networks, which are slow to train when the systems are particularly large, and thus MPVAN is not easy to scale to very large problems. At the very least, we argue that MPVAN (or GNN) excels particularly well on certain intractable Ising models with rough energy landscapes, providing an alternative to traditional heuristics.
The code that supports the findings of this study is available from the corresponding author upon reasonable request.
I. MORE MESSAGE PASSING MECHANISMS

We also consider two additional mechanisms that incorporate the values of the neighboring spins into the MP mechanism of GCon-VAN and into Graph MP, which are named Spin MP in GCon-VAN (we denote the combination as Spin GCon-VAN) and the graph with spin message passing mechanism (Graph Spin MP), respectively. We then conduct the same numerical experiments as in Fig. 1. It can be found in Fig. S2 that when considering the values of neighboring spins, the performance of Graph MP slightly improves, while the performance of GCon-VAN significantly deteriorates. Meanwhile, the performance of Graph Spin MP is still inferior to Hamiltonians MP when combined with VAN. We now show the mean and standard deviation, in Fig. 1, Fig. S1, and Fig. S2, of the energy of the 9 × 10^6 configurations from all the mechanisms mentioned above when they are combined with VAN. Third, we compare the residual energy per site for MPVAN and existing methods across 30 instances of the WPE in Fig. S3. These problem instances are hard to solve, and none of the above methods can find the ground state. The results of GCon-VAN are poor and the area between its maximum and minimum values is too large, so we do not plot its corresponding color band.
As depicted in Fig. S3, MPVAN consistently achieves the best results on all instances. Notably, the residual energy decreases by a substantial margin, ranging from 16.82% to 33.49%, compared to the results obtained with the other methods. Also, the differences between instances are large, which indicates the necessity of averaging over 30 instances to reflect the general properties of the problems in the other experiments.
It can also be seen that GCon-VAN yields the least favorable performance, which supplements the results of Ref. [8] for problems defined on dense graphs.
Fourth, since the order of the spins plays a critical role in the variational conditional probability q_θ(s_i | s_1, s_2, ..., s_{i-1}) of Eq. (2) in the main text and in the autoregressive message passing process, we consider its impact within MPVAN. To investigate this, we select the two instances from Fig. S3 for which the advantages in ϵ_res of MPVAN over the other methods are the smallest (the 7th instance) and the largest (the 24th instance).
For these two instances, we randomly generate 10 orders of spins and evaluate the performance of MPVAN and existing methods. As illustrated in Fig. S4, MPVAN consistently achieves the best results for all orders of spins. The results suggest that MPVAN does not exhibit a non-negligible dependence on the order of spins, a characteristic similar to the standard autoregressive model [21]. We show the ϵ_res of all the mechanisms mentioned above, combined with VAN, in Tab. II for the WPE with N = 60 and α = 0.2. To demonstrate the differences among instances, we also show the standard deviation of the residual energy over 30 instances as std(ϵ_res). It can be seen that MPVAN outperforms all other methods in mean(ϵ_res), and its smaller std(ϵ_res) shows that it works stably across different instances.

II. THE OPTIMAL NUMBER OF LAYERS FOR MPVAN
Formally, MPVAN and GNN are similar. Meanwhile, the number of layers is an important hyperparameter of GNN: too many layers can cause excessive smoothing, while too few layers also tend to perform poorly. Therefore, we also consider the effect of the number of layers on the performance of MPVAN. A message passing process with one layer utilizes features from one-hop neighboring nodes; by stacking MPVAN layers, we can access features from nodes multiple hops away. As demonstrated in Fig. S5(a), we visualize the message passing process of three layers for node 10. It is important to note that nodes 11 and 12 are deliberately excluded from the MP process to preserve the autoregressive property.
We conduct experiments to determine the optimal number of layers for MPVAN. As illustrated in Fig. S5(b), the residual energy exhibits a trend of first decreasing and then increasing as the number of layers grows. Notably, the residual energy reaches its minimum with 3 layers, a result consistent across system sizes N = 30, 60, and 90. Fewer layers limit the ability of the network to pass features from more distant neighboring nodes, while a large number of layers can lead to excessive smoothing of node features, making them indistinguishable. Therefore, we always adopt a 3-layer network for MPVAN.

III. THE RANDOMNESS IN HEURISTIC METHODS TRAINING
In the main text, we consistently employ the average of multiple runs for the same instance to provide a representative overview of its general properties. Here we present the residual energy for 30 individual runs on a single WPE instance with N = 60 and α = 0.2.
The residual energies of the 30 independent runs are presented in Fig. S6, which shows the randomness inherent in heuristic training. There is substantial divergence in the outcomes of the 30 independent runs of the neural networks, with some runs yielding results that are as much as 168.94% to 411.36% higher than the minimum residual energy achieved by the respective method. This significant variation underscores the necessity of employing the average of multiple runs to accurately reflect the general properties in the other experiments.

IV. THE APPROXIMATION RATIO RESULTS

We show the approximation ratio results corresponding to Fig. 4 of the main text in Fig. S7. Similar to the residual energy, MPVAN always provides a better approximation of the energy of the ground state than the benchmark algorithms on the WPE and the SK model. Notably, for the WPE with N = 30 and N = 40, there are some instances where some methods can find the ground state with a non-negligible probability. However, when N ≥ 50, none of the methods can identify the ground state for any of the instances.

V. THE TRAINING SPEED
We also evaluate the training speeds of VCA, VAN, and MPVAN to provide a comprehensive picture of their computational efficiency. The training speed of VCA is slower than that of VAN and MPVAN due to its reliance on the RNN structure: in the RNN, hidden units in the same layer must be computed sequentially, whereas VAN and MPVAN can be computed in parallel. To quantify these differences, we record the running speeds of MPVAN, VAN, and VCA, with the results summarized in Tab. III. The data represent the time required for a single training step on the same instance, and all methods are evaluated on the same NVIDIA 3090 GPU with the same hyperparameters.
As presented in Tab. III, the training time of VCA for identical parameters is approximately N/10 times longer than that of VAN and MPVAN for instances with system size N. To keep the computational requirements manageable, we limit comparisons with VCA to instances with N ≤ 100 in the remaining experiments.

VI. THE NEGATIVE ENTROPY AT DIFFERENT LEARNING RATES
Here we show the negative entropy at different learning rates as a supplement to Fig. 3 of the main text. First, the distribution of MPVAN is consistently more uniform than that of VAN. When mode collapse emerges in VAN, MPVAN still has a smaller negative entropy, indicating a more uniform distribution. Therefore, MPVAN delays the emergence of mode collapse. The negative entropy of MPVAN is smaller because the message passing changes the variational distribution, making some conditional probabilities small and thus decreasing the joint probability of the configurations.
Second, when the learning rate is greater than 0.004, there is a decrease in negative entropy during training. This is because, when the learning rate is large, the training samples have a strong impact on the back-propagation of the neural network. This effect gets stronger as the learning rate increases, making some conditional probabilities particularly small, so the negative entropy becomes extremely small. Also, the minimum negative entropy decreases with increasing learning rate.
In summary, regardless of the choice of learning rate, the distribution of MPVAN is always more uniform than that of VAN, and mode collapse occurs at lower temperatures. Therefore, MPVAN delays the emergence of mode collapse.

FIG. 2. Schematic diagram of the network architecture of MPVAN and the four autoregressive message passing mechanisms, shown on a problem instance with 3 edges and 4 spins. The spins are represented by the numbers 1 to 4, and the node features are denoted h_i, i = 1, 2, 3, 4. (a) The network architecture of MPVAN. The spin configuration s = {±1}^N is the input to the network, ŝ is the output, and h denotes the hidden layer. The ⟨s⟩_MP and ⟨h⟩_MP are updated from s and h by message passing, respectively. The brown solid arrows indicate that neighboring nodes participate in message passing, while the brown dashed arrows indicate that there are connections between neighboring nodes but message passing is not performed, to preserve the autoregressive property. The {a_ij} are the coefficients in the message passing process, which vary for different message passing mechanisms. (b) The processes of the four autoregressive message passing mechanisms when updating h_3. Under the MP mechanism used in VAN [3], message passing is not performed, which is equivalent to the identity transformation of h_3. Under the MP mechanism used in GCon-VAN [8], message passing is performed according to the adjacency matrix A, which updates h_3 based on the topological structure of the graph. For the Graph MP mechanism we designed, message passing is performed using the couplings J_ij of the Hamiltonian, which updates h_3 based on the couplings and Z_3 = |J_31| + |J_32|. The Hamiltonians MP mechanism we designed updates h_3 based on the couplings and the values of the neighboring spins s_1 and s_2, and is also the message passing mechanism used in MPVAN.

FIG. 6. The KL divergence and relative errors as functions of β on the WPE with N = 20 and α = 0.05. (a) The KL divergence D_KL(q_θ||p) between the variational distribution q_θ and the Boltzmann distribution p. The inset shows D_KL(q_θ||p) when β is small. (b) The relative error of F_q with respect to the exact free energy F. The inset shows the relative errors when β is small.

FIG. S2. For the five MP mechanisms, the residual energy histogram on the WPE with system size N = 60 and α = 0.2, the same as in Fig. 1. Each method contains 9 × 10^6 configurations obtained from 30 instances, each for 30 runs. Note that the results of Spin GCon-VAN are too poor to be shown in the figure.

FIG. S3. For MPVAN and existing methods, the residual energy per site on the 30 WPE instances with N = 60 and α = 0.2, each for 30 runs.
FIG. S5. Schematic of the multi-layer autoregressive message passing process and experimental results for MPVAN with different numbers of layers. (a) The multi-layer autoregressive message passing process. For node 10, we highlight the neighboring nodes within 3 hops that it employs in message passing using different colors (the message passing is illustrated by the solid arrows, and the neighborhoods are illustrated by the k = 1, 2, 3 circles for node 10); nodes 11 and 12 are not utilized, to preserve the autoregressive property. (b) The residual energy per site as a function of the number of layers for MPVAN across system sizes N = 30, 60, and 90 on the WPE with α = 0.2.
FIG. S6. The residual energy of 30 independent runs on the 0th instance in Fig. 4 of the main text.
FIG. S7. The approximation ratio results corresponding to Fig. 4 of the main text. (a) The results on the WPE. (b) The results on the SK model.

TABLE I. The mean and standard deviation, in Fig. 1, Fig. S1, and Fig. S2, of the energy of the 9 × 10^6 configurations drawn from each method.
a Here MPVAN denotes the combination of VAN with Hamiltonians MP.

TABLE II. The ϵ_res and std(ϵ_res) of all mechanisms combined with VAN on 30 WPE instances with N = 60 and α = 0.2.

TABLE III. Time for one training step of MPVAN, VAN, and VCA.