Belief propagation: accurate marginals or accurate partition function—where is the difference?

We analyze belief propagation on patch potential models (attractive models with varying local potentials), obtain all of the potentially many fixed points, and gain novel insights into the properties of belief propagation. In particular, we observe and theoretically explain several regions in the parameter space that behave fundamentally differently. We specify and elaborate on one specific region that, despite the existence of multiple fixed points, is relatively well behaved and provides insights into the relationship between the accuracy of the marginals and the partition function. We demonstrate that no principled relationship exists between the two quantities and provide sufficient conditions for a fixed point to be optimal with respect to approximating both the marginals and the partition function.


INTRODUCTION
The marginals and the partition function can be computed in a straightforward manner for tree-structured models but require efficient approximation methods if the graphical model contains loops. One such method is belief propagation (BP), which exploits the structure of probabilistic graphical models in order to approximate the marginal distribution and the partition function.
BP often provides accurate approximations and has been successfully applied in many areas, including speech and image processing, social network analysis, and error-correcting codes, despite the lack of convergence and performance guarantees (Koller and Friedman, 2009; Pernkopf et al., 2014). The approximation accuracy may be severely affected by the existence of multiple fixed points with varying accuracy. Although obtaining and combining all fixed points is a well-established practice in the optimization literature (Braunstein et al., 2005; Kroc et al., 2007), the computation of all fixed points is a hard problem in its own right for models with more general potentials (Knoll et al., 2018b; Srinivasa et al., 2016).
BP is directly related to the Bethe free energy (Yedidia et al., 2005), and there is a fairly substantial literature that provides provably convergent algorithms by operating on the Bethe free energy. In particular, this includes methods that aim to obtain (Welling and Teh, 2003) or at least approximate (Shin, 2012; Weller and Jebara, 2014) the global minimum of the (non-convex) Bethe free energy. The approximated partition function, i.e., the Bethe partition function, bounds the exact partition function for attractive models (Ruozzi, 2012), which implies that the global minimum of the Bethe free energy provides the most accurate partition function. Similar properties are not known for the marginal accuracy and, except for rather simple models (Knoll et al., 2018b), it remains an open question whether the most accurate marginals are obtained at the global minimum of the Bethe free energy.
In this work, we analyze the difference between accurate marginals and an accurate partition function. To this end, we go beyond well-established models (e.g., attractive models with identical or random potentials) and introduce a rich class of attractive models with inherent structure: patch potential models. These models exhibit many interesting phenomena and provide deep insights into the relationship between the approximation quality of the marginals and the partition function.
We discuss the properties of the solution space and empirically show that: (i) three different regions with fundamentally different properties exist; (ii) although it is often infeasible to obtain and combine all fixed points, there exists one region that allows us to do so; (iii) no principled relationship exists between the approximation quality of the marginals and the partition function, and there are fixed points that provide the most accurate Bethe partition function but not the most accurate marginals.
We formally define a well-behaved region that has only exponentially many (in the number of patches) fixed points, provide conditions for the existence of this region, and show why all fixed points are stable. The fact that only a limited number of fixed points exist, all of which are stable, further allows us to obtain the exact marginals and partition function by repeated (potentially parallel) application of BP.
Moreover, we theoretically demonstrate how the accuracy of the marginals can be expressed as a ratio of Bethe partition functions. This result further clarifies why the fixed point that provides the most accurate marginals need not be the fixed point that provides the most accurate partition function. Additionally, we provide sufficient conditions for the global minimum of the Bethe free energy to provide the most accurate marginals.
This paper is structured as follows: In Sec. 2 we review some background on probabilistic graphical models, introduce BP, and provide the connection to the Bethe approximation. In Sec. 3 we specify the models considered in this paper. Then, in Sec. 4 we focus on patch potential models and discuss different performance regions. We provide formal arguments in Sec. 5 that explain the empirical observations and lead to novel insights into the relationship between the marginal accuracy and the value of the Bethe free energy, before finally concluding the paper in Sec. 6.

BACKGROUND
This section serves as a brief introduction to probabilistic graphical models. We further introduce the BP algorithm and show how it connects to the Bethe approximation.

PROBABILISTIC GRAPHICAL MODELS
First, we consider an undirected graph $G = (X, E)$ that consists of a set of $N$ nodes $X = \{X_1, \ldots, X_N\}$ and a set of undirected edges $E$, where any edge $(i, j) \in E$ joins two nodes $X_i$ and $X_j$. Note that we consider only graphs with a single edge between any pair of nodes, i.e., $(j, i) = (i, j)$. For each node $X_i \in X$ we denote the set of neighbors by
$$\partial(i) = \{X_j \in X : (i, j) \in E\}.$$
Then, an undirected probabilistic graphical model $U = (G, \Psi)$ is defined by an undirected graph $G = (X, E)$ and by a set of $K$ potentials $\Psi = \{\Phi_1, \ldots, \Phi_K\}$. The random variables $X_i \in X$ are in one-to-one correspondence with the nodes and take values $x_i \in \mathcal{X}$. In this work, we focus on pairwise models, where all potentials consist of two variables at most and the joint distribution $P_X(x)$ factorizes according to
$$P_X(x) = \frac{1}{Z} \prod_{(i,j) \in E} \Phi(x_i, x_j) \prod_{i=1}^{N} \Phi(x_i). \tag{1}$$
The normalization function $Z$ is the partition function and is of central interest in this work. The partition function can be obtained by minimizing the (Gibbs) free energy $F$, with $\min F = -\ln Z$ (Yedidia et al., 2005). Another important quantity is the marginal distribution, where the singleton marginals
$$P_{X_i}(x_i) = \sum_{x \setminus x_i} P_X(x) \tag{2}$$
are of particular interest. Note that the above problems are in fact equivalent, as $F$ attains its minimum precisely for the marginal distribution; both, however, require a summation over all configurations $x \in \mathcal{X}^N$ and are therefore intractable in general (Cooper, 1990).
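To make the cost of exact inference concrete, the following is a minimal sketch that evaluates the partition function and the singleton marginals of a pairwise model by summing over all configurations. The function name `brute_force`, the callable arguments, and the three-node chain are illustrative assumptions and not part of the original work; the approach is only feasible for very small $N$.

```python
import itertools
import math

def brute_force(n_nodes, edges, pair_pot, node_pot, states=(-1, +1)):
    """Return (Z, marginals) by summing over all len(states)**n_nodes configurations."""
    Z = 0.0
    marg = [{s: 0.0 for s in states} for _ in range(n_nodes)]
    for x in itertools.product(states, repeat=n_nodes):
        # Unnormalized weight: product of all local and pairwise potentials.
        w = 1.0
        for i in range(n_nodes):
            w *= node_pot(i, x[i])
        for (i, j) in edges:
            w *= pair_pot(i, j, x[i], x[j])
        Z += w
        for i in range(n_nodes):
            marg[i][x[i]] += w
    for m in marg:
        for s in states:
            m[s] /= Z
    return Z, marg

# Hypothetical 3-node chain with exponential (Ising-type) potentials.
J, theta = 0.5, 0.2
Z, marg = brute_force(
    3, [(0, 1), (1, 2)],
    pair_pot=lambda i, j, xi, xj: math.exp(J * xi * xj),
    node_pot=lambda i, xi: math.exp(theta * xi),
)
```

The loop visits $|\mathcal{X}|^N$ configurations, which is the exponential cost that motivates approximate methods such as BP.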

BELIEF PROPAGATION
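Belief propagation (BP) is an iterative method to obtain the marginal distribution and the partition function on tree-structured graphs and to approximate these quantities for graphs that contain loops. Identical principles are also applied in the sum-product algorithm in information theory and in the Bethe method in physics; for excellent overviews see, e.g., (Kschischang et al., 2001; Mezard and Montanari, 2009).
BP recursively exchanges messages between random variables: let $n$ denote the current iteration; then the messages from $X_i$ to $X_j$ are updated according to
$$\mu_{ij}^{n+1}(x_j) \propto \sum_{x_i \in \mathcal{X}} \Phi(x_i, x_j)\, \Phi(x_i) \prod_{X_k \in \partial(i) \setminus \{X_j\}} \mu_{ki}^{n}(x_i), \tag{3}$$
where $\partial(i)$ is the set of neighbors of $X_i$. The messages require some normalization, e.g., so that $\sum_{x_j \in \mathcal{X}} \mu_{ij}(x_j) = 1$. The set of messages $\mu^n$ contains the messages along all edges at iteration $n$; if the update equation (3) does not change the values of the messages, i.e., if $\mu^{n+1} = \mu^n$, then BP has converged to a fixed point with the corresponding fixed point messages $\mu^{\circ}$.
The approximate singleton marginals $\tilde{P}_{X_i}(x_i)$ and pairwise marginals $\tilde{P}_{X_i,X_j}(x_i, x_j)$ are computed from the fixed point messages of all neighbors by
$$\tilde{P}_{X_i}(x_i) = \frac{1}{Z_i} \Phi(x_i) \prod_{X_k \in \partial(i)} \mu_{ki}^{\circ}(x_i),$$
$$\tilde{P}_{X_i,X_j}(x_i, x_j) = \frac{1}{Z_{ij}} \Phi(x_i)\, \Phi(x_j)\, \Phi(x_i, x_j) \prod_{X_k \in \partial(i) \setminus \{X_j\}} \mu_{ki}^{\circ}(x_i) \prod_{X_l \in \partial(j) \setminus \{X_i\}} \mu_{lj}^{\circ}(x_j),$$
and are normalized by $Z_i, Z_{ij} \in \mathbb{R}$. The set of all approximated marginals constitutes the pseudomarginals $\tilde{P}_B$. Note that there are possibly multiple fixed points (cf. Sec. 3): we index all fixed points by $m = 1, \ldots, M$ and denote the pseudomarginals that belong to a certain fixed point by $\tilde{P}_B^m$. We say that a fixed point $m$ is stable if a neighborhood exists such that BP converges to $\tilde{P}_B^m$ if initialized inside this neighborhood.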
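The message update and the computation of the singleton pseudomarginals can be sketched as follows for binary variables with potentials $\Phi(x_i, x_j) = \exp(J x_i x_j)$ and $\Phi(x_i) = \exp(\theta_i x_i)$. This is an illustrative sketch under stated assumptions: the function name `run_bp`, the parallel update schedule, the uniform initialization, and the convergence tolerance are choices made here and are not taken from the original experiments.

```python
import math

def run_bp(n, edges, J, theta, init=None, max_iter=1000, tol=1e-10):
    """Parallel sum-product updates; returns (beliefs P(x_i=+1), converged flag)."""
    states = (-1, +1)
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    # mu[(i, j)][s] is the message from X_i to X_j evaluated at x_j = states[s].
    mu = {}
    for i, j in edges:
        mu[(i, j)] = [0.5, 0.5]
        mu[(j, i)] = [0.5, 0.5]
    if init:
        mu.update(init)  # optional (assumed) custom initialization
    diff = float("inf")
    for _ in range(max_iter):
        new = {}
        for (i, j) in mu:
            out = []
            for xj in states:
                total = 0.0
                for s, xi in enumerate(states):
                    prod = math.exp(J * xi * xj + theta[i] * xi)
                    for k in nbrs[i]:
                        if k != j:
                            prod *= mu[(k, i)][s]
                    total += prod
                out.append(total)
            z = out[0] + out[1]
            new[(i, j)] = [out[0] / z, out[1] / z]  # normalize each message
        diff = max(abs(new[e][s] - mu[e][s]) for e in mu for s in (0, 1))
        mu = new
        if diff < tol:
            break
    beliefs = []
    for i in range(n):
        b = [math.exp(theta[i] * xi) for xi in states]
        for k in nbrs[i]:
            b = [b[s] * mu[(k, i)][s] for s in (0, 1)]
        beliefs.append(b[1] / (b[0] + b[1]))
    return beliefs, diff < tol
```

On a tree-structured graph the returned beliefs coincide with the exact marginals; on graphs with loops the result depends on the initialization whenever multiple fixed points exist.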

BETHE APPROXIMATION
BP is closely related to concepts from statistical mechanics; in particular, fixed points of BP correspond to stationary points of the Bethe free energy $F_B$, which is constrained to the set $\mathbb{L}$ of valid pseudomarginals, i.e., pseudomarginals that are normalized and locally consistent. The Bethe free energy $F_B(\tilde{P}_B) = E_B(\tilde{P}_B) - S_B(\tilde{P}_B)$ is a function of the singleton and pairwise marginals and is defined by the average energy $E_B(\tilde{P}_B)$ and the Bethe entropy $S_B(\tilde{P}_B)$. Moreover, the Bethe free energy relates to the Bethe partition function according to $Z_B(\tilde{P}_B) = \exp\big(-F_B(\tilde{P}_B)\big)$. An excellent treatment of free energy approximations and how they relate to BP can be found, e.g., in (Yedidia et al., 2005; Wainwright and Jordan, 2008; Mezard and Montanari, 2009). Most importantly, local minima $F_B^m$ relate to the fixed points of BP $\tilde{P}_B^m$ according to $F_B^m = F_B(\tilde{P}_B^m)$, where $\tilde{P}_B^m$ is the argument that corresponds to the local minimum $F_B^m$, i.e., every stable fixed point of BP corresponds to a local minimum of $F_B$ (Heskes et al., 2003).
This correspondence put BP on a solid theoretical foundation and also paved the way for many methods that operate on $F_B$ directly. As $F_B$ may, however, be non-convex (cf. Sec. 4), considerable effort has gone into convex relaxations that correspond to provably convergent message passing algorithms (Globerson and Jaakkola, 2007; Hazan and Shashua, 2008; Meltzer et al., 2009). Nonetheless, the results obtained by minimizing the Bethe approximation are often more accurate (Meshi et al., 2009). There are methods that can efficiently (i.e., in polynomial runtime) minimize the Bethe free energy for restricted model classes: in particular, these include sparse models (Shin, 2012) and attractive models (Weller and Jebara, 2014).
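As an illustration of how $F_B = E_B - S_B$ is evaluated from pseudomarginals, the sketch below uses the standard textbook form of the Bethe free energy for pairwise models (Yedidia et al., 2005): $E_B$ is the expected negative log-potential and $S_B$ combines the pairwise entropies with the singleton entropies weighted by $(d_i - 1)$, where $d_i$ is the degree of $X_i$. The function name `bethe_free_energy`, the dictionary layout of the pseudomarginals, and the restriction to potentials $\exp(J x_i x_j)$ and $\exp(\theta_i x_i)$ are assumptions made here for illustration.

```python
import math

def bethe_free_energy(n, edges, J, theta, p1, p2):
    """F_B = E_B - S_B for spins in {-1, +1}.

    p1[i][x] and p2[(i, j)][(xi, xj)] hold the singleton and pairwise
    pseudomarginals (assumed data layout).
    """
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1

    def xlogx(p):
        return p * math.log(p) if p > 0 else 0.0

    energy = 0.0
    entropy = 0.0
    for (i, j) in edges:
        for (xi, xj), p in p2[(i, j)].items():
            energy -= p * J * xi * xj      # -E[ln Phi(x_i, x_j)]
            entropy -= xlogx(p)            # pairwise entropy term
    for i in range(n):
        for xi, p in p1[i].items():
            energy -= p * theta[i] * xi    # -E[ln Phi(x_i)]
            entropy += (deg[i] - 1) * xlogx(p)  # correction for over-counting
    return energy - entropy
```

For tree-structured graphs and exact marginals this expression equals $-\ln Z$, so that $Z_B = \exp(-F_B)$ recovers the exact partition function; for graphs with loops it is only an approximation.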

MODEL SPECIFICATIONS
We focus on one specific model: binary pairwise models, where every random variable $X_i$ takes values $x_i \in \mathcal{X} = \{-1, +1\}$.$^{1}$ The local and the pairwise potentials are specified by couplings $J_{ij} \in \mathbb{R}$ that act on each edge $(i, j) \in E$ and by local fields $\theta_i \in \mathbb{R}$ that act on each random variable $X_i \in X$ according to $\Phi(x_i, x_j) = \exp(J_{ij} x_i x_j)$ and $\Phi(x_i) = \exp(\theta_i x_i)$. The joint distribution from (1) consequently factorizes according to
$$P_X(x) = \frac{1}{Z} \exp\Big( \sum_{(i,j) \in E} J_{ij} x_i x_j + \sum_{i=1}^{N} \theta_i x_i \Big).$$
We consider only finite-size attractive models$^{2}$ where all edges are attractive, i.e., where all couplings $J_{ij} > 0$ are positive. Specifically, we consider models with equal couplings $J_{ij} = J$ for all edges $(i, j) \in E$. Three different types of attractive models can be distinguished that show increasingly complex behavior: (i) attractive models with vanishing local fields $\theta_i = 0$; (ii) attractive models with unidirectional fields, i.e., either $\theta_i < 0$ or $\theta_i > 0$ for all variables; (iii) finally, attractive models with arbitrary local fields. Such models are particularly interesting in terms of their phase transitions and are studied in physics under the name of ferromagnetic random-field Ising models (RFIM), where all $\theta_i$ are drawn according to some distribution.
Attractive models with vanishing fields have either a unique fixed point or two symmetric fixed points, both for infinite-size models (Mezard and Montanari, 2009) and for finite-size models (Knoll et al., 2018b). The marginals of two fixed points $m$ and $k$ are considered symmetric if
$$\tilde{P}_{X_i}^m(x_i) = \tilde{P}_{X_i}^k(-x_i) \tag{10}$$
for all $X_i$. An immediate consequence of (10) is that symmetric fixed points must also have the same value of $F_B$.
Attractive models with unidirectional fields show a similar behavior and have two fixed points that, although not exactly symmetric, are almost symmetric.
Another important concept is that of flipped random variables: a random variable is flipped if the marginals are not aligned with the local potential, i.e., if $\operatorname{sgn}(\theta_i) \neq \operatorname{sgn}\big(\tilde{P}_{X_i}(x_i = 1) - 0.5\big)$. We further say that a fixed point is state-preserving if no random variable is flipped. If all marginals are in favor of the same state $x_i$, i.e., if $\tilde{P}_{X_i}(x_i) > 0.5$ for all $X_i$, we call the corresponding fixed point biased towards $x_i$.
Attractive models with arbitrary local fields exhibit many non-trivial properties, may have a complex energy landscape, and are studied as one of the simplest forms of disordered systems (Young, 1998). Disordered systems are systems that potentially have many fixed points and in which many random variables are flipped.
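The notions of flipped variables, state-preserving fixed points, and biased fixed points can be checked mechanically from the singleton pseudomarginals. The helper names below (`flipped_variables`, `is_state_preserving`, `bias`) and the example values are hypothetical; the beliefs are assumed to be given as the probabilities of $x_i = +1$.

```python
def flipped_variables(theta, p_plus):
    """Indices i whose belief P(x_i=+1) disagrees with the sign of theta_i."""
    return [i for i, (t, p) in enumerate(zip(theta, p_plus))
            if (t > 0 and p < 0.5) or (t < 0 and p > 0.5)]

def is_state_preserving(theta, p_plus):
    """A fixed point is state-preserving if no variable is flipped."""
    return not flipped_variables(theta, p_plus)

def bias(p_plus):
    """Return +1 or -1 if all beliefs favor that state, None otherwise."""
    if all(p > 0.5 for p in p_plus):
        return +1
    if all(p < 0.5 for p in p_plus):
        return -1
    return None

# Hypothetical four-variable model with two positive and two negative fields.
theta = [0.1, 0.1, -0.1, -0.1]
aligned = [0.8, 0.7, 0.3, 0.2]     # follows the local fields
all_plus = [0.9, 0.9, 0.8, 0.7]    # every belief favors x_i = +1
```

In this example `aligned` is state-preserving and unbiased, whereas `all_plus` is biased towards $+1$ and has the two variables with negative fields flipped.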

PATCH POTENTIAL MODELS
The definition of patch potential models follows the definition of the RFIM, with the main difference that the local potentials are not i.i.d. but are correlated between neighboring random variables. Moreover, we only consider models with identical magnitudes for all local fields, albeit possibly with different signs, i.e., $\theta_i \in \{-\theta, +\theta\}$.
Definition 1. Patch potential models are binary pairwise models in accordance with (11) that have attractive couplings $J_{ij} = J > 0$ and that consist of multiple non-overlapping patches $G_i$, each with a unidirectional local field, i.e., all variables inside a patch experience the same local field $+\theta$ or $-\theta$.
Note that we only consider models with sufficiently large patches, so that the exact marginals are state-preserving. Let us first consider a minimal example that is rich enough to exhibit some non-trivial (i.e., non-symmetric) fixed points while being structured enough to admit only few fixed points. This example serves as a model that allows us to gain some intuition (cf. Sec. 4) before we discuss the properties of patch potential models in a more general manner (cf. Sec. 5).
Figure 1: Exact solution for Example 1. Nodes are depicted in orange if $P_{X_i}(X_i = 1) > 0.5$ and in blue otherwise; the opacity illustrates the value of the marginals.
Example 1. Let $G = (X, E)$ be a regular two-dimensional grid graph of size $n \times n$ with two equal-sized patches. All variables in $G_1$ experience a positive local field $\theta_1 = \theta$, whereas all variables in $G_2$ experience the negative local field $\theta_2 = -\theta$ (cf. Fig. 1).
The patch potential model is especially appealing as the composition of relatively few patches admits a simplified treatment and comes with a couple of beneficial properties. In particular, we can identify a region in the parameter space $(\theta, J)$ that features a structured and well-behaved solution space (cf. Sec. 4.2).
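A model in the spirit of Example 1 can be constructed as follows. The sketch assumes that the two patches are the left and the right half of the grid; the text does not fix the exact layout, so this split, the function name `two_patch_grid`, and the row-major node indexing are illustrative assumptions.

```python
def two_patch_grid(n, theta):
    """Edges and local fields of an n x n grid with two equal-sized patches.

    Assumed layout: columns 0..n/2-1 form patch G_1 (+theta),
    the remaining columns form patch G_2 (-theta).
    """
    def idx(r, c):
        return r * n + c  # row-major node index

    edges = []
    for r in range(n):
        for c in range(n):
            if c + 1 < n:
                edges.append((idx(r, c), idx(r, c + 1)))  # horizontal edge
            if r + 1 < n:
                edges.append((idx(r, c), idx(r + 1, c)))  # vertical edge
    fields = [theta if c < n // 2 else -theta
              for r in range(n) for c in range(n)]
    return edges, fields

edges, fields = two_patch_grid(10, 0.1)
# Edges that connect variables with opposite local fields lie on the patch boundary.
boundary = [(i, j) for i, j in edges if fields[i] * fields[j] < 0]
```

For the $10 \times 10$ grid this yields 180 edges, 50 variables per patch, and 10 boundary edges between the two patches.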

FIXED POINT BEHAVIOR
If BP converges, it often provides accurate results; however, if multiple fixed points exist, the performance may vary considerably between different fixed points. We briefly introduce the RSB (replica symmetry breaking) assumption that expresses the exact marginals as a combination of all fixed points and illustrate why its success is limited to optimization problems so far (Sec. 4.1).
Then, we discuss the solution space of Example 1 over a range of parameters and specify different regions according to the structure of the solution space (Sec. 4.2).
Assessing the approximation quality of a specific fixed point is required to state performance guarantees of BP. We recapitulate existing results (Sec. 4.3) and discuss how the error of the pseudomarginals and the error of the Bethe partition function are related for patch potential models (Sec. 4.4).

COMBINATION OF FIXED POINTS
Throughout, we omit the unstable fixed points that correspond to local maxima of $F_B$. Note that the number of fixed points is always finite (Watanabe and Fukumizu, 2009).
Studying systems with such complex energy landscapes lies at the heart of the RSB theory. The RSB theory describes the decomposition of the exact solution into a convex combination of marginals that are weighted by their respective partition function so that
$$P_{X_i}(x_i) = \frac{\sum_{m=1}^{M} Z_B^m\, \tilde{P}_{X_i}^m(x_i)}{\sum_{m=1}^{M} Z_B^m}. \tag{15}$$
This representation can be attributed to Mézard et al. (1987) and, rather than being a theorem, is a set of postulates.$^{3}$ One underlying assumption is that the system actually exhibits multiple fixed points (unique fixed points would imply exact marginals otherwise); an accessible introduction to the RSB theory and all underlying assumptions can be found in (Mezard and Montanari, 2009, Ch. 19). Despite its non-rigorous flavor, (15) has been verified for a wide range of problems (e.g., random SAT problems and spin glasses). In particular, many state-of-the-art solvers for combinatorial problems rely on the RSB theory (Ravanbakhsh and Greiner, 2015).
Obtaining all fixed points that correspond to local minima of the Bethe free energy is a complex task that is only possible for small-scale models (Knoll et al., 2018b) and for models with a certain structure (e.g., random graphs (Coja-Oghlan and Perkins, 2019)) or potential type (e.g., for optimization problems (Zdeborová and Krzakala, 2016)). One efficient way to evaluate (15) for constraint satisfaction problems is known as survey propagation (Braunstein et al., 2005). The extension to more general models, however, still remains somewhat elusive.
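The weighted combination of fixed points can be sketched as below, assuming that the decomposition weights each fixed point $m$ by $Z_B^m = \exp(-F_B^m)$ and that the sum of all $Z_B^m$ serves as the combined estimate of the partition function. The function name `combine_fixed_points` and the input layout are hypothetical; the weights are computed relative to the smallest free energy to avoid numerical overflow.

```python
import math

def combine_fixed_points(free_energies, marginals):
    """Combine fixed points with weights proportional to Z_B^m = exp(-F_B^m).

    free_energies: list of F_B^m, one per fixed point.
    marginals:     list of belief vectors P^m(x_i=+1), one per fixed point.
    Returns (combined marginals, ln of the summed Bethe partition functions).
    """
    f_min = min(free_energies)
    w = [math.exp(-(f - f_min)) for f in free_energies]  # shifted for stability
    total = sum(w)
    log_z = -f_min + math.log(total)
    n = len(marginals[0])
    combined = [sum(w[m] * marginals[m][i] for m in range(len(w))) / total
                for i in range(n)]
    return combined, log_z

# Two hypothetical symmetric fixed points with identical Bethe free energy.
combined, log_z = combine_fixed_points([1.0, 1.0], [[0.9, 0.8], [0.1, 0.2]])
```

For two symmetric fixed points with equal $F_B$ the combination yields uniform marginals, which is the expected exact result for a model with vanishing local fields.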

APPROXIMATE SURVEY PROPAGATION
Survey propagation was recently applied to models similar to those in this work (Srinivasa et al., 2016). This was achieved by assuming that the fraction $P_\mu^m$ of randomly initialized BP runs that converge to the $m$-th fixed point provides an approximation of the partition function $Z_B^m$. This assumption is valid for attractive models with vanishing local fields; yet it is unclear how this generalizes to models with non-vanishing local fields.
We aim to validate the assumption for regular grid graphs with $n \times n$ variables, $\theta = 0$, and couplings large enough to admit two fixed points. To this end, we compare both measures for both fixed points by relating the ratio between the partition functions $Z_B^1 / Z_B^2$ to the ratio $P_\mu^1 / P_\mu^2$. The log-ratio$^{4}$ between both measures is depicted in Fig. 2: one would expect a constant value close to zero if $P_\mu^m$ provided a good estimate of $Z_B^m$; this is clearly not the case, as $Z_B^1 / Z_B^2$ grows more rapidly. We conclude that the fraction of BP runs serves as a poor estimate of the partition function, with the consequence that an approximate evaluation of (15) leads to inaccurate marginals. This is particularly true as the local field and the model size increase.
This raises two immediate questions: (i) Can we specify certain model structures or parameter configurations that admit efficient methods to obtain all fixed points in order to evaluate (15)? (ii) If we obtain only a subset $S' \subset S$ of all fixed points, can we compare the available fixed points and select the best one?
$^{3}$ In physics one deals with the decomposition of the Gibbs measure (i.e., the joint distribution) into a weighted combination of Bethe measures (that correspond to BP fixed points).

SOLUTION SPACE
We analyze the solution space for a wide range of patch potential models to answer whether parameter configurations exist for which all fixed points can be obtained efficiently. A more formal analysis that explains the subsequent observations is presented in Sec. 5.
Let $G$ be a $10 \times 10$ grid graph with two equal-sized patches (Example 1). This model exhibits three different regions, separated by critical values $J_A(\theta)$ and $J_C(\theta)$; see Fig. 3 and Fig. 4 for an illustration of the decomposition into multiple fixed points according to (15).
A unique fixed point exists for $J < J_A(\theta)$, i.e., inside region (I), and BP converges; this fixed point is state-preserving but slightly overestimates the marginals (cf. Sec. 4.4). Additional fixed points emerge inside region (II) as the coupling strength increases to $J_A(\theta) < J < J_C(\theta)$. There are three fixed points (cf. Thm. 3) and all three fixed points are stable (cf. Thm. 4). These fixed points consist of two symmetric fixed points where all marginals favor one particular state and one state-preserving fixed point (cf. Sec. 4.4). As the coupling strength increases even further to $J > J_C(\theta)$, i.e., inside region (III), all three fixed points remain but are suddenly accompanied by many more fixed points. It therefore becomes increasingly hard to obtain all fixed points numerically, so that one can only hope to obtain a subset of all fixed points in practice.
The actual boundaries between the regions are estimated numerically and are depicted in Fig. 4. The fixed points are obtained by repeated application of BP (2000 times for each $(\theta, J)$) with different random initial conditions. Furthermore, we apply random scheduling to enhance the convergence properties, as any predetermined schedule would favor a specific fixed point.
To answer question (i) from Sec. 4.1.1: one region exists in the parameter space (illustrated in blue) for which all fixed points can be obtained efficiently. For region (III) (illustrated in red), however, the number of fixed points suddenly increases and we cannot rely on BP to obtain all fixed points.
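When BP is run repeatedly from random initial conditions, the resulting belief vectors have to be grouped into distinct fixed points. A minimal sketch of this bookkeeping step is given below; the function name `distinct_fixed_points`, the maximum-norm comparison, and the tolerance are assumptions made here and are not taken from the original experiments. The returned fractions correspond to the share of runs that end in each fixed point.

```python
def distinct_fixed_points(runs, tol=1e-4):
    """Group belief vectors that agree up to tol (maximum norm).

    runs: list of belief vectors, one per converged BP run.
    Returns a list of (representative belief vector, fraction of runs).
    """
    reps, counts = [], []
    for b in runs:
        for k, r in enumerate(reps):
            if max(abs(x - y) for x, y in zip(b, r)) < tol:
                counts[k] += 1
                break
        else:
            # No existing representative is close enough: new fixed point.
            reps.append(list(b))
            counts.append(1)
    return [(r, c / len(runs)) for r, c in zip(reps, counts)]

# Hypothetical outcome of four BP runs on a two-variable model.
runs = [[0.9, 0.9], [0.1, 0.1], [0.9, 0.90001], [0.9, 0.9]]
found = distinct_fixed_points(runs)
```

In the example, three of the four runs end in the same fixed point, so two distinct fixed points with fractions 0.75 and 0.25 are reported.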

APPROXIMATION ACCURACY
Let $J > J_C(\theta)$ and assume that only a subset $S' \subset S$ of all fixed points is provided; then, how can we select the best one?
Unfortunately, there is no way to tell how accurate a particular fixed point is if we do not have access to the exact solution. It is therefore an important problem in its own right to measure the accuracy, or at least to provide a bound on the approximation error. We first discuss established results regarding the accuracy of both the Bethe partition function and the pseudomarginals. Subsequently, we delve into the particularities of patch potential models and show how the accuracy may differ between both objectives.

PARTITION FUNCTION
The error of the partition function $Z_B^m = Z_B(\tilde{P}_B^m)$ of the $m$-th fixed point is usually evaluated by the relative error $E_Z(m)$ of the log-partition functions (Gómez et al., 2007). Existing bounds on the partition function usually combine an upper bound (Wainwright et al., 2005; Jaakkola and Jordan, 1997) with some lower bound, e.g., the naive mean field (Wainwright and Jordan, 2008). Other bounds are based on the loop series expansion (Willsky et al., 2008) or the non-backtracking operator (Saade et al., 2014). For attractive models the Bethe partition function also lower-bounds the partition function, i.e., $Z_B \le Z$ (Ruozzi, 2012); obtaining the global minimum of $F_B$ is therefore optimal with respect to the error of the partition function, as $\operatorname{argmin}_m E_Z(m) = \operatorname{argmin}_m F_B^m$.

MARGINALS
We measure the error of the singleton marginals by the mean squared error (MSE) $E_P(m)$. Some results consider bounding the approximation error of the marginals instead of $E_Z(m)$, e.g., (Ihler, 2007; Mooij and Kappen, 2009; Leisink and Kappen, 2003; Weller and Jebara, 2014). We are not aware of an explicit relationship that connects both worlds except for homogeneous attractive models, i.e., models that have a single value $J$ for all edges and a single value $\theta$ for all variables (cf. Lm. 1). It is therefore often assumed that minimizing $F_B$ will be optimal in terms of marginal accuracy for more general models as well (cf. Knoll et al. (2018a); Weller et al. (2014)), i.e.,
$$\operatorname{argmin}_m E_P(m) = \operatorname{argmin}_m E_Z(m). \tag{16}$$
This is, however, not the case, as we show in Sec. 4.4.
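Both error measures can be sketched in a few lines. The exact normalizations are assumptions made here: the marginal error is taken as the mean squared difference of the probabilities of $x_i = +1$, and the partition function error as the absolute difference of the log-partition functions relative to the exact value. The function names `marginal_mse` and `log_partition_error` are hypothetical.

```python
def marginal_mse(p_exact, p_approx):
    """Mean squared error between exact and approximate P(x_i=+1) (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(p_exact, p_approx)) / len(p_exact)

def log_partition_error(log_z, log_z_bethe):
    """Relative error of the log-partition function (assumed form)."""
    return abs(log_z - log_z_bethe) / abs(log_z)

# Hypothetical values for illustration only.
e_p = marginal_mse([0.5, 0.5], [0.6, 0.4])
e_z = log_partition_error(2.0, 1.0)
```

With these definitions, selecting a fixed point by `e_z` and selecting it by `e_p` are two separate decisions; whether they coincide is exactly the question addressed in this section.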

MARGINALS AND PARTITION FUNCTION
We aim to evaluate the relationship between the accuracy of the pseudomarginals and the accuracy of the partition function, and whether (16) holds in general. First, we state that (16) does hold for homogeneous attractive models that have two fixed points at most (Weller et al., 2014); this is a direct consequence of (15).
Lemma 1. Attractive models with identical values $\theta_i = \theta$ have two fixed points for $J > J_A(\theta)$. The fixed point $m$ that minimizes $E_Z(m)$ further provides the global minimum $\min_{\mathbb{L}}(F_B)$ and minimizes $E_P(m)$ as well.
Second, we empirically validate whether minimizing $F_B$ will provide the most accurate marginals for Example 1. Fig. 5 illustrates the error in the marginals and the error in the partition function for all fixed points. The fixed point that provides the global minimum of $F_B$, and thus minimizes $E_Z(m)$, is emphasized in blue, the fixed point minimizing $E_P(m)$ is emphasized in red, whereas the fixed point minimizing both quantities jointly is emphasized in green.
Let us take a closer look at region (II) in particular: three fixed points exist that can be combined to yield the exact solution (see Fig. 1 for the exact solution). Two of these fixed points, $r$ and $q$, are each biased towards one state and, because of the symmetric model, have identical values $F_B^r = F_B^q$. The state-preserving fixed point $p$, on the other hand, provides the most accurate marginals inside (II). However, while $p$ also provides the global minimum of $F_B$ for small values of $J$, Fig. 5 shows that $F_B^p$ turns into a local minimum for $J \ge 0.65$. No principled relationship between the accuracy of the marginals and the partition function can therefore be observed inside (II), and (16) does not necessarily hold (cf. Thm. 6).
For region (III) many more fixed points $(u, v, \ldots)$ emerge that all have similar values $E_Z(u)$ and $E_P(u)$; we visualize some of them in Fig. 5a. These fixed points provide slightly more accurate marginals than the state-preserving one, although it should be noted that none of the fixed points approximates the marginals well inside (III). In contrast, considering $E_Z(u)$, these additional fixed points provide the worst approximation of the partition function and have even higher values of $F_B$. The biased fixed points $r$ and $q$, which approximate the marginals worst, on the other hand approximate the partition function relatively well.
The above observations cannot, however, explain why fixed points exist that minimize the marginal error but are only local minima of $F_B$. Closer inspection of $F_B$ for different types of fixed points reveals a threshold (black dots in Fig. 4) below which (16) holds. Some mild assumptions on the solution space lead to a lower bound (17) on this threshold (cf. Thm. 7). This bound, illustrated by the solid black line in Fig. 4, becomes asymptotically exact. Note that the slope defined by (17) increases with the model size $N$, so that the global minimum of $F_B$ provides the most accurate marginals for a wider range of parameters.
Here we properly define the boundaries $J_A(\theta)$ and $J_C(\theta)$ between different regions and provide formal arguments that explain the observations from Sec. 4.2. While some properties are directly attributable to (15), several results are based on the fact that the patch potential model consists of multiple patches with a unidirectional local field. First, we need to prepare an alternative update equation that makes the interactions between two patches more explicit. For that purpose, we introduce an effective field that acts on the boundary of each patch and incorporates the influence from all other patches.
We refer to the appendix for the proofs and only state the theorems and discuss their implications. Additionally, we prepare some corollaries that simplify the results for models with two equal-sized patches as in Example 1.

EFFECTIVE FIELD
We introduce an effective field $\tilde{\theta}_i$ for all variables that lie on the patch boundary to incorporate the interactions with the neighboring patches.
Theorem 2 (Effective Field). Let $X_i$ be a variable on the boundary of patch $\mathbf{X}_i$ that receives messages from inside, i.e., from $X_k \in \mathbf{X}_i$, and from outside, i.e., from $X_j \in X \setminus \mathbf{X}_i$, the patch.
The effective field $\tilde{\theta}_i$ then acts on the boundary variable in place of the local field, with additive terms that account for the messages from outside the patch (18). Messages from outside the patch are now subsumed by $\tilde{\theta}_i$, and the additive terms in (18) will be positive if $\mu_{ji}(X_i = 1) > \mu_{ji}(X_i = -1)$ and negative otherwise. This is particularly important in the definition of the region boundaries and admits an "independent" treatment of every patch.
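One way to absorb incoming messages into a local field for spins in $\{-1, +1\}$ is sketched below. It relies on the identity that a normalized message $\mu(x_i)$ is proportional to $\exp(h x_i)$ with $h = \tfrac{1}{2} \ln\big(\mu(+1) / \mu(-1)\big)$, so that each outside message contributes an additive term with the sign behavior described above. Whether this coincides term by term with the expression in Theorem 2 is an assumption; the function name `effective_field` and the message layout are hypothetical.

```python
import math

def effective_field(theta_i, outside_messages):
    """Local field plus the log-ratio contributions of the outside messages.

    outside_messages: list of pairs (mu(x_i=-1), mu(x_i=+1)) received from
    neighbors outside the patch (assumed layout).
    """
    return theta_i + sum(0.5 * math.log(m_plus / m_minus)
                         for m_minus, m_plus in outside_messages)

# A message favoring x_i = +1 increases the field, one favoring -1 decreases it.
up = effective_field(0.1, [(0.2, 0.8)])
down = effective_field(0.1, [(0.8, 0.2)])
```

By construction, $\exp(\tilde{\theta}_i x_i)$ is proportional to $\exp(\theta_i x_i)$ multiplied by all outside messages, which is what allows a patch to be treated as if it were isolated with a modified field on its boundary.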

REGION (II)
The notion of an effective field (Thm. 2) allows us to define the boundaries between the three distinct performance regions of patch potential models. We discuss the solution space in detail and what can be said about the performance of BP. Let us denote the second region, i.e., the region where the global behavior can be inferred by treating the patches individually, by (II) $= \{(\theta, J)\}$.
Definition 2 (Region). A parameter set $(\theta, J) \in$ (II) if and only if the following conditions are satisfied:
(1.) Let $J_A(G_i, \theta)$ denote the critical value for the couplings beyond which multiple fixed points exist.$^{6}$ Then every patch $G_i \in G$ must have its respective threshold below the actual coupling strength, i.e., $J_A(G_i, \theta) < J$.
(2.) Consider all pairs of patches $G_i$ and $G_j$; if one patch, e.g., $G_i$, has its variables flipped, the imposed effective field on the boundary must stabilize the second patch $G_j$ so that $J < J_A(G_j, \tilde{\theta}) = J_C(G_j, \theta)$.
These conditions implicitly define the "well-behaved" region (II). Def. 2.1 provides the lower boundary of region (II), as only a unique fixed point would exist otherwise. It may be less obvious how Def. 2.2 provides the upper boundary of region (II). Note that $J < J_A(G_j, \tilde{\theta})$ is a necessary condition if $G_i$ is flipped, as parts of $G_j$ would flip otherwise and lead to disordered behavior (cf. Fig. 5a). The restriction to (II) and the exclusion of disordered solutions further validate the RSB assumption (Mezard and Montanari, 2009, Ch. 19).

PROPERTIES OF REGION (II)
In this work we are particularly interested in understanding the behavior of BP inside region (II), which has the following properties.
Theorem 3 (Existence). Let $U$ be a patch potential model with $(\theta, J) \in$ (II). The number of fixed points $M$ grows with the number of patches (rather than the number of variables).
Thm. 3 is of great practical relevance for the RSB assumption (15), i.e., for whether a combination of BP fixed points can form the exact solution. The fact that there is a relatively small number of fixed points makes the task of obtaining them practically feasible. Existence alone, however, is not sufficient, as we have to rely on some numerical method that obtains all fixed points; if we aim to apply BP for that purpose, there is the additional requirement that all fixed points be stable. Fortunately, it turns out that all fixed points inside (II) are indeed stable.
Theorem 4 (Stability). Let $U$ be a patch potential model with $(\theta, J) \in$ (II). Then, every fixed point $\tilde{P}_B^m$ is a stable fixed point of BP.
Finally, as an immediate consequence of the limited number of fixed points (Thm. 3), all of which are stable (Thm. 4), it follows that the exact solution can be computed according to (15) in practice. One can, for example, apply BP repeatedly, possibly in parallel, with random initialization to obtain and combine all fixed points.

MARGINAL ACCURACY
Theorem 5 (Marginal Accuracy). The MSE of the singleton marginals $E_P(k)$ of the $k$-th solution $\tilde{P}_B^k$ relates to the ratio of the Bethe partition functions.
Representing the MSE according to Thm. 5 is particularly appealing as it avoids the need to express the exact marginals. This further provides a way to express the ratio of the marginal error between two fixed points.
Corollary 5.1. The MSE ratio of two fixed points $k$ and $l$ is a ratio of weighted partition functions (19).
Expressing the ratio of the marginal error according to (19) is advantageous in elaborating on the difference between the accuracy of the approximated marginals and the accuracy of the approximated partition function. We denote the mismatch between the marginals $\tilde{P}_{X_i}^m$ at two fixed points $k$ and $l$ by $Q_i(k, l)$. Now, let us denote the error of the state-preserving fixed point by $E_P(p)$ and the error of the fixed point that has all marginals biased towards $x_i = 1$ by $E_P(q)$. Then, perhaps unsurprisingly as the exact solution is state-preserving as well, we show that the state-preserving fixed point has the most accurate marginals.
Theorem 6 (Error Ratio). Let $U$ be a patch potential model with $(\theta, J) \in$ (II). The state-preserving fixed point $p$ provides more accurate marginals than the fixed point $q$ that has all marginals biased towards one state, i.e., $\frac{E_P(p)}{E_P(q)} < 1$.
In particular, for models with two equal-sized patches, we can simplify the error ratio (19) considerably.
Corollary 6.1 (Example 1). Let $d = Q_i(q, r) > 0$; then the error ratio (19) simplifies.
It follows that the state-preserving fixed point $p$ minimizes the marginal error inside (II) irrespective of $F_B^p$. This has drastic implications and rules out any general relationship between the fixed point minimizing the marginal error and the one minimizing the partition function error.

FIXED POINT MINIMIZING F B
However, despite Thm. 6, the question remains where the difference between $E_Z(m)$ and $E_P(m)$ stems from.
We answer this question and provide conditions for $\operatorname{argmin}_m E_Z(m) = \operatorname{argmin}_m E_P(m)$ to be valid. We further present an approximate condition for the state-preserving fixed point $p$ to simultaneously provide the most accurate marginals and minimize $F_B$. Let us define the following variables (cf. Sec. 6.2.7 in the appendix for a formal introduction): $E_P$ is the set of all boundary edges; $E_C$ is the set of edges between variables that favor different states; $N_f$ and $N_c$ are the respective numbers of flipped and non-flipped variables; and $\Delta S_B$ is the difference in the entropy between two fixed points. These sufficient conditions for (16) provide a guideline for when it is safe to select the fixed point according to the partition function value. This correspondence tends to hold for models with strong local potentials $\theta$ and with increased model size $N$, as shown in Cor. 7.1.
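The sufficient condition of Theorem 7, $2J(|E_P| - |E_C|) < \theta(N - N_c + N_f) + \Delta S_B$, can be evaluated directly once the quantities defined above are known. The sketch below only encodes this inequality; the function name `satisfies_condition_23`, the argument names, and the numerical values are hypothetical and serve as an illustration, not as results from the paper.

```python
def satisfies_condition_23(J, theta, n_boundary_edges, n_conflict_edges,
                           N, N_c, N_f, delta_S):
    """Check 2J(|E_P| - |E_C|) < theta (N - N_c + N_f) + Delta S_B."""
    lhs = 2.0 * J * (n_boundary_edges - n_conflict_edges)
    rhs = theta * (N - N_c + N_f) + delta_S
    return lhs < rhs

# Hypothetical numbers: 100 variables, 10 boundary edges, half of the
# variables flipped in the competing fixed point, entropy difference neglected.
weak_field = satisfies_condition_23(0.5, 0.05, 10, 0, 100, 50, 50, 0.0)
strong_field = satisfies_condition_23(0.5, 0.2, 10, 0, 100, 50, 50, 0.0)
```

In this illustration the condition fails for the weaker local field and holds for the stronger one, in line with the observation that the correspondence tends to hold for strong local potentials.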

CONCLUSION
In this paper we introduced and analyzed patch potential models and thus advanced the understanding of belief propagation's properties.In particular we inspected the difference between accurate marginals and an accurate partition function.
On the basis of our empirical evaluation and our theoretical analysis we gained several insights: (i) there exists a region for which the number of fixed points depends on the number of patches.This opens the door for methods that can efficiently obtain all fixed points to subsequently form the exact solution.(ii) We further demonstrated that there is no inherent relationship between the approximation quality of the marginals and the partition function.
(iii) Additionally, we introduced conditions that guarantee the existence of a fixed point that simultaneously approximates the marginals and the partition function best.

(Non-)convexity of the Bethe free energy depends on the structure of the graph and the potentials. If the model has loops and sufficiently strong couplings, multiple fixed points will exist. Let every fixed point $m$ have an associated local minimum $F_B^m$, an associated partition function $Z_B^m$, and associated pseudomarginals $\tilde{P}_B^m$. We denote the set of all $M$ fixed points by $S$.

Figure 2: $P_\mu^m$ is the fraction of BP runs that converge to fixed point $m$ with the corresponding Bethe partition function $Z_B^m$. The mismatch increases with $N$ and $\theta$.

Figure 3: Illustration of the fixed points for all regions.The circle-width corresponds to the value of F B .

Figure 4: Illustration of all regions and boundaries for Example 1: The black dots depict the boundary below which (16) holds; the approximated boundary according to (17) is depicted by the solid black line.

Figure 5: Accuracy of the marginals (a) and of the partition function (b) for Example 1 with $|\theta_i| = 0.1$: we emphasize the fixed points minimizing $E_Z(m)$ (blue), minimizing $E_P(m)$ (red), and minimizing both quantities (green).
Theorem 7. Let us consider the state-preserving fixed point $p$ with $F_B^p$ and some other fixed point with $F_B^m$. Then, $F_B^p < F_B^m$, i.e., $F_B^p$ is the global minimum, if
$$2J\,(|E_P| - |E_C|) < \theta\,(N - N_c + N_f) + \Delta S_B. \tag{23}$$
For models with two equal-sized patches we can further simplify (23) significantly.
Corollary 7.1 (Example 1). The state-preserving fixed point provides the most accurate marginals and the global minimum $F_B^p$ if $(\theta, J) \in$ (II) and if the simplified form of (23) holds.