Adaptive Thouless--Anderson--Palmer equation for higher-order Markov random fields

The adaptive Thouless--Anderson--Palmer (TAP) mean-field approximation is one of the advanced mean-field approaches, and it is known as a powerful accurate method for Markov random fields (MRFs) with quadratic interactions (pairwise MRFs). In this study, an extension of the adaptive TAP approximation for MRFs with many-body interactions (higher-order MRFs) is developed. We show that the adaptive TAP equation for pairwise MRFs is derived by naive mean-field approximation with diagonal consistency. Based on the equivalence of the approximate equation obtained from the naive mean-field approximation with diagonal consistency and the adaptive TAP equation in pairwise MRFs, we formulate approximate equations for higher-order Boltzmann machines, which is one of simplest higher-order MRFs, via the naive mean-field approximation with diagonal consistency.


I. INTRODUCTION
A Markov random field (MRF) is known as an important probabilistic graphical model in various scientific fields.There are a large variety of applications of MRFs, for example, in computer vision [1,2], engineering [3], machine learning [4,5], information sciences [6,7], and statistical physics.A Boltzmann machine [8], which is a kind of an MRF, is known as the fundamental probabilistic model in such fields.A typical Boltzmann machine is the same as an Ising model in statistical physics.A Boltzmann machine defined on a complete bipartite graph called a restricted Boltzmann machine (RBM) has frequently been used in deep learning [9,10].Statistical operations, such as the computation of expectations in MRFs, are computationally intractable in most cases because they require summations over all possible states of variables.Hence, we use approximate techniques, such as the Markov chain Monte Carlo (MCMC) method, for statistical computations.For RBMs, approximate learning methods based on MCMC sampling, such as contrastive divergence [11], have been proposed and successfully employed.They alleviate the computational intractability by using conditional independence of RBMs.
Mean-field approximations are effective for MRFs [12].Various mean-field-based methods have been developed in statistical mechanics, for example, the naive mean-field approximation, Thouless-Anderson-Palmer (TAP) approximation [13,14], Bethe approximation (or loopy belief propagation) [15,16], and the adaptive TAP approximation [17,18].Such mean-field methods allow for obtaining the approximate expectations of random variables in MRFs.In particular, the adaptive TAP approximation is known as one of the most powerful accurate methods in dense systems.The aim of this study is to extend the adaptive TAP approximation.Here and hereinafter, the term "adaptive TAP approximation" indicates the approach by Opper and Winther [17,18].
The linear response relation [19] is an important technique for obtaining the accurate approximations of higher-order ex-pectations.We can calculate such approximations from the expectations obtained by the mean-field methods using the linear response relation.For instance, susceptibilities (or covariances) are obtained using local magnetizations (or onevariable expectations) by utilizing the linear response relation.A message-passing type of algorithm based on the linear response relation is known as susceptibility propagation (SusP) [20] in statistical physics and as variational linear response in machine learning [21].However, algorithms that use linear response relation simply such as SusP experience the diagonal inconsistency problem [19,22,23].In an Ising model, the second-order moment of variable x 2 i should be unity because variable x i takes values of −1 or +1.However, the second-order moment obtained by such algorithms is not unity.Improved SusP (I-SusP), which is an improved version of SusP, was proposed by two of the authors to solve this problem in the context of SusP [23][24][25].I-SusP allows for using the linear response relation while maintaining diagonal consistency.This improves the approximation accuracy.Similar to SusP, I-SusP can be combined with various mean-field methods such as the ones listed above.
The demand for higher-order MRFs (MRFs with higherorder interactions) is growing continuously, particularly in computer vision [26,27].However, we cannot straightforwardly apply the adaptive TAP approximation to higher-order MRFs because in the conventional approaches to the adaptive TAP approximation, the energy function has to be written in a quadratic form with respect to the variables.It was found that the results obtained by the adaptive TAP approximation and I-SusP with the naive mean-field approximation are the same in an Ising model [23,28].Based on this, we expect the adaptive TAP approximation and I-SusP to be equivalent in other models.If this prediction is justified, we can construct the adaptive TAP approximation via the same calculation as I-SusP for any case.This implies that we can construct the adaptive TAP approximation for higher-order MRFs because I-SusP can be applied to various models, including higherorder MRFs.Here and hereinafter, the term "I-SusP" indicates the extended message-passing algorithm of belief propagation proposed by two of the authors [23].
The goal of this study is to formulate the adaptive TAP approximation for higher-order MRFs.In order to achieve this, first, we show the equivalence of the adaptive TAP approximation and the naive mean-field approximation with diagonal consistency in MRFs with quadratic energy functions (i.e., MRFs with Ising or non-Ising variables).This shows that the equivalence is justified at least in the models to which the adaptive TAP approximation can be straightforwardly applied.This fact strongly supports our prediction about the equivalence, based on which we tentatively accept our prediction as the ansatz.After that, we formulate the adaptive-TAP-like equation for higher-order Boltzmann machines, which is one of simplest higher-order MRFs, via the naive mean-field approximation with diagonal consistency.As the equivalence of the adaptive TAP approximation and the naive mean-field approximation with diagonal consistency has not been rigorously proven yet, we use the word "like" here and hereinafter.The term "naive mean-field approximation with diagonal consistency" indicates the approach employed in this study, which follows the same computational procedure as I-SusP, however, it is conceptually different from I-SusP.
The remainder of this paper is organized as follows: In Section II, we consider a pairwise MRF with a quadratic energy function.We introduce Gibbs free energy (GFE), which is a dual representation of Helmholtz free energy, for the pairwise MRF in Section II A. We derive the adaptive TAP free energy for the pairwise MRF using the GFE presented in Section II B. In Section III, we derive the naive mean-field approximation with diagonal consistency for the pairwise MRF and subsequently show the equivalence of the approximate equation given by this approach and the adaptive TAP equation derived in Section II B. In Section IV, we consider a higher-order Boltzmann machine and derive its adaptive-TAP-like equation via the naive mean-field approximation with diagonal consistency (Section IV B).In Section IV C, we show through numerical experiments that the naive mean-field approximation with diagonal consistency outperforms the simple naive mean-field approximation in the higher-order MRF, as expected.Finally, we summarize the study in Section V.

II. MARKOV RANDOM FIELD WITH QUADRATIC ENERGY FUNCTION
In this section, we consider a pairwise MRF with a quadratic energy function.Let us consider an undirected graph, G(V, E), with n vertices, where V = {1, 2, • • • , n} is the set of all vertices and E = {{i, j}} is the family of all undirected edges, {i, j}, in the graph.Random variables x = {x i ∈ X | i ∈ V } are assigned to the vertices.Here, X is a subset of R. We define the energy function (or the Hamiltonian) on the graph as where h = {h i | i ∈ V } and d = {d i | i ∈ V } are bias parameters (or the external fields) and anisotropic parameters, respectively, and J = {J i j | {i, j} ∈ E } are the coupling weight parameters between vertices i and j.We assume that there are no self-interactions (J ii = 0, ∀i ∈ V).All couplings are assumed to be symmetric (J i j = J ji ).Throughout this paper, we omit the explicit descriptions of the dependency on h, d, and J.However, it should be noted that almost all quantities described here and in the following sections depend on model parameters.Along with Eq. ( 1), a pairwise MRF is expressed as where is the partition function, and the summation implies x ∈X n = When X is a continuous space, the summation over x is replaced by integration.The inverse temperature is set to one throughout this paper.We refer to the MRF in Eq. ( 2) as the quadratic MRF.Note that Eq. ( 2) becomes the Gaussian MRF (or the Gaussian graphical model) [29] when X = (−∞, +∞), d i > 0 and the inverse covariance matrix is positive definite.The i jth elements of the covariance matrix are defined by .
The Helmholtz free energy of Eq. ( 2) is expressed as In the following sections, we introduce a GFE of Eq. ( 2), which is a dual representation of F.Moreover, we derive the adaptive TAP equation for Eq. ( 2) using the GFE.

A. Gibbs free energy and Plefka expansion
In this section, we introduce a GFE of the MRF in Eq. ( 2).Let us consider the Kullback-Leibler divergence (KLD) between a test distribution, Q(x), and the pairwise MRF in Eq. ( 2) The mean-field approximation can be formulated through the minimization of the KLD [30].The minimization of the KLD in Eq. ( 5) with respect to Q(x) is equivalent to the minimization of the variational free energy defined by because By minimizing the variational free energy under the normalization constraint and moment constraints the GFE is obtained as 7) and ( 8). ( The minimum of the GFE with respect to m and v is equivalent to the Helmholtz free energy, F = min m,v G(m, v).Moreover, the m i and v i that minimize the GFE coincide with x i and x 2 i , respectively, where f (x) := x ∈X n f (x)P(x) denotes the exact expectation for the distribution in Eq. ( 2).
For the Plefka expansion [31,32] described below, we introduce the auxiliary parameter α ∈ [0, 1] into the energy function in Eq. ( 1) as The auxiliary parameter adjusts the effect of the interaction term.When from the Lagrange multipliers corresponding to the first and second constraints in Eq. ( 8), respectively.It is noteworthy that The maximum conditions for b and c in Eq. ( 10) are obtained as and respectively, where and Z α (b, c) is the partition function defined in a manner similar to Eq. ( 3).We denote the solutions to Eqs. ( 11) and ( 12) by m(α) and v(α), respectively.Even though the solutions depend on all parameters in the model, we omit the description of the dependency, except for α, for the convenience of the subsequent analysis.The Plefka expansion is a perturbative expansion of Eq. ( 10) around α = 0.After a perturbative approximation, the corresponding approximation for the original GFE in Eq. ( 9) is obtained by setting α = 1.Several mean-field approximations are derived based on the Plefka expansion.For example, the naive mean-field approximation of Eq. ( 9), which is referred to as naive mean-field free energy, is obtained as follows: The expansion up to the first order of Eq. ( 10) is . By setting α = 1 in this expanded form, the naive mean-field free energy G naive (m, v) is obtained as where Based on Eqs. ( 11) and ( 12), b(0) and ĉ(0) satisfy for any m and v.The naive mean-field equation is obtained from the minimum condition of Eq. ( 13) with respect to m and v.Note that the TAP mean-field free energy [13,14] can be obtained via the expansion up to the second order of Eq. ( 10) [31].

B. Adaptive Thouless-Anderson-Palmer equation
In this section, we show the derivation of the adaptive TAP equation for the quadratic MRF defined in Section II via the conventional method: minimization of the adaptive TAP free energy.There are several approaches for deriving the adaptive TAP free energy for the quadratic MRF [17,18,33,34].Here, we focus on the approach based on the strategies proposed by Opper and Winther [17,18].The adaptive TAP free energy is defined as where the first term on the right-hand side of Eq. ( 17) is Eq. ( 10) with α = 0. G GMRF α (m, v) in Eq. ( 17) is Eq. ( 10) when X = (−∞, +∞), which corresponds to the Gaussian MRF [29].Using Gaussian integration, we have where parameters λ originate from the Lagrange multipliers corresponding to the first and second constraints in Eq. ( 8), respectively.Here, S α (Λ) is a symmetric matrix whose i jth element is defined by .
Executing the maximization with respect to λ, Eq. ( 18) becomes From Eqs. ( 10) and ( 19), Eq. ( 17) is obtained as The adaptive TAP equation corresponds to the minimum condition of Eq. ( 20) with respect to m and v. Hereinafter in this section, we denote the values of m and v at the minimum of Eq. ( 20) by m and v, respectively.From the minimum conditions of Eq. ( 20) with respect to m i and v i , we obtain bi (0) = h i + j ∈∂(i) and respectively, where ∂(i) := { j | {i, j} ∈ E } denotes the set of vertices that have a connection with i and Λ denotes the solution to the maximum condition for Λ in Eq. ( 20): Furthermore, from Eqs. ( 15) and ( 16), we have mi = Eqs. ( 21)-( 25) represent the adaptive TAP equation.We can obtain the approximate values of x i and x 2 i by solving the simultaneous equations with respect to m and v, respectively.However, the adaptive TAP equation includes matrix inversion (cf.Eq. ( 23)).This tends to obstruct the effective implementation of the adaptive TAP equation.
When X = {−1, +1} (i.e., when Eq. ( 2) is an Ising model), we have an alternative method of deriving the adaptive TAP equation using I-SusP with the naive mean-field equation [23,28].The adaptive TAP equation derived via I-SusP takes a message-passing type of formula, which does not explicitly include matrix inversion.This simplifies the implementation of the adaptive TAP equation.
However, in quadratic MRFs, the equivalence of the adaptive TAP equation and the approximate equation obtained from the naive mean-field approximation with diagonal consistency has not been explicitly shown beyond an Ising model.In the next section, we show that the naive mean-field approximation with diagonal consistency derives the approximate equation, which is equivalent to the adaptive TAP equation obtained in this section.

III. NAIVE MEAN-FIELD APPROXIMATION WITH DIAGONAL CONSISTENCY FOR QUADRATIC MARKOV RANDOM FIELD
In this section, we derive the approximate equation, which is the minimum condition of Eq. ( 13), via naive mean-field approximation with diagonal consistency.We show that it is equivalent to the adaptive TAP equation described in the previous section.
Let us consider the conventional SusP for the naive meanfield approximation.SusP is a message-passing type of method for obtaining approximations of susceptibilities (or covariances) χ exact i j := x i x j − x i x j .From the linear response relation, we have χ exact i j = ∂ x i /∂h j .SusP uses its approximation, χ exact i j ≈ χ app i j = ∂m app i /∂h j , where m app i is an approximation of x i obtained using a method such as the naive mean-field approximation.However, the susceptibilities obtained in this manner may not satisfy diagonal consistency.This implies that the relations may not hold, where v app i is an approximation of x 2 i obtained by employing the same method as that used for obtaining m app i .In I-SusP, we incorporate the diagonal trick method into SusP to satisfy the diagonal consistency in Eq. ( 26) [23][24][25].In this study, to derive approximate equations via the same computation as I-SusP with naive mean-field approximation, we extend the naive mean-field free energy in Eq. (13) as where Λ † := {Λ † i | i ∈ V } are the auxiliary parameters that are determined to satisfy the diagonal consistency in Eq. (26).
For fixed Λ † , we again denote the values of m and v at the minimum of Eq. ( 27) by m and v, respectively.The minimum conditions of Eq. ( 27) with respect to m i and v i lead to bi (0) = h i + j ∈∂(i) and respectively.The relations between { mi , vi } and { bi (0), ĉi (0)} are already given in Eqs. ( 24) and ( 25).Approximate susceptibilities are obtained via the linear response relation, χ i j := ∂ mi /∂h j .Therefore, from Eqs. ( 24) and ( 28), we obtain the simultaneous equations for the susceptibilities as where δ i j is the Kronecker delta.As mentioned earlier, Λ † should be determined to satisfy the diagonal consistency, χ ii = vi − m2 i .Therefore, they are determined by This is obtained from Eq. ( 30) and χ ii = vi − m2 i .Solving Eqs. ( 24), (25), and ( 28)-( 31) with respect to m, v, and χ provides the approximations for the first-order moments, second-order moments, and susceptibilities, respectively.
The equivalence of the solutions to the adaptive TAP equation (Eqs.( 21)-( 25)) and the approximate equation obtained from the naive mean-field approximation with diagonal consistency (Eqs.( 24), (25), and ( 28)-( 31)) can be easily verified.By considering Λi = Λ † i + 1/(v i − m2 i ), Eqs. ( 28) and ( 29) become Eqs.( 21) and (22).Furthermore, from Eq. (30), we obtain This implies that matrix χ is equivalent to the inverse of S 1 ( Λ).On the contrary, the diagonal susceptibilities obtained from the naive mean-field approximation with diagonal consistency satisfy 23)) is ensured.Based on the above, we found that the solutions to the adaptive TAP equation and the approximate equation obtained from naive mean-field approximation with diagonal consistency are generally equivalent in quadratic MRFs.This result supports the validity of our prediction about the equivalence of the adaptive TAP approximation and I-SusP with the naive mean-field approximation.Even though the equivalence has not been rigorously proven yet, we move to the following arguments by accepting it as the "ansatz".The advantage of the naive mean-field approximation with diagonal consistency compared with the conventional adaptive TAP approximation is that it is considerably easier to apply this method to models beyond quadratic MRFs, such as higherorder MRFs.In the typical approaches to the adaptive TAP equation [17,18,33,34], it is essential for the energy function to be quadratic, implying that applying such approaches to higher-order MRFs is not straightforward.In contrast, the naive mean-field approximation with diagonal consistency can be applied to models whose naive mean-field approximation can be explicitly described.This implies that we can consider an adaptive-TAP-like approximate equation in models beyond quadratic MRFs, such as higher-order MRFs, via the naive mean-field approximation with diagonal consistency.
On the other hand, the naive mean-field approximation with diagonal consistency has no particular advantage compared with the conventional adaptive TAP approximation in terms of computational complexity.If the MRF is a fully-connected model, the naive mean-field approximation with diagonal consistency costs O(n 3 ) similar to the conventional adaptive TAP approach.If the MRF is not a fully-connected model, for example, it is defined on a sparse graph such as a square-lattice (i.e., the expectation of the degree of the graph is at most O(1)), the naive mean-field approximation with diagonal consistency costs O(n 2 ).

IV. ADAPTIVE THOULESS-ANDERSON-PALMER EQUATION FOR HIGHER-ORDER MARKOV RANDOM FIELD A. Higher-order Boltzmann machine and its Gibbs free energy
We consider distinct subgraphs, µ ⊆ V, in G(V, E) and denote the family of all the subgraphs by C. Let us consider an MRF with higher-order interactions whose energy function is described as where J µ is the interaction weight among the vertices contained in µ.When all subgraphs in C are connected pairs in G(V, E), Eq. ( 32) is reduced to Eq. ( 1).This MRF is known as the higher-order Boltzmann machine [35].
In a manner similar to Section II A, we can derive the GFE and naive mean-field free energy for this higher-order Boltzmann machine.The GFE with auxiliary parameter α ∈ [0, 1] for adjusting the effect of the interaction is expressed as The Plefka expansion for Eq. ( 33) provides the naive mean-field free energy as where Z 0 ( b(0), ĉ(0)) is already defined in Eq. ( 14).Parameters b(0) and ĉ(0) satisfy Eqs. ( 15) and ( 16).The naive mean-field equation is obtained from the minimum conditions of Eq. ( 34) with respect to m and v.

B. Adaptive Thouless-Anderson-Palmer equation for higher-order Boltzmann machine
In a manner similar to Section III, we derive the adaptive-TAP-like equation for the higher-order Boltzmann machine via naive mean-field approximation with diagonal consistency.Similar to Eq. ( 27), we extend the naive mean-field free energy in Eq. ( 34) by installing the diagonal trick term: For fixed Λ ‡ := {Λ ‡ i | i ∈ V }, we again denote the values of m and v at the minimum of Eq. ( 35) by m and v, respectively.The minimum conditions of Eq. ( 35) with respect to m i and v i lead to bi (0 and respectively, where C(i) ⊆ C is the family of the subgraphs containing i.The relations between { mi , vi } and { bi (0), ĉi (0)} are already given in Eqs.(24) and (25).From Eqs. ( 24) and (36), the linear response relation, χ i j = ∂ mi /∂h j , is obtained as Finally, combining the diagonal consistency, χ ii = vi − m2 i , with Eq. (38) provides the equations for determining Λ ‡ i as Solving Eqs. ( 24), (25), and ( 36)- (39) with respect to m, v, and χ simultaneously provides the approximations for the firstorder moments, second-order moments, and susceptibilities, respectively, for the higher-order Boltzmann machine.Note that Eqs. ( 24), ( 25), ( 36) and ( 37) incorporate the effect of diagonal consistency and contribute to the inference.In Reference [36], the inference and learning of the higher-order Boltzmann machine have been generalized.In their work, while the learning is constrained by the "diagonal couplings" corresponding to the diagonal consistency, the inference is done by usual mean-field equations that do not include such constraints.In this respect, the proposing method in this section and the method in Reference [36] are fundamentally different.
(A detailed derivation of the learning using the approximate equations in this section is omitted here.)For p-spin Sherrington-Kirkpatrick model [37,38], the naive mean-field approximation with diagonal consistency yields the TAP equation presented in Reference [39].The details are shown in Appendix A.

C. Numerical experiments
In this section, we demonstrate the performance of the naive mean-field approximation with diagonal consistency presented in Section IV B. In the experiments, we consider a higher-order Boltzmann machine whose energy function is where C 2 := {{i, j} | i ∈ V, j ∈ V, i < j} is the family of all distinct pairs and C 3 := {{i, j, k} | i ∈ V, j ∈ V, k ∈ V, i < j < k} is the family of all distinct triplets.Eq. ( 40) is a special case of Eq. (32).When J 3 = 0, Eq. ( 40) is reduced to the quadratic energy shown in Eq. (1).In the experiments described below, we set n = 10 and J 3 = 0.001.Parameters {h i } and {J i j } were independently drawn from Gaussian distributions N (0, 0.1 2 ) and N (0, σ 2 / √ n), respectively, where N (µ, σ 2 ) is the Gaussian distribution with mean µ and variance σ 2 .As the number of variables is not large in this setting, we can compute exact expectations and compare them with approximate expectations.We used mean squared errors (MSEs) defined by as the performance measure.We compared the solutions to the naive mean-field approximation with diagonal consistency presented in Section IV B and the simple naive mean-field approximation in terms of the MSEs.The solution to the simple naive mean-field approximation for Eq. ( 40) is obtained from the minimum conditions of Eq. ( 34) with respect to m and v.
Figure 1 shows the result of MSE m against σ when X = {−1, +1}.In this case, as x 2 i is always one, v i is also always one in both methods.Hence, MSE v is always zero.It is noteworthy that the value of d is unrelated to the result because the second term in Eq. ( 40) is constant.Figure 2 shows the result of (a) MSE m and (b) MSE v against σ when X = {−1, 0, +1}.
In this experiment, we set d = 0.01.The naive mean-field approximation with diagonal consistency outperforms the simple naive mean-field approximation in both experiments, as we expected.

V. DISCUSSION AND CONCLUSION
We have formulated the adaptive-TAP-like approximate equation for higher-order Boltzmann machines via the naive mean-field approximation with diagonal consistency.In the numerical experiments, we have observed that the expectations of the variables obtained using the adaptive-TAP-like equations are more accurate than those obtained using the simple naive mean-field approximation.It is noteworthy that a method almost the same as I-SusP was independently proposed around the same time by Raymond and Ricci-Tersenghi [28].While I-SusP considers only diagonal consistency, their method additionally involves constraints with regard to off-diagonal consistency [40].Our approach can be extended by employing such advanced constraints.
In addition, in Section IV, only models whose Hamiltonian does not include higher-order terms, such as x 2 i x j and x 3 i , are considered.Such higher-order terms should also be considered if the variables have discrete or continuous values.Because this generalization requires complicated formulations other than the procedure shown in Section III and Section IV, details of the generalization are omitted in this paper.This task will be addressed soon.
In this paper, we have reported the results of numerical performance evaluation for direct problems (inference).The adaptive TAP equation or other mean-field approaches that use the linear response relation are known to be effective against the inverse problem (learning) [36,[41][42][43].Application of the adaptive-TAP-like equation to the inverse problem and its performance evaluation will be addressed in our future tasks.
Another challenge to address is the theoretical verification of the performance of the naive mean-field approximation with diagonal consistency.The adaptive TAP equation or the adaptive-TAP-like equation frequently converges more slowly compared to the simple naive mean-field equation or the simple TAP equation or fails to converge depending on the settings of problems.In this study, the accuracy of the approximation has been clarified only from the experimental aspect.Several mean-field-based algorithms whose performance has been guaranteed theoretically have been proposed in previous studies [44,45].Overcoming this challenge will improve our understanding of I-SusP, or the naive mean-field method proposed in this study, and the adaptive TAP approximation.
hand, the order of the terms of O(α 3 ) and O(α 4 ) are given by O(N −1 ) and O(N −2 ), respectively.As γ increases, the order of the higher-order terms becomes smaller with respect to N. From this, higher-order terms O(α γ ) (γ ≥ 3) can be negligible in the N → ∞ limit.By setting α = 1, we obtain Equations ( 24) and (36) with Eq. (A7) correspond to the TAP equation in Reference [39].

FIG. 1 . 3 MSEFIG. 2 .
FIG.1.Plot of MSE m against σ when X = {−1, +1}.The points labeled as "naive" and "naive + dc" are the results obtained by the simple naive mean-field approximation and the naive mean-field approximation with diagonal consistency, respectively.Each point in the plot denotes the average value over 1000 trials.