Empirical Bayes method for Boltzmann machines

We consider an empirical Bayes method for Boltzmann machines and propose an algorithm for it. The empirical Bayes method allows for the estimation of the values of the hyperparameters of the Boltzmann machine by maximizing a specific likelihood function, referred to as the empirical Bayes likelihood function in this study. However, this maximization is computationally hard because the empirical Bayes likelihood function involves intractable integrations of the partition function. The proposed algorithm avoids this computational problem by using the replica method and the Plefka expansion. Our method is simple and fast because it does not require any iterative procedures, and it gives reasonable estimates under certain conditions. However, our method introduces a bias into the estimate, which exhibits an unnatural behavior with respect to the size of the dataset. This peculiar behavior is presumably due to the approximate treatment by the Plefka expansion. A possible extension to overcome this behavior is also discussed.


Introduction
Boltzmann machine learning (BML) [1] has been actively studied in the field of machine learning and also in statistical mechanics. In statistical mechanics, the problem of BML is sometimes referred to as the inverse Ising problem, because a Boltzmann machine is the same as an Ising model, and BML can be regarded as an inverse problem for the Ising model. The framework of the usual BML is as follows. Given a set of observed data points (e.g. spin snapshots), we estimate appropriate values of the parameters, the external field and couplings, of our Boltzmann machine through maximum likelihood (ML) estimation (see section 2.1). Because BML involves intractable multiple summations (i.e. evaluation of the partition function), many approximations have been proposed from the viewpoint of statistical mechanics [2]: for example, methods based on mean-field approximations (such as the Plefka expansion [3] and the cluster variation method [4]) [5][6][7][8][9][10][11] and methods based on other approximations [12,13].
In this study, we focus on another type of learning problem. We consider prior distributions of the parameters of the Boltzmann machine and assume that the prior distributions are governed by some hyperparameters. The introduction of the prior distributions is strongly connected with regularized ML estimation, in which the hyperparameters can be regarded as regularization coefficients (see section 2.1). The regularized ML estimation is important in preventing over-fitting to the dataset. In particular, the over-fitting problem becomes serious for small datasets. As mentioned above, the aim of the usual BML is to optimize the values of the parameters of the Boltzmann machine by using a set of observed data points. Meanwhile, the aim of the problem investigated in this study is the estimation of appropriate values of the hyperparameters from the dataset without estimating specific values of the parameters. One way to accomplish this from the Bayesian point of view is the empirical Bayes method (also called type-II ML estimation or the evidence approximation) [14,15] (see section 2.2). The schemes of the usual BML and of our problem are illustrated in figure 1.
However, the evaluation of the likelihood function in the empirical Bayes method is again intractable, because it involves intractable multiple integrations of the partition function. In this study, we analyze the empirical Bayes method for fully-connected Boltzmann machines, using statistical mechanical techniques based on the replica method [16,17] and the Plefka expansion to derive an algorithm for it. We consider two types of cases of the prior distribution of J: the cases of Gaussian and Laplace priors.
The rest of this paper is organized as follows. The formulations of the usual BML and the empirical Bayes method are presented in section 2. In section 3, we describe our statistical mechanical analysis for the empirical Bayes method. The proposed inference algorithm obtained from our analysis is shown in section 3.3 with its pseudocode. In section 4, we examine our proposed method through numerical experiments. Finally, the summary and some discussions are presented in section 5.

Boltzmann machine and prior distributions
Consider a fully-connected Boltzmann machine with n Ising variables S := {S_i ∈ {−1, +1} | i = 1, 2, …, n} [1]:

P(S | h, J) := exp( h Σ_i S_i + Σ_{i<j} J_{ij} S_i S_j ) / Z(h, J),

where Σ_{i<j} is the sum over all the distinct pairs of variables; i.e. Σ_{i<j} := Σ_{i=1}^{n} Σ_{j=i+1}^{n}. Z(h, J) is the partition function defined by

Z(h, J) := Σ_S exp( h Σ_i S_i + Σ_{i<j} J_{ij} S_i S_j ),

where Σ_S is the sum over all the possible configurations of S; i.e. Σ_S := Π_{i=1}^{n} Σ_{S_i = ±1}. The parameters, h ∈ (−∞, +∞) and J := {J_{ij} ∈ (−∞, +∞) | i < j}, denote the external field and the couplings, respectively.
Now, we introduce prior distributions for the parameters h and J, P_prior(h | H) and P_prior(J | γ), respectively. H and γ are the hyperparameters of these prior distributions. One of the most important motivations for introducing the prior distributions is a Bayesian interpretation of the regularized ML estimation [15]. Given the observed dataset D, by using the prior distributions, the posterior distribution of h and J is expressed as

P(h, J | D, H, γ) = P(D | h, J) P_prior(h | H) P_prior(J | γ) / P(D | H, γ),      (5)

where P(D | H, γ) is obtained by marginalizing the numerator over h and J. The distribution in the denominator in equation (5), P(D | H, γ), is sometimes referred to as the evidence. By using the posterior distribution, the maximum a posteriori (MAP) estimation of the parameters is obtained as

{ĥ_MAP, Ĵ_MAP} := argmax_{h,J} ln P(h, J | D, H, γ).      (6)

The MAP estimation in equation (6) corresponds to the regularized ML estimation, in which R_0(h) := ln P_prior(h | H) and R_1(J) := ln P_prior(J | γ) work as penalty terms. For example, (i) when the prior distribution of J is the Gaussian prior, R_1(J) corresponds to the L_2 regularization term, and γ corresponds to its coefficient; (ii) when the prior distribution of J is the Laplace prior, R_1(J) corresponds to the L_1 regularization term, and γ again corresponds to its coefficient. The variances of these prior distributions are identical: Var[J_{ij}] = γ/n. In this study, as a simple test case, we use these two prior distributions for J,

P_prior(J | γ) = Π_{i<j} √(n/(2πγ)) exp( −n J_{ij}² / (2γ) )      (Gaussian prior, equation (8)),

P_prior(J | γ) = Π_{i<j} √(n/(2γ)) exp( −√(2n/γ) |J_{ij}| )      (Laplace prior, equation (9)),

and

P_prior(h | H) = δ(h − H),      (10)

where δ(x) is the Dirac delta function, for h.
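To make the setup concrete, the following is a minimal sketch (ours, not from the paper) of the model and the priors above for a tiny n, where the brute-force partition function is tractable; all function and variable names are our own.

```python
import itertools
import numpy as np

def boltzmann_weight(S, h, J):
    """Unnormalized Boltzmann weight exp(h * sum_i S_i + sum_{i<j} J_ij S_i S_j)."""
    energy = h * np.sum(S) + S @ np.triu(J, 1) @ S  # upper triangle holds J_ij, i < j
    return np.exp(energy)

def partition_function(n, h, J):
    """Brute-force Z(h, J): sum over all 2^n spin configurations (tiny n only)."""
    return sum(boltzmann_weight(np.array(S), h, J)
               for S in itertools.product([-1, 1], repeat=n))

def sample_parameters(n, H, gamma, prior="gauss", rng=None):
    """Draw (h, J) from the priors: h = H (delta prior); J_ij i.i.d. with Var gamma/n."""
    if rng is None:
        rng = np.random.default_rng(0)
    J = np.zeros((n, n))
    iu = np.triu_indices(n, k=1)
    if prior == "gauss":
        J[iu] = rng.normal(0.0, np.sqrt(gamma / n), size=len(iu[0]))
    else:  # Laplace prior with the same variance gamma/n (scale b = sqrt(gamma/(2n)))
        J[iu] = rng.laplace(0.0, np.sqrt(gamma / (2 * n)), size=len(iu[0]))
    return H, J
```

Only the upper triangle of `J` is populated, matching the convention J := {J_ij | i < j}.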

Framework of the empirical Bayes method
Using the empirical Bayes method, we can infer the values of the hyperparameters, H and γ, from the observed dataset D. We define the marginal log-likelihood function as

L_EB(H, γ) := (1/(nN)) ln [P(D | h, J)]_{h,J},      (11)

where [···]_{h,J} is the average over the prior distributions; i.e.

[A(h, J)]_{h,J} := ∫ dh ∫ dJ A(h, J) P_prior(h | H) P_prior(J | γ).
We refer to the marginal log-likelihood function as the empirical Bayes likelihood function in this study. From the perspective of the empirical Bayes method, the optimal values of the hyperparameters, Ĥ and γ̂, are obtained by maximizing the empirical Bayes likelihood function; i.e.

{Ĥ, γ̂} := argmax_{H,γ} L_EB(H, γ).      (12)
It is noteworthy that [P(D | h, J)]_{h,J} in equation (11) is identified as the evidence appearing in equation (5).
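For very small n, the empirical Bayes likelihood in equation (11) can be estimated directly, which is useful for checking any approximation against ground truth. Below is a minimal Monte Carlo sketch (our construction; the delta prior on h and the Gaussian prior on J follow the text, while the function names and sample sizes are ours).

```python
import itertools
import numpy as np

def log_likelihood(D, h, J):
    """ln P(D | h, J) for a dataset D of ±1 spin rows, via brute-force Z (tiny n only)."""
    n = D.shape[1]
    def energy(S):
        return h * S.sum() + S @ np.triu(J, 1) @ S
    logZ = np.log(sum(np.exp(energy(np.array(S)))
                      for S in itertools.product([-1, 1], repeat=n)))
    return sum(energy(row) - logZ for row in D)

def empirical_bayes_likelihood(D, H, gamma, n_samples=2000, seed=0):
    """Monte Carlo estimate of L_EB(H, gamma) = (1/(nN)) ln E_{h,J}[P(D | h, J)],
    with h = H (delta prior) and J_ij ~ N(0, gamma/n)."""
    rng = np.random.default_rng(seed)
    N, n = D.shape
    iu = np.triu_indices(n, 1)
    logs = []
    for _ in range(n_samples):
        J = np.zeros((n, n))
        J[iu] = rng.normal(0.0, np.sqrt(gamma / n), size=len(iu[0]))
        logs.append(log_likelihood(D, H, J))
    logs = np.array(logs)
    # log-mean-exp for numerical stability
    return (logs.max() + np.log(np.mean(np.exp(logs - logs.max())))) / (n * N)
```

Because P(D | h, J) ≤ 1, the estimate is always non-positive; the strong localization of the integrand for large nN (discussed in section 4.3) limits this naive estimator to small systems.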
The marginal log-likelihood function can be rewritten in the form of equation (13). Consider the case N ≫ n. In this case, by using the saddle-point evaluation, the average over the priors in equation (13) is dominated by the ML solution; the empirical Bayes estimates {Ĥ, γ̂} thus converge to the ML estimates of the hyperparameters of the prior distributions into which the ML estimates of the parameters {ĥ_ML, Ĵ_ML} (i.e. the solution to the BML) are inserted. This indicates that the parameter estimation can be conducted independently of the hyperparameter estimation. In this study, we do not concern ourselves with this trivial case.

As discussed, the hyperparameters correspond to the regularization coefficients in the regularized ML estimation. Generally, appropriate values of the regularization coefficients can be estimated using two methods. The first method is cross validation (CV), such as leave-one-out CV. The CV method is usually used together with the regularized ML estimation (or the MAP estimation). In the CV method, the validity of the solution to the regularized ML estimation is checked using a validation dataset separated from the training dataset, and the values of the regularization coefficients providing the optimal solution are selected. This method is effective when the validity of the solution can be checked easily, as is the case in, e.g., regression or pattern-recognition problems. However, there are applications in which the validity cannot be checked easily. The graph-mining problem [18,19] is such an example, because we do not know in advance what structure is the best. The second method is the empirical Bayes approach discussed in this paper. This approach allows us to estimate appropriate hyperparameters in cases where the CV method cannot be used directly, because it is based on the fully Bayesian treatment and does not require explicit validation. The empirical Bayes approach is thus important for problems involving difficulties in validation.

Statistical mechanical analysis
The empirical Bayes likelihood function in equation (11) involves intractable multiple integrations. In this section, we evaluate the empirical Bayes likelihood function using a statistical mechanical analysis. We consider the two types of the prior distribution of J: one is the Gaussian prior in equation (8), and the other is the Laplace prior in equation (9).
First, we evaluate the empirical Bayes likelihood function on the basis of the Gaussian prior in sections 3.1-3.3, after which we describe the evaluation based on the Laplace prior in section 3.4.

Replica method
The empirical Bayes likelihood function in equation (11) can be represented in terms of the sample averages of the observed data points, defined in equations (17) and (18). The negative powers of the partition function appearing in this representation are treated via the replica method. We assume that τ_x := xN is a natural number, so that equation (15) can be expressed as a summation over the replicated system, where S^x := {S^a | a = 1, 2, …, τ_x} is the set of all the Ising variables in the replicated system, and Σ_{S^x} is the sum over all the possible configurations of S^x; i.e. Σ_{S^x} := Π_{a=1}^{τ_x} Σ_{S^a}. We evaluate Ψ_x(H, γ) under the assumption that τ_x is a natural number, after which we take the limit x → −1 of the evaluation result to obtain the empirical Bayes likelihood function (this is the so-called replica trick).
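Written out, the replica manipulation sketched above amounts to the following (our transcription of the text's procedure; Ψ_x is defined in equation (16)):

```latex
% The negative power Z^{-N} inside [P(D | h, J)]_{h,J} is analytically
% continued from integer powers tau_x = xN and recovered as x -> -1:
\[
  Z(h,J)^{-N} \;=\; \lim_{x \to -1} Z(h,J)^{\tau_x},
  \qquad \tau_x := xN,
\]
\[
  Z(h,J)^{\tau_x}
    \;=\; \sum_{S^x} \exp\!\Bigl( \sum_{a=1}^{\tau_x}
      \Bigl( h \sum_{i=1}^{n} S_i^{a}
        + \sum_{i<j} J_{ij}\, S_i^{a} S_j^{a} \Bigr) \Bigr),
\]
% so that the empirical Bayes likelihood is obtained as
% \mathcal{L}_{\mathrm{EB}}(H,\gamma) = \lim_{x \to -1} \Psi_x(H,\gamma).
```

The sum over S^x is what turns the intractable negative power into a replicated Ising system amenable to mean-field treatment.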
By employing the Gaussian prior in equation (8), equation (16) becomes Ψ_x^Gauss(H, γ) in equation (17), which is expressed in terms of the replicated (Helmholtz) free energy [20][21][22][23] in equation (19); here, the Hamiltonian of the replicated system, E_x(S^x; H, γ), is given in equation (20), where Σ_{a<b} is the sum over all the distinct pairs of replicas; i.e. Σ_{a<b} := Σ_{a=1}^{τ_x} Σ_{b=a+1}^{τ_x}.

Plefka expansion
Because the replicated free energy in equation (19) includes intractable multiple summations, an approximation is needed to proceed with our evaluation. In this section, we approximate the replicated free energy using the Plefka expansion [3]. In brief, the Plefka expansion is a perturbative expansion of a Gibbs free energy that is a dual form of the corresponding Helmholtz free energy. The Gibbs free energy G_x(m, H, γ) is given in equation (21); its derivation is described in appendix A. It is noteworthy that this type of expression of the Gibbs free energy implies the replica-symmetric (RS) assumption. To take replica-symmetry breaking (RSB) into account, explicit treatments of the overlaps between different replicas are needed [21]. By expanding G_x(m, H, γ) around γ = 0, we obtain equation (24), where e(m) is the negative mean-field entropy defined in equation (23), and the coefficients, φ_x^{(1)}(m) and φ_x^{(2)}(m), are expressed as equations (B.4) and (B.9), respectively. The detailed derivation of these coefficients is presented in appendix B.

Inference algorithm
As mentioned in section 2.2, the empirical Bayes inference is achieved by maximizing L_EB(H, γ) with respect to H and γ (see equation (12)). From the extremum condition of equation (24) with respect to H, we obtain equation (29), where m̂ is the value of m that satisfies the extremum condition in equation (24). From the extremum condition of equation (24) with respect to m, together with equation (29), we obtain equation (30). From equations (24) and (29), the optimal value of γ is obtained as equation (31). From equation (31), γ̂ is immediately obtained in closed form: it takes a finite value in cases (i) and (ii), depending on the signs of the coefficients φ_{−1}^{(1)}(M) and φ_{−1}^{(2)}(M), and (iii) γ̂ → ∞ elsewhere. Here, we ignore the degenerate case φ_{−1}^{(2)}(M) = Φ(M) = 0, because it hardly occurs in realistic settings. By using equations (30) and (31), we can obtain the solution to the empirical Bayes inference without any iterative processes. The pseudocode of the proposed procedure is shown in algorithm 1; its main steps are to compute M, Ω, C_1, and C_2 from the dataset according to equations (18) and (27), to determine γ̂ using equation (31), and then to determine Ĥ. In the proposed method, the value of Ĥ does not affect the determination of γ̂. Many mean-field-based methods for BML (e.g. those listed in section 1) have similar procedures, in which Ĵ_ML is determined separately from ĥ_ML. This can be seen as one of the common properties of the mean-field-based methods for BML, including the current empirical Bayes problem.
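The reason the procedure needs no iteration is that algorithm 1 consumes only simple moments of the data, computed in one pass. A sketch in code (ours; M below follows the text's description of a sample average, while `second` is a hypothetical stand-in for the second-order statistics of equation (27), whose exact forms we do not reproduce here):

```python
import numpy as np

def data_statistics(D):
    """One pass over the (N, n) array D of ±1 spins.
    Returns the grand mean magnetization M and a second-moment
    statistic of the per-sample magnetizations M_mu (illustrative
    stand-ins for the statistics of equations (18) and (27))."""
    N, n = D.shape
    M = D.mean()               # (1/(nN)) sum_{mu,i} S_i^(mu)
    m_mu = D.mean(axis=1)      # per-sample magnetizations M_mu
    second = np.mean(m_mu ** 2)
    return M, second
```

Once such O(Nn) statistics are in hand, γ̂ and Ĥ follow from single closed-form evaluations of equations (31) and (30); no sweep over the dataset is repeated.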

Evaluation based on Laplace prior
The above evaluation was for the Gaussian prior in equation (8). Here, we explain the evaluation for the Laplace prior in equation (9). By employing the Laplace prior in equation (9), equation (16) becomes equation (32), where ξ := √(2n/γ). Here, we assume that the condition in equation (33) holds. By using a perturbative approximation under this assumption, we obtain an approximation of equation (32) whose right-hand side coincides with Ψ_x^Gauss(H, γ) in equation (17). This means that the empirical Bayes inference based on the Laplace prior in equation (9) is (approximately) equivalent to that based on the Gaussian prior in equation (8) when the assumption of equation (33) is justified. Thus, we can also use the algorithm presented in section 3.3 for the case of the Laplace prior.
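A quick numerical sanity check (ours, not from the paper) of the premise behind this equivalence: a Laplace prior with rate ξ = √(2n/γ), i.e. scale 1/ξ, has the same variance γ/n as the Gaussian prior.

```python
import numpy as np

n, gamma = 50, 0.8
rng = np.random.default_rng(1)

# Gaussian prior: J_ij ~ N(0, gamma/n)
gauss = rng.normal(0.0, np.sqrt(gamma / n), size=200_000)

# Laplace prior with rate xi = sqrt(2n/gamma), i.e. scale b = 1/xi;
# its variance is 2 b^2 = gamma/n, matching the Gaussian case
xi = np.sqrt(2 * n / gamma)
laplace = rng.laplace(0.0, 1.0 / xi, size=200_000)

print(np.var(gauss), np.var(laplace), gamma / n)  # all approximately equal
```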

Numerical experiments
In this section, we describe the results of our numerical experiments. In these experiments, the observed dataset D is generated from a generative Boltzmann machine, which has the same form as equation (1), by using annealed importance sampling (AIS) [24]. In AIS, we control the annealing schedule using a series of inverse temperatures 0 = β_0 < β_1 < ⋯ < β_T < β_final = 1, where we used the annealing schedule β_{t+1} = β_t + 0.03. The parameters of the generative Boltzmann machine are drawn from the prior distributions in equations (4) and (10). That is, we consider the model-matched case (i.e. the generative and learning models are identical).
In the following, we use the notations α := N/n and J := √ γ. The standard deviations of the Gaussian prior in equation (8) and of the Laplace prior in equation (9) are then J/ √ n. We express the hyperparameters for the generative Boltzmann machine by H true and J true .
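As an illustration of the data-generation step, the following is a simplified sketch (ours) of annealed Gibbs sampling with the linear schedule β_{t+1} = β_t + 0.03; the full AIS procedure of [24] additionally accumulates importance weights, which we omit here.

```python
import numpy as np

def annealed_gibbs_sample(h, J, rng, sweeps_per_beta=1):
    """Approximate sample from the Boltzmann machine P(S | h, J).
    J must be symmetric with zero diagonal. The chain starts from an
    exact sample at beta = 0 (uniform spins) and is annealed to beta = 1."""
    n = J.shape[0]
    betas = np.append(np.arange(0.03, 1.0, 0.03), 1.0)  # 0.03, 0.06, ..., 1.0
    S = rng.choice([-1, 1], size=n)
    for beta in betas:
        for _ in range(sweeps_per_beta):
            for i in range(n):
                local_field = h + J[i] @ S          # diagonal of J is zero
                p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * local_field))
                S[i] = 1 if rng.random() < p_up else -1
    return S
```

The conditional update uses P(S_i = +1 | rest) = 1/(1 + exp(−2β × local field)), the heat-bath rule for ±1 spins.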

Gaussian prior case
Here, we consider the case in which the prior distribution of J is the Gaussian prior in equation (8). In this case, the Boltzmann machine corresponds to the Sherrington-Kirkpatrick (SK) model [25], and therefore it shows the spin-glass transition at J = 1 when h = 0 (i.e. when H = 0).
First, we consider the case H true = 0. We show the scatter plots for the estimation of Ĵ for various J true when H true = 0 and α = 0.4 in figure 2. The detailed values of the plots for some J true values are shown in table 1.
When J_true < 1, our estimates Ĵ are in good agreement with J_true, whereas the agreement degrades when J_true > 1. This implies that the validity of our perturbative approximation is lost in the spin-glass phase, as is often the case with many mean-field approximations. Figure 3 shows the scatter plots for various α. A smaller α causes Ĵ to be overestimated, and a larger α causes it to be underestimated. Here, the per-sample magnetization M_μ := n^{-1} Σ_{i=1}^{n} S_i^{(μ)} is expected to have the self-averaging property; thus, its variance is negligible for n ≫ 1. In this case, M takes approximately the same value regardless of the size of N, and therefore M can have sufficient information to achieve the appropriate estimation even though N is small.
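The self-averaging of M_μ can be illustrated numerically. In this sketch (ours), the spins are i.i.d. and unbiased, which is only a toy stand-in for actual Boltzmann machine samples; the variance of the per-sample magnetization then decays as 1/n.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    S = rng.choice([-1, 1], size=(5000, n))   # 5000 samples of n i.i.d. spins
    m_mu = S.mean(axis=1)                     # per-sample magnetization M_mu
    print(n, np.var(m_mu))                    # approx 1/n: self-averaging
```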
One of the largest qualitative differences between the cases H_true = 0 and H_true > 0 is the scale of α. In the case H_true = 0, the optimal α is scaled as O(1) with respect to n (i.e. N = O(n)). Meanwhile, in the case H_true > 0, the optimal α is scaled as O(1/n) with respect to n (i.e. N = O(1)). This change of scale can be understood from a scale evaluation of the terms in the empirical Bayes likelihood function in equation (24). The detailed reasoning is given in appendix C.

Laplace prior case
Here, we consider the case in which the prior distribution of J is the Laplace prior in equation (9). The scatter plots for the estimation of Ĵ for various J true values when H true = 0 are shown in figure 8. The plots shown in figure 8 almost completely overlap with those in figure 3. Furthermore, all the numerical results in the case H true > 0 also almost completely overlap with the corresponding results obtained in the above Gaussian prior case, and therefore we do not show those results.

Comparison with other method
Here, we compare the proposed method with a method based on the maximum pseudo-likelihood estimation (MPLE) [26]. In the MPLE, we approximate the log-likelihood function defined in equation (2) by the pseudo-likelihood function L_PL(h, J) in equation (34). By using the pseudo-likelihood function, we approximate the empirical Bayes likelihood function in equation (13) as L_EB^PL in equation (35). It is known that, when the data points are generated from the same Boltzmann machine, the solution of the MPLE converges to that of the ML estimation in the N → ∞ limit [27]. Thus, it can be expected that the solution obtained by the maximization of equation (35) is good when N ≫ 1. For simplicity, we ignore the external field h and consider the maximization problem with respect to γ only, γ̂ = argmax_γ L_EB^PL(γ), where the prior distribution of J is the Gaussian prior with J_true = 0.6. In the experiment, we maximize L_EB^PL(γ) numerically based on a line search and evaluate the multiple integrations with respect to J using Monte Carlo integration with 10^4 samples. The results are summarized in table 4. Contrary to our expectation, the estimated Ĵ becomes worse as N increases. This is presumably because of the Monte Carlo integration used to evaluate the multiple integrations over J. The integrand, exp(nN L_PL(J)), is strongly localized around the maximum point of L_PL(J) when nN is large. A precise evaluation of such a strongly localized integrand by simple Monte Carlo integration is generally difficult.
From the above observations, we believe that the pseudo-likelihood approach cannot be effective unless a special treatment is introduced for evaluating the multiple integrations.
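For reference, the pseudo-likelihood itself is cheap to evaluate, since it replaces the global partition function with n single-spin normalizations. A minimal sketch (ours; the symmetric-J, zero-diagonal convention is our assumption):

```python
import numpy as np

def pseudo_log_likelihood(D, h, J):
    """Sum over data points and sites of ln P(S_i | S_{\\i}) for ±1 spins.
    No partition function is needed: each conditional normalizes over one spin.
    J is assumed symmetric with zero diagonal."""
    total = 0.0
    for S in D:
        fields = h + J @ S                           # local field at each site
        # ln P(S_i | rest) = -ln(1 + exp(-2 * S_i * field_i))
        total += -np.sum(np.log1p(np.exp(-2.0 * S * fields)))
    return total
```

With h = 0 and J = 0, every conditional equals 1/2, so the value reduces to −Nn ln 2, which is a convenient correctness check.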

Summary and discussions
In this study, we proposed a hyperparameter inference algorithm obtained by analyzing the empirical Bayes likelihood function in equation (11) using the replica method and the Plefka expansion. The validity of our method was examined in numerical experiments for the Gaussian and Laplace priors, which demonstrated the existence of an appropriate scale of the dataset size at which the values of the hyperparameters are accurately recovered.
However, some problems remain. The first is the scale of N. In our experiments, we found that the appropriate N is scaled as O(n) when H_true = 0 or as O(1) when H_true > 0. However, such scales seem to be unnatural, because they should not appear in the original framework of the empirical Bayes method. As discussed in section 2.2, when N ≫ n, maximizing the empirical Bayes likelihood function reduces to the ML estimation of the prior distributions at the solution to BML. This must lead to the correct γ̂ and Ĥ, because the solution to BML is perfect when N → ∞. For reference, we show the results obtained from the original framework. When n is small, we can numerically evaluate L_ML(h, J); therefore, we can evaluate equation (13) numerically. Figure 9 shows the MAE between Ĵ and J_true (i.e. the average of |J_true − Ĵ|) for various N when n = 5. In the experiment, we ignore h and evaluate the multiple integrations over J using Monte Carlo integration (with 10^5 samples), as we did in section 4.3. The MAEs monotonically decrease with increasing N, and the unnatural scales that our method exhibits do not seem to appear. Therefore, these unnatural scales arise from our approximation, which is also supported by the scale analysis given in appendix C. An improvement of the approximation (e.g. by evaluating higher-order terms in the Plefka expansion or by using some other approximation) might remove these unnatural behaviors.
The second problem is the optimal value of α = N/n. Empirically, we found that α_opt ≈ 0.4 when H_true = 0 and that it decreases as H_true increases (e.g. α_opt ≈ 30/n when H_true = 0.2 and α_opt ≈ 5/n when H_true = 0.4). As can be seen in the results of our experiments, the solution of our method is robust to the choice of α when J_true is small (J_true < J_c) and is sensitive to it when J_true is large (J_true > J_c), where J_c ≈ 0.4. Estimating α_opt is thus very important for our method, and doing so would make our method more practical. This problem is presumably strongly related to the first problem.
The third problem is the degradation of the estimation accuracy in the spin-glass phase. In our experiments, the estimation accuracies of γ̂ and Ĥ were obviously degraded in the spin-glass phase. This means that our Plefka expansion based on the RS assumption loses its validity in the spin-glass phase. In [21], a Plefka expansion for the one-step RSB was proposed. Employing this expansion instead of the current one could reduce the degradation in the spin-glass phase. These three problems should be addressed in our future studies.
In this study, we used fully-connected Boltzmann machines whose variables are all visible. We are also interested in extending our method to other types of Boltzmann machines, such as Boltzmann machines with specific structures or hidden variables. Furthermore, we considered the model-matched case (i.e. the case in which the generative model and the learning model are the same) in the current study, but model-mismatched cases are more practical and important.

Appendix A. Gibbs free energy
In this appendix, we derive the Gibbs free energy for the replicated (Helmholtz) free energy in equation (19). The replicated free energy is obtained by minimizing the variational free energy f[Q], defined in equation (A.1), under the normalization constraint Σ_{S^x} Q(S^x) = 1, where Q(S^x) is a test distribution over S^x and E_x(S^x; H, γ) is the Hamiltonian for the replicated system defined in equation (20). The Gibbs free energy is obtained by adding new constraints to the minimization of f[Q].
Here, we add the relation m = (nτ_x)^{-1} Σ_{a=1}^{τ_x} Σ_{i=1}^{n} Σ_{S^x} S_i^a Q(S^x) as the constraint. By using Lagrange multipliers, the Gibbs free energy is obtained as equation (A.2), where 'extr' denotes the extremum with respect to the assigned parameters. By performing the extremum operation with respect to Q(S^x) and r in equation (A.2), we obtain an explicit expression for the Gibbs free energy. The replicated free energy in equation (19) coincides with the extremum of this Gibbs free energy with respect to m.

Appendix B. Derivation of coefficients of Plefka expansion
The Plefka expansion considered in this study is obtained by expanding the Gibbs free energy in equation (21) around γ = 0 up to the second order in γ, where e(m) is defined in equation (23).
For the derivations of the coefficients φ_x^{(1)}(m) and φ_x^{(2)}(m), we decompose E_x(S^x; H, λ) in equation (21) into two parts. The coefficient φ_x^{(1)}(m) is defined by the first derivative of the Gibbs free energy with respect to γ at γ = 0. This derivative leads to an expression involving ⟨···⟩_γ, which denotes the average with respect to the distribution determined by λ*, where λ* is the value of λ that satisfies the extremum condition in equation (21) and is the function relating γ and m; i.e. λ* = λ*(γ, m). From the extremum condition for λ in equation (21), we obtain the relation used in the derivation of equation (B.4), where K_x := τ_x(τ_x − 1)/2. The coefficient φ_x^{(2)}(m) is defined, with the prefactor 1/(2nN), through the second derivative of the Gibbs free energy with respect to γ. From equation (B.2), the second derivative is evaluated by means of Georges's operator, proposed in [28]. To simplify the notation, we omit the explicit description of the dependency of the operator on S^x and m. By using this operator, the derivative of ⟨A⟩_γ with respect to γ is obtained.

This immediately leads to the next expression, where we have used ⟨U_x(γ)⟩_γ = 0. From equations (B.5) and (B.6), we obtain equation (B.7), where ω_i is defined in equation (28). By using equations (B.7) and (B.8), we obtain equation (B.9), where Ω is defined in equation (27).

Appendix C. Evaluation of orders of each term in the empirical Bayes likelihood
Here, we evaluate the orders, with respect to n ≫ 1, of each term in equation (24) with m = M; that is, the orders of each term in equation (C.1). In the following, we assume that N = O(n^ρ) (ρ ≥ 0) and that {S_i} is unbiased. In this case, we obtain M = O(n^{−(1+ρ)/2}) and C_1 = O(n^{−1−ρ/2}); the orders of the remaining terms are obtained similarly. This consideration and the experiments in section 4 imply that our method based on the Plefka expansion is valid when all the terms in the empirical Bayes likelihood are O(1). The introduction of the external field changes the condition needed to satisfy this criterion, leading to the appropriate scaling of α. This statement is consistent with the numerical observation that a stable result is obtained even for different n's as long as the appropriate scale of α is maintained, as shown in section 4.