Approximate Message Passing with Restricted Boltzmann Machine Priors

Approximate Message Passing (AMP) has been shown to be an excellent statistical approach to signal inference and compressed sensing problem. The AMP framework provides modularity in the choice of signal prior; here we propose a hierarchical form of the Gauss-Bernouilli prior which utilizes a Restricted Boltzmann Machine (RBM) trained on the signal support to push reconstruction performance beyond that of simple iid priors for signals whose support can be well represented by a trained binary RBM. We present and analyze two methods of RBM factorization and demonstrate how these affect signal reconstruction performance within our proposed algorithm. Finally, using the MNIST handwritten digit dataset, we show experimentally that using an RBM allows AMP to approach oracle-support performance.


Introduction
Over the past decade, a groundswell in research has occurred in both difficult inverse problems, such as those encountered in Compressed Sensing (CS) [1], and in signal representation and classification via deep networks. In recent years, Approximate Message Passing (AMP) [2] has been shown to be a near-optimal, efficient, and extensible application of belief propagation to solving inverse problems which admit a statistical description.
While AMP has enjoyed much success in solving problems for which an i.i.d. signal prior is known, only a few works have investigated the application of AMP to more complex, structured priors. Utilizing such complex priors is key to leveraging many of the advancements recently seen in statistical signal representation. Techniques such as GrAMPA [3] and Hybrid AMP [4] have shown promising results when incorporating correlation models directly between the signal coefficients, and in fact the present contribution is similar in spirit to Hydrid AMP.
Another possible approach is to not attempt to model the correlations directly, but instead to utilize a bipartite construction via hidden variables, as in the Restricted Boltzmann Machine (RBM) [5,6]. The RBM is an example of latent variable model, which we distinguish from the fully visible models considered by [3,4]. Such latent variable models can become quite powerful, as they admit interpretations such as feature extractors and multi-resolution representations. As we will show, if a binary RBM can be trained to model the support patterns of a given signal class, then the statistical description of the RBM easily admits its use within AMP. This is particularly interesting since RBMs are the building blocks of deep belief networks [7] and have enjoyed a surge of renewed interest over the past decade, partly due to the development of efficient training algorithms (e.g. Contrastive Divergence (CD) [6]). The present paper demonstrates the first steps in incorporating deep learned priors into generalized linear problems.

Approximate message passing for compressed sensing
Loopy Belief Propagation (BP) is a powerful iterative message passing algorithm for graphical models [8,9]. However, it presents two main drawbacks when applied to highly-connected, continuous-variable problems, as in CS: first, the need to work with continuous probability distributions; and second, the necessity to iterate over one such probability distribution for each pair of variables. These problems can be addressed by projecting the distributions onto their first two moments and by approximating the messages, which exist on the edges of the factor graph representation of the inference problem, by the marginals existing on the nodes of the factor graph. Applying both to BP for the CS problem, one obtains the AMP iteration.
AMP has been shown to be a very powerful algorithm for CS signal recovery, especially in the Bayesian setting where an a priori model of the unknown signal is given. In CS, one has the following forward model to obtain set of M observations y, where F is an M × N matrix, for M N , representing linear observations of an unknown signal x which are then corrupted by an additive zero-mean white Gaussian noise (AWGN) of variance ∆. The measurement rate α = M/N is of particular interest in determining the difficulty of this inverse problem. In the present work, we use subscript notation to denote the individual coefficients of vectors, i.e. y η refers to the η th coefficient of y and the double-subscript notation to refer to individual matrix elements in row-column order.
We can use AMP to estimate a factorization, up to the first two moments, of the posterior distribution P (x|F, y) ∝ P 0 (x)P (y|F, x), where P (y|F, x) is the likelihood of an observation from the AWGN channel and P 0 (x) is a prior on the signal. If the posterior is trivially factorized, or if it can be approximated by a factorized distribution, one can estimate the unknown signal by averaging over the posterior, which is the minimum mean squared error (MMSE) estimate of x. This is in contrast to utilizing a maximum a posteriori (MAP) approach to the solution of this inverse problem. We refer the reader to [2,10,11,12,13], and in particular to [14], for the present notation and the derivation of AMP from BP and now give directly the iterative form of the algorithm. Given an estimate of the factorized posterior mean a i and variance c i for each element of x, a single step of AMP iteration reads These terms can be interpreted as follows. The variables {ω η } and {V η } represent the first and second moments, respectively, of the marginalized messages on the M factors of the factor graph. The variables {R i } and {Σ 2 i } represent the first and second moments, respectively, of the marginalized messages on the N signal variables of the factor graph. As such, these represent the AMP field on the signal, which are the observational beliefs about the continuous factorized PDF at each signal variable. To find the final factorized posterior of the signal, up to the two moment approximation, we must augment this AMP field by the signal prior. Once the new values of R i and Σ 2 i are computed, the new estimates of the posterior mean and variance, a i and c i , given the prior P 0 (x), are calculated as where is a normalization constant, commonly referred to as a partition function in statistical physics. The final form of these equations for differing values of P 0 (x i ) are given in [14]. As seen in Eqs. (7) and (8), in order to use the AMP framework, one must know some information about the class of signals from which x is drawn. Commonly in CS, we are interested in the case where the signal x is sparse, that is, very few of its coefficients are non-zero. The concept of sparsity implies an assumption that the amount of information required to represent x is actually far less than its dimensionality would admit. Essentially, a sparse prior on x is extremely informative due to its low-entropic nature.
Much of the CS literature focuses on convex approaches to this inverse problem. And so, a convex 1 norm is used as a regularizer to bias solutions to the inverse problem towards sparsity. Within the probabilistic framework, this corresponds to a selection of a Laplace distribution for P 0 (x i ).
However, AMP is not restricted to convex priors, but can utilize non-convex priors of arbitrary complexity. For example, one can use a two-mode Gauss-Bernoulli (GB) prior to model sparse signals (as was considered in detail in [11,12,13]) such as where we have introduced Bernoulli random variables v i such that E [v i ] = ρ, on which the values {x i } are conditioned as P 0 (x i |v i = 1) = N (µ, σ 2 ), and P 0 (x i |v i = 0) = δ(x i ), where δ(·) is the zero-centered Dirac delta distribution. This leads to the final expression for the prior, The GB prior has two possible modes: a zero mode and a non-zero one. If we denote the set of non-zero coefficients of x, those for which v i = 1, to be the support, then the GB prior models both the on-and off-support probability. For ρ ∈ [0, 1], the distribution splits the probabilistic weight between a hard (deterministic) constraint of x i = 0 and a normal distribution for arbitrary parameters that model the on-support coefficients.
For a fixed on-support distribution, here for fixed µ and σ 2 , the value of ρ controls how informative P 0 (x; ρ) is. The more informative this prior, the fewer measurements, smaller α, are required in order to successfully infer x from y. Of course, if x is not truly drawn from P 0 (x; ρ), more measurements will be required to account for the mismatch between the true signal and the assumed signal model, as is the case with signals which are not truly sparse but merely compressible.
If, rather than a fixed probability of being on-support for all sites, we instead have a local probability for each specific site to be on-support, we can write an independent, but non-identically distributed, prior. For conciseness, we refer to this property as "non-i.i.d." in the remainder of this paper. We easily generalize Eq. (10) to This change in the prior must also be reflected in the computation of the means and variances used in in Eqs. (7) and (8). In fact, the partition function becomes where we can write the partition using two sub-partition terms related to the off-support (Z z i ) and on-support (Z nz i ) probabilities. From this partition function, for a given setting of R i and Σ 2 i (which result from the AMP evolution), we can write the posterior means and variances, a and c according to the fixed GB prior parameters {ρ i }, µ, and σ 2 .
A nice feature of this two-mode prior is that it also admits a natural estimation of the probability of a particular coefficient to be on-or off-support. Specifically, at a given point in the AMP evolution, we have Prob where The additional information provided by AMP results in a modified support probabilitỹ Explicating P (v) allows us to envision more complex support models for the coefficients of x. The previous model assumes the independence between the coefficients of x, however, the existence of dependencies, now well-acknowledged for many natural signals as structured sparsity, can be leveraged through joint models.
In the variational Bayesian context, we cite [4,15,16], which consider neighborhood probabilities, Markov chains, and so-called Boltzmann machines, respectively, as generic support models. In similar vein, we propose the use of a binary RBM as a joint support model. In contrast with the models of [4,15], the RBM model can provide a more accurate modeling of support correlations. Also, the RBM model can be trained at low cost [6], which is the main bottleneck of the general Boltzmann machine model used in [16].

Binary restricted Boltzmann machines
An RBM is an energy based model defined over both visible and a set of latent, or hidden, variables. From the perspective of statistical physics, the RBM can be viewed as a boolean Ising model existing on a bipartite graph. The joint probability distribution over the visible and hidden layers for the RBM is given by where the energy E(v, h) reads The RBM model is described by the two sets of biasing coefficients b v and b h on the visible and hidden layers, respectively, and the learned connections between the layers represented by the matrix W.
In the sequel, we leverage the connection between the RBM and the well known results of statistical physics to discuss a simplification of the RBM under the so-called mean-field approximation in both the 1 st -order and 2 nd -order approximation, known as Thouless-Anderson-Palmer (TAP) approximation [17,18,19], in order to obtain factorizations over the visible and hidden layers from this joint distribution.

Mean-field approximation of the RBM
Given the value of an Energy-function E({x}), also called Hamiltonian in physics, a standard technique is to use the Gibbs variational approach where the Gibbs free energy F is minimized over a trial distribution, P var , with where · {Pvar} denotes the average over distribution P var and S Gibbs is the Gibbs entropy.
It is instructive to first review the simplest variational solution, namely, the 1 storder naïve mean-field (NMF) approximation, where P var = i Q i (x i ). Within this ansatz, a classical computation shows that the free energy, in the case of the binary RBM, reads as where Since the hidden variables are also binary, this identity is equally true forh j . The fixed points of the means of both the visible and hidden units are of particular interest. With these fixed points we can calculate a factorization for both P (v) and P (h). First, we look at the derivatives of the NMF free energy for the RBM w.r.t. the visible and hidden sites, which, when evaluated at the critical point, gives us the fixed-point conditions for the expected values of the variables where sigm (x) [1 + e −x ] −1 . These equations are in line with the assumed NMF fixedpoint conditions used for finding the "site activations" given in the RBM literature.
In fact, they are often used as a fixed-point iteration (FPI) to find the minimum free energy.
We will now modify the NMF solution via a 2 nd -order correction, as was originally shown for RBMs in [20] using TAP. The TAP approach is a classical tool in statistical physics and spin glass theory which improves on the NMF approximation by taking into account further correlations. In many situations, the improvement is drastic, making the TAP approach very popular for statistical inference [19]. There are many ways in which the TAP equations can be presented. We shall refer, for the sake of this presentation, to the known results of the statistical physics community. One approach to derive the TAP approximation is to recognize that the NMF free energy is merely the first term in a perturbative expansion in power of the coupling constants W, as was shown by Plefka [21], and to keep the 2 nd -order term. Alternatively, one may start with the Bethe approximation and use the fact that the system is densely connected [19,18]. Proceeding according to Plefka, the TAP free energy reads where we have denoted the variances of hidden and visible variables,ĥ andv, Eqs. (22) and (23), often called the TAP equations, can be seen are an extension of the mean field iteration of Eqs. (19) and (20). The additional term is called the Onsager retro-action term in statistical physics [18]. In fact, these are the tools one uses in order to derive AMP itself. Given an RBM model, we now have two approximated solutions to obtain the equilibrium marginal through the iteration of either Eqs. (19) and (20) or Eqs. (22) and (23).
Because of the sigmoid functions in the fixed-point conditions, iterating on these fixed points will not wildly diverge. However, it is possible that such an FPI will arrive at one of the two trivial solutions for the factorization, either the ground-state of the field-less RBM or 1 2 (sign(b v ) + 1). Whether or not the FPI arrives at these trivial points relies on the balance between the evolution of the FPI and the contributing fields. It may also enter an oscillatory state, especially if the learned RBM couplings are too large in magnitude, reducing the accuracy of the Plefka expansion which assumes small magnitude couplings. We shall see, however, that this FPI works extremely well in practice.

RBMs for AMP
The scope of this work is to use the RBM model within the AMP framework and to perform inference according to the graphical model depicted in Fig. 1. Here, we would like to utilize the binary RBM to give us information on each site's likelihood of being on-support, that is, its probability to be a non-zero coefficient. This shoehorns nicely into our sparse GB prior as in Eq. (10).
Since, in the case of an RBM, P (v i ) has the classical exponential form of an energybased model, we see from Eq. (13) that the information provided by AMP simply amounts to an additional local bias on the visible variables equal to ln γ i . That is, the AMP-modified RBM free energy is simply the introduction of an additional field term along with a constant bias, As we can see, this field effect only exists on the visible layer of the RBM. Because of this, the AMP framework does not put any extra influence on the hidden layer, but only on the visible layer. Thus, the fixed point of the hidden layer means is not influenced by AMP, but the visible ones are. With respect to (19) and (22), the AMP-modified fixed points of the the visible variable means contain one additional additive term within the sigmoid, givinḡ for the NMF-based fixed point. For the TAP we havē The most direct approach for factorizing the AMP influenced RBM is to construct a fixed-point iteration (FPI) using the NMF or TAP fixed-point conditions. We give

Prior Side
Observation Side the final construction of the AMP algorithm with the RBM support prior in Alg. 1 using this approach. This approach can be understood in terms of its graphical representation given in Fig. 1, which shows the network of statistical dependencies from the observations to the hidden RBM hidden units. Given some initial condition forv andh, we can successively estimate the visible and hidden layers via their respective fixed-point equations. Empirically, we have found that the hidden-layer variables should be initialized to zero, while the visible side variables can be initialized to zero, uniformly randomly, or by drawing a random initial condition according to the distribution implied by the RBM visible bias b v . For the FPI schedule, one might intuitively attempt an approach similar to persistent contrastive divergence [22] and allow the values ofv andh persist throughout the AMP FPI, taking only a single FPI step on the RBM magnetizations at each AMP iteration. Indeed, we have found this to be the most computationally efficient integration of the RBM into the AMP framework. However, we point out that this persistent strategy should only begin after a few AMP iterations have been completed. In the first iterations, the RBM factorization should instead be allowed to converge on a value ofv. Persistence of the hidden and visible magnetizations from the first iteration leads to poor reconstruction performance, especially when α is small. Because the early values of {ln γ i } can be quite weak, using only a single update step on the RBM magnetizations early in the reconstruction does not adequately enforce the support prior. Later, as the AMP fields grow in magnitude, a single-step persistent update of the RBM magnetizations can be used to decrease run time.
Finally, after obtaining the value ofv, either by running the RBM FPI until convergence or by taking a single step onv, we must infer the correct values {ρ i }. One might at first attempt to use the setting {ρ i =v i }, however, this is an improper approach for the hierarchical support prior we have proposed. Instead, one should use to obtain the correct per-pixel sparsity terms to use in conjunction with the standard Gauss-Bernoulli form of (7) and (8). (20)

Numerical results with MNIST
To show the efficacy of the proposed RBM-based support prior within the AMP framework, we present a series of experiments using the MNIST database of handwritten digits. Each sample of the database is a 28 × 28 pixel image of a digit with values in the range [0, 1]. In order to build a binary RBM model of the support of these handwritten digits, the training data was thresholded so that all non-zero pixels were given a value of 1. We then train an RBM with 500 hidden units on 60,000 training samples from the binarized MNIST training set using the sampling-based contrastive divergence (CD-1) technique [6] for 100 epochs, averaging the RBM model parameter gradients across 100-sample minibatches, at a learning rate of 0.005 under the prescriptions of [23]. We Comparison of MNIST reconstruction performance over 300 test set digits over the measurement rate α. For both charts, lower curves represent greater reconstruction performance. (Left) Percentage of test set digit images successfully (MSE ≤ 10 −4 ) recovered. Note that the AMP-RBM model comes very close to the oracle reconstruction performance bound for this test set. (Right) Correlation, in terms of MCC, between the recovered support and the true digit image support. Here, the solid lines represent the average correlation at each tested α, while the solid regions represent the range of one standard deviation of the correlations over the test set.
additionally impose an 2 weight-decay penalty on the magnitude of the elements of W at a strength of 0.001. Such penlaties, while also desirable for learning performance [23], are necessary for the TAP-based AMP-RBM due to the fact that the Plefka expansion of Eq. (21) is reliant on the magnitudes of W being small. For further reading on training RBMs, as well as other undirected graphical models, we refer the reader to [24]. Once the generative RBM support model is obtained, we construct the CS experiments as follows. For a given measurement rate α, we draw the i.i.d. entries of F from a normal distribution of variance 1/ √ N . The linear projections F x are subsequently corrupted with an AWGN of variance ∆ = 10 −8 to form y. In all experiments we utilize the first 300 digit images from the MNIST test set.
We compare the following approaches in Fig. 2. First, we show the reconstruction performance of AMP using an i.i.d. GB prior (AMP-GB), assuming that the true image-wide empirical ρ is given as a parameter for each specific test image. Next, we demonstrate a simple modification to this procedure: the GB prior is assumed to be noni.i.d. and the values of {ρ i } are empirically estimated from the training samples as the probability of each pixel to be non-zero. We expect that our proposed approach should at least perform as well as non-i.i.d. AMP-GB, as this same information should be encoded in b v for a properly trained RBM model. Hence, this approach should correspond to an RBM with zero couplings. We also show the performance of the proposed approach: AMP used in conjunction with the RBM support model. We present results for both the NMF and TAP factorizations of the RBM. For the AMP-RBM approaches, we start persistence after AMP 50 iterations. For all tested approaches, we assume that ∆ is known to the reconstruction algorithms, as well as the prior parameters µ and σ 2 . This is not a strict requirement, however, as channel and prior parameters can be estimated within the AMP iteration if so desired [14].
In the left panel of Fig. 2, we present the percentage of successfully recovered test set digit images of the 300 digit images tested. A successful reconstruction is denoted as one which achieves MSE ≤ 10 −4 . It is easily observable from these results that leveraging a non-i.i.d. support prior does indeed provide drastic performance improvements, as even the simple non-i.i.d. version of AMP-GB recovers significantly more digits than i.i.d. AMP-GB. We also see that by using the RBM model of the support, along with a TAP-based support probability estimation, we are able to improve upon this simple approach. For example, at α = 0.22, by using the RBM support prior in conjunction with TAP factorization we are able to recover an additional 56.0% of the test set than with no support information, and an additional 29% than when using only empirical percoefficient support probabilities. The fact that both RBM-AMP approaches improve on non-i.i.d. AMP-GB demonstrates that the learned support correlations are genuinely providing useful information during the AMP CS reconstruction procedure. We also note that the 2 nd -order TAP factorization provides reconstruction performance on this test set which is either equal to, or better than, the NMF factorization, at the cost of an additional matrix-vector multiplication at each iteration of the RBM factorization. To demonstrate how these approaches compare with maximum achievable performance, we also show the support oracle performance, which corresponds to the percentage of test samples for which ρ ≤ α. The proposed AMP-RBM approach, for the given RBM support model, closes the gap to oracle performance.
In the right panel of Fig. 2, we show the performance of the support estimation in terms of the Matthews Correlation Coefficient (MCC) which is calculated from the 2 × 2 confusion matrix of estimated support and the true support. We observe from this chart that, even though measurement rate might be so low as to prevent an accurate reconstruction in terms of MSE, the estimated support of the recovered image may indeed be highly correlated with the true image. However, the values of the these on-support coefficients will not be correctly estimated in this regime. The estimated support may still be used for certain tasks in this case, such as classification. More visual evidence of this effect can be seen in Fig. 3, were we see, even to the down to the extreme limit of α < 0.1, the support of the AMP-RBM recovered images are still quite correlated with the true image. For non-i.i.d. AMP-GB, the effect of the prior effectively forces the support to the central region of the image, and for i.i.d. AMP-GB, the support information is completely lost, having almost zero correlation with the true image.
It is also of interest to analyze the effect of RBM training and model parameters on the performance of the proposed approach. Indeed, it is curious to note exactly how well trained the RBM must be in order to obtain the performance demonstrated above. In the right panel of Fig. 4 we see that a lightly trained, here on the order of 40 epochs, RBM attains maximal performance, showing that an overwrought training procedure is not necessary in order to obtain significant performance improvements for CS reconstruction. Additionally, in the left panel of Fig. 4 we see that increasing the complexity of the RBM model, and therefore more accurately estimating the joint support probability,increases reconstruction performance, showing that more complex RBMs, perhaps even stacked RBM models, may have the potential to further improve upon these results.
Lastly, in terms of computational efficiency and scalability, the inclusion of the inner RBM factorization loop does increase the computational burden of the reconstruction in proportion to the number of hidden units and the number of factorization iterations required for convergence. However, these iterations are computationally light in comparison to the AMP FPI and so the use of the RBM support prior is not unduly burdensome.

Conclusion
In this work we show that using an RBM-based prior on the signal support, when learned properly, can provide CS reconstruction performance superior to that of both simple i.i.d. and empirical non-i.i.d. sparsity assumptions for a message passing approach such as AMP. The implications of such an approach are large as these results pave the way for the introduction of much more complex and deep-learned priors. Such priors can be applied to the signal support as we have done here, or further modifications can be made to adapt the AMP framework to the use of RBMs with real-valued visible layers. Such priors would even aid in moving past the M = K oracle support transition.
Additionally, a number of interesting generalizations of our approach are possible. While the experiments we present here are only concerned with linear projections observed through an AWGN channel, much more general, non-linear, observation models can be used moving from AMP to GAMP [11]. Our approach can be then readily applied with essentially no modification to the algorithm. With the successful application of statistical physics tools to signal reconstruction, as was done in applying TAP to derive AMP, similar approaches could be adapted to produce even better learning algorithms for single and stacked RBMs. Perhaps such future works might allow for the estimation of the RBM model in parallel with signal reconstruction.