Topical Review | Open access

Mean-field inference methods for neural networks

Published 13 May 2020 © 2020 The Author(s). Published by IOP Publishing Ltd

Special issue: Machine Learning and Statistical Physics: Theory, Inspiration, Application

Citation: Marylou Gabrié 2020 J. Phys. A: Math. Theor. 53 223002. DOI 10.1088/1751-8121/ab7f65

Abstract

Machine learning algorithms relying on deep neural networks recently allowed a great leap forward in artificial intelligence. Despite the popularity of their applications, the efficiency of these algorithms remains largely unexplained from a theoretical point of view. The mathematical description of learning problems involves very large collections of interacting random variables, difficult to handle analytically as well as numerically. This complexity is precisely the object of study of statistical physics. Its mission, originally aimed at natural systems, is to understand how macroscopic behaviors arise from microscopic laws. Mean-field methods are one type of approximation strategy developed in this view. We review a selection of classical mean-field methods and recent progress relevant for inference in neural networks. In particular, we recall the principles behind the derivations of high-temperature expansions, the replica method and message passing algorithms, highlighting their equivalences and complementarities. We also provide references for past and current directions of research on neural networks relying on mean-field methods.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

With the continuous improvement of storage techniques, the amount of available data is currently growing exponentially. While it is not humanly feasible to treat all the data created, machine learning, a class of algorithms that automatically infer structure in large data sets, offers one possible response. In particular, deep learning methods, based on neural networks, have drastically improved performance in key fields of artificial intelligence, such as image processing, speech recognition and text mining. A good review of the first successes of this technology, published in 2015, is [LBH15]. A few years later, the current state of the art of this very active line of research is difficult to envision globally. However, the complexity of deep neural networks remains an obstacle to the understanding of their great efficiency. Made of many layers, each of which is constituted of many neurons, themselves accompanied by a collection of parameters, the complete set of variables describing a typical neural network is impossible even to visualize. Instead, aggregated quantities must be considered to characterize these models and hopefully help explain the learning process. The first open challenge is therefore to identify the relevant observables to focus on. Often enough, what seems interesting is also what is hard to calculate: in the high-dimensional regime we need to consider, exact analytical forms are unknown most of the time and numerical computations are ruled out. Approximation methods that are simultaneously simple enough to be tractable and fine enough to retain interesting features are therefore highly needed.

In this context where dimensionality is an issue, physicists have found that macroscopic behaviors are typically well described by the theoretical limit of infinitely large systems. In this thermodynamic limit, the statistical physics of disordered systems offers powerful frameworks of approximation called mean-field theories. Interactions between physics and neural network theory already have a long history, as we will discuss in section 6.1. Yet these interconnections have been reinvigorated by the recent progress in deep learning, which also brought new theoretical challenges.

Here, we wish to provide a concise methodological review of fundamental mean-field inference methods with their application to neural networks in mind. Our aim is also to provide a unified presentation of the different approximations that allows one to understand how they relate and differ. Readers may also be interested in related review papers. Another methodological review is [ALG13], particularly focused on applications to neurobiology. The methods presented in that reference overlap significantly with what will be covered in the following, with some elements of random matrix theory additionally introduced. The approximations and algorithms discussed here are also largely reviewed in [ZK16]. That paper includes more details on spin glass theory, which originally motivated the development of the classical mean-field methods, and particularly focuses on community detection and linear estimation. Beyond their differing motivational applications, both of these references also predate some recent exciting developments in mean-field inference covered in the present review, in particular extensions toward multi-layer networks. An older, yet very interesting, reference is the workshop proceedings [OS01], which collected both insightful introductory papers and research developments on the applications of mean-field methods in machine learning. Finally, the recent [CCC+19] covers more generally the connections between the physical sciences and machine learning, yet without detailing the methodologies. That review provides a very good list of references where statistical physics methods were used for learning theory, but also where machine learning helped in turn in physics research.

Given that the literature presented below lies at the crossroads of deep learning and disordered systems physics, we include short introductions to the fundamental concepts of both domains. These sections 2 and 3 will help readers with one or the other background, but can be skipped by experts. In section 4, classical mean-field inference approximations are derived on neural network examples. Section 5 covers some recent extensions of the classical methods that are of particular interest for applications to neural networks. We review in section 6 a selection of important historical and current directions of research in neural networks leveraging mean-field methods. As a conclusion, the strengths, limitations and perspectives of mean-field methods for neural networks are discussed in section 7.

2. Machine learning with neural networks

Machine learning is traditionally divided into three classes of problems: supervised, unsupervised and reinforcement learning. For all of them, the advent of deep learning techniques, relying on deep neural networks, has brought great leaps forward in terms of performance and opened the way to new applications. Nevertheless, the remarkably efficient machinery of these algorithms remains full of theoretical puzzles. This section provides fundamental concepts in machine learning for the unfamiliar reader willing to approach the literature at the crossroads of statistical physics and deep learning. We also take this section as an opportunity to introduce the current challenges in building a strong theoretical understanding of deep learning. A comprehensive reference is [GBC16], while [MWD+18] offers a broad introduction to machine learning specifically addressed to physicists.

2.1. Supervised learning

Learning under supervision. Supervised learning aims at discovering systematic input to output mappings from examples. Classification is a typical supervised learning problem: for instance, from a set of pictures of cats and dogs labeled accordingly, the goal is to find a function able to predict in any new picture the species of the displayed pet.

In practice, the training set is a collection of P example pairs $\mathcal{D}={\left\{{\underline{x}}_{\left(k\right)},{\underline{y}}_{\left(k\right)}\right\}}_{k=1}^{P}$ from an input data space $\mathcal{X}\subseteq {\mathbb{R}}^{N}$ and an output data space $\mathcal{Y}\subseteq {\mathbb{R}}^{M}$. Formally, they are assumed to be i.i.d. samples from a joint distribution $p\left(\underline{x},\underline{y}\right)$. The predictor h is chosen by a training algorithm from a hypothesis class, a set of functions from $\mathcal{X}$ to $\mathcal{Y}$, so as to minimize the error on the training set. This error is formalized as the empirical risk

$$\hat{\mathcal{R}}\left(h\right)=\frac{1}{P}\sum_{k=1}^{P}\ell\left({\underline{y}}_{\left(k\right)},h\left({\underline{x}}_{\left(k\right)}\right)\right)\qquad (1)$$

where the definition involves a loss function $\ell :\mathcal{Y}{\times}\mathcal{Y}\to \mathbb{R}$ measuring differences in the output space. This learning objective nevertheless does not guarantee generalization, i.e. the ability of the predictor h to be accurate on inputs $\underline{x}$ that are not in the training set. It is a surrogate for the ideal, but unavailable, population risk

$$\mathcal{R}\left(h\right)=\mathbb{E}_{\left(\underline{x},\underline{y}\right)\sim p\left(\underline{x},\underline{y}\right)}\left[\ell\left(\underline{y},h\left(\underline{x}\right)\right)\right]\qquad (2)$$

expressed as an expectation over the joint distribution $p\left(\underline{x},\underline{y}\right)$. The different choices of hypothesis classes and training algorithms yield the now crowded zoo of supervised learning algorithms.

Representation ability of deep neural networks. In the context of supervised learning, deep neural networks enter the picture as a parametrized hypothesis class. Let us first quickly recall the simplest network, the perceptron, which includes only a single neuron. It is formalized as a function from ${\mathbb{R}}^{N}$ to $\mathcal{Y}\subset \mathbb{R}$ applying an activation function f to a weighted sum of its inputs shifted by a bias $b\in \mathbb{R}$,

$$\hat{y}=h_{\underline{w},b}\left(\underline{x}\right)=f\left({\underline{w}}^{\top}\underline{x}+b\right)\qquad (3)$$

where the weights are collected in the vector $\underline{w}\in {\mathbb{R}}^{N}$. From a practical standpoint, this very simple model can only solve the classification of linearly separable groups (see figure 1). Yet from the point of view of learning theory, it has been the starting point of a rich statistical physics literature that will be discussed in section 6.1.
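For concreteness, the prediction rule (3) translates into a one-line function. The following minimal numpy sketch is not part of the original text; the sign activation and the random Gaussian input are illustrative assumptions:

```python
import numpy as np

def perceptron(x, w, b, f=np.sign):
    """Single-neuron predictor of equation (3): y_hat = f(w.x + b)."""
    return f(w @ x + b)

# Toy usage with an assumed sign activation and random Gaussian input.
rng = np.random.default_rng(0)
N = 10
w, b = rng.standard_normal(N), 0.5
x = rng.standard_normal(N)
print(perceptron(x, w, b))  # prints +1.0 or -1.0
```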


Figure 1. Let us assume we wish to classify data points $\underline{x}\in {\mathbb{R}}^{2}$ with labels y = ±1. We choose as a hypothesis class the perceptron sketched on the left with sign activation. For a given weight vector $\underline{w}$ and bias b, the plane is divided by a decision boundary assigning labels. If the training data are linearly separable, then it is possible to perfectly predict all the labels with the perceptron; otherwise it is impossible.


Combining several neurons into networks defines more complex functions. The universal approximation theorem [Cyb89, Hor91] proves that the following two-layer network architecture can approximate any well-behaved function with a finite number of neurons,

$$\hat{y}=h_{\theta}\left(\underline{x}\right)=\sum_{\alpha=1}^{M}w_{\alpha}^{\left(2\right)}f\left({\underline{w}}_{\alpha}^{\left(1\right)\top}\underline{x}+b_{\alpha}\right)\qquad (4)$$

for f a bounded, non-constant, continuous scalar function, acting component-wise. In the language of deep learning this network has one hidden layer of M units. Input weight vectors ${\underline{w}}_{\alpha }^{\left(1\right)}\in {\mathbb{R}}^{N}$ are collected in a weight matrix ${\underline{\underline{W}}}^{\left(1\right)}\in {\mathbb{R}}^{M{\times}N}$. Here, and in the following, the notation θ is used as shorthand for the collection of adjustable parameters. The universal approximation theorem is a strong result in terms of the representative power of neural networks, but it is not constructive. It neither quantifies the size of the network, i.e. the number M of hidden units, necessary to approximate a given function, nor prescribes how to obtain the values of the parameters ${\underline{w}}^{\left(2\right)},{\underline{\underline{W}}}^{\left(1\right)}$ and $\underline{b}$ for the optimal approximation. While the construction of an approximation theory is still ongoing (see e.g. [GPEB19]), practice, led by empirical considerations, has nevertheless demonstrated the efficiency of neural networks.

In applications, neural networks with multiple hidden layers, deep neural networks, are preferred. A generic neural network of depth L is the function

$$\hat{\underline{y}}=h_{\theta}\left(\underline{x}\right)=f\left({\underline{\underline{W}}}^{\left(L\right)}f\left(\cdots f\left({\underline{\underline{W}}}^{\left(1\right)}\underline{x}+{\underline{b}}^{\left(1\right)}\right)\cdots\right)+{\underline{b}}^{\left(L\right)}\right)\qquad (5)$$

$$\theta=\left\{{\underline{\underline{W}}}^{\left(l\right)}\in {\mathbb{R}}^{N_{l}{\times}N_{l-1}},\ {\underline{b}}^{\left(l\right)}\in {\mathbb{R}}^{N_{l}}\right\}_{l=1}^{L}\qquad (6)$$

where N0 = N is the dimension of the input and NL = M is the dimension of the output. The architecture is fixed by specifying the number of neurons, or width, of the hidden layers ${\left\{{N}_{l}\right\}}_{l=1}^{L-1}$. The states of the successive layers can be denoted ${\underline{t}}^{\left(l\right)}\in {\mathbb{R}}^{{N}_{l}}$ and follow the recursion

$${\underline{t}}^{\left(0\right)}=\underline{x}\qquad (7)$$

$${\underline{t}}^{\left(l\right)}=f\left({\underline{\underline{W}}}^{\left(l\right)}{\underline{t}}^{\left(l-1\right)}+{\underline{b}}^{\left(l\right)}\right),\quad l=1,\ldots,L\qquad (8)$$

$$\hat{\underline{y}}={\underline{t}}^{\left(L\right)}\qquad (9)$$

Fixing the activation functions and the architecture of a neural network defines a hypothesis class. It is crucial that activations introduce non-linearities; the most common are the hyperbolic tangent tanh and the rectified linear unit defined as relu(x) = max(0, x). Note that it is also possible to define stochastic neural networks by using noisy activation functions; these are uncommon in supervised learning applications, except at training time so as to encourage generalization [PSDG14, SHK+14].
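As an illustration, the recursion (7)–(9) translates directly into a forward pass. Below is a minimal sketch assuming random Gaussian weights, zero biases and a relu activation, all placeholder choices:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, weights, biases, f=relu):
    """Forward pass of equations (7)-(9):
    t0 = x, t_l = f(W_l t_{l-1} + b_l), y_hat = t_L."""
    t = x
    for W, b in zip(weights, biases):
        t = f(W @ t + b)
    return t

# Toy architecture N0=4 -> N1=8 -> N2=1 with random parameters.
rng = np.random.default_rng(0)
sizes = [4, 8, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.standard_normal(4), weights, biases))
```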

An originally proposed intuition for the advantage of depth is that it enables to treat the information in a hierarchical manner; either looking at different scales in different layers, or learning more and more abstract representations [BCV13]. Nevertheless, getting a clear theoretical understanding why in practice 'the deeper the better' is still an ongoing direction of research (see e.g. [Dan17, SES19, Tel16]).

Neural network training. Given an architecture defining hθ, the supervised learning objective is to minimize the empirical risk $\hat{\mathcal{R}}$ with respect to the parameters θ. This optimization problem lives in a space whose dimension is the number of parameters, which can range from tens to millions. The idea underlying the majority of training algorithms is to perform a gradient descent (GD) starting from parameters drawn randomly from an initialization distribution:

$$\theta^{0}\sim p_{\theta_{0}}\left(\theta^{0}\right)\qquad (10)$$

$$\theta^{t+1}=\theta^{t}-\eta\nabla_{\theta}\hat{\mathcal{R}}\left(\theta^{t}\right)\qquad (11)$$

The parameter η is the learning rate, controlling the size of the step in the direction of decreasing gradient per iteration. The computation of the gradients can be performed in a time scaling linearly with depth by applying the derivative chain rule, leading to the back-propagation algorithm [GBC16]. A popular alternative to gradient descent is stochastic gradient descent (SGD), where the sum of the gradients over the entire training set is replaced by a sum over a small number of samples, randomly selected at each step [Bot10, Rob51].
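To fix ideas, here is a minimal sketch of the iteration (10) and (11) in its stochastic variant on a toy problem. The linear model and squared loss are assumptions chosen so that the gradient can be written in closed form, sidestepping back-propagation:

```python
import numpy as np

rng = np.random.default_rng(0)
P, N, eta, batch = 200, 5, 0.1, 16
X = rng.standard_normal((P, N))
theta_star = rng.standard_normal(N)       # ground-truth parameters
y = X @ theta_star

theta = rng.standard_normal(N)            # equation (10): random initialization
for _ in range(500):
    idx = rng.choice(P, size=batch, replace=False)  # mini-batch for SGD
    grad = 2 * X[idx].T @ (X[idx] @ theta - y[idx]) / batch
    theta -= eta * grad                   # equation (11): gradient step
print(np.linalg.norm(theta - theta_star))  # close to 0 after convergence
```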

During the training iterations, one typically monitors the training error (another name for the empirical risk given a training data set) and the validation error. The latter corresponds to the empirical risk computed on a set of points held out from the training set, the validation set, used to assess the generalization ability of the model, either during training or in order to select training hyperparameters such as the value of the learning rate. A posteriori, the performance of the model is judged from the generalization error, which is evaluated on the never-seen test set. While two different training algorithms (e.g. GD vs SGD) may achieve zero training error, they may differ in the level of generalization they typically reach.

Open questions and challenges. Building on the fundamental concepts presented in the previous paragraphs, practitioners managed to bring deep learning to unanticipated performances in the automatic processing of images, speech and text (see [LBH15] for a review of the early successes). Still, many of the greatest successes in the field of neural networks were obtained using ingenious tricks, while many fundamental theoretical questions remain unresolved.

Regarding the optimization first, (S)GD training generally discovers parameters close to zero risk. Yet gradient descent is guaranteed to converge to the neighborhood of a global minimum only for a convex function, and is otherwise expected to get stuck in a local minimum. Therefore, the efficiency of gradient-based optimization is a priori a paradox, given that the empirical risk $\hat{\mathcal{R}}$ is non-convex in the parameters θ. Second, the generalization ability of deep neural networks trained by (S)GD is still poorly understood. The size of training data sets is limited by the cost of labeling by humans, experts or heavy computations. Training a deep and wide network thus amounts in practice to fitting a model of millions of degrees of freedom against a comparatively small number of data points. Nevertheless, this does not systematically lead to overfitting: the resulting neural networks can have surprisingly good predictions both on inputs seen during training and on new inputs [ZBH+17]. Results in the literature relating the size and architecture of a network to a measure of its ability to generalize are too far from realistic settings to guide the choices of practitioners. On the one hand, traditional bounds in statistics, considering worst cases, appear overly pessimistic [AAKZ19, BM02, SSBD14, Vap00]. On the other hand, historical statistical physics analyses of learning, briefly reviewed in section 6.1, only concern simple architectures and synthetic data. This lack of theory results in potentially important waste: in terms of the time lost by engineers in trial and error to optimize their solutions, and in terms of the electrical resources used to train and re-train possibly oversized networks while storing potentially unnecessarily large training data sets.

The success of deep learning, beyond these apparent theoretical puzzles, certainly lies in the interplay of advantageous properties of training algorithms, the neural network hypothesis class and structures in typical data (e.g. real images, conversations). Disentangling the role of the different ingredients is a very active line of research (see [VBGS17] for a review).

2.2. Unsupervised learning

Density estimation and generative modeling. The goal of unsupervised learning is to directly extract structure from data. Compared to the supervised learning setting, the training data set is made of a set of example inputs $\mathcal{D}={\left\{{\underline{x}}_{\left(k\right)}\right\}}_{k=1}^{P}$ without corresponding outputs. A simple example of unsupervised learning is clustering, consisting in the discovery of unlabelled subgroups in the training data. Most unsupervised learning algorithms either implicitly or explicitly adopt a probabilistic viewpoint and implement density estimation. The idea is to approximate the true density $p\left(\underline{x}\right)$ from which the training data was sampled by the closest (in various senses) element among a family of parametrized distributions over the input space $\left\{{p}_{\theta }\left(.\right),\theta \in {\mathbb{R}}^{{N}_{\theta }}\right\}$. The selected pθ is then a model of the data. If the model pθ is easy to sample from, it can be used to generate new inputs comparable to the training data points, which leads to the terminology of generative models. In this context, unsupervised deep learning exploits the representational power of deep neural networks to create sophisticated candidates pθ.

A common formalization of the learning objective is to maximize the likelihood, defined as the probability of i.i.d. draws from the model pθ to have generated the training data $\mathcal{D}={\left\{{\underline{x}}_{\left(k\right)}\right\}}_{k=1}^{P}$, or equivalently its logarithm,

$$\theta^{\ast}=\underset{\theta}{\text{argmax}}\,\prod_{k=1}^{P}p_{\theta}\left({\underline{x}}_{\left(k\right)}\right)=\underset{\theta}{\text{argmax}}\,\sum_{k=1}^{P}\log\,p_{\theta}\left({\underline{x}}_{\left(k\right)}\right)\qquad (12)$$

The second logarithmic additive formulation is generally preferred. It can be interpreted as the minimization of the Kullback–Leibler divergence between the empirical distribution ${p}_{\mathcal{D}}\left(\underline{x}\right)={\sum }_{k=1}^{P}\delta \left(\underline{x}-{\underline{x}}_{\left(k\right)}\right)/P$ and the model pθ:

$$\theta^{\ast}=\underset{\theta}{\text{argmin}}\,\mathrm{KL}\left(p_{\mathcal{D}}\Vert p_{\theta}\right)=\underset{\theta}{\text{argmin}}\,\mathbb{E}_{p_{\mathcal{D}}}\left[\log\frac{p_{\mathcal{D}}\left(\underline{x}\right)}{p_{\theta}\left(\underline{x}\right)}\right]\qquad (13)$$

although considering the divergence with the discrete empirical measure is slightly abusive. The details of the optimization algorithm depend on the specification of pθ. As we will see, the likelihood is often itself intractable, and learning then consists in a gradient ascent on, at best, a lower bound, and otherwise an approximation, of the likelihood.

A few years ago, an alternative strategy called adversarial training was introduced by [GPAM+14]. Here an additional trainable model called the discriminator, for instance parametrized by ϕ and denoted dϕ(⋅), computes the probability for points in the input space $\mathcal{X}$ of belonging to the training set $\mathcal{D}$ rather than being generated by the model pθ(⋅). The parameters θ and ϕ are trained simultaneously, such that the generator learns to fool the discriminator and the discriminator learns not to be fooled by the generator. The optimization problem usually considered is

$$\underset{\theta}{\min}\,\underset{\phi}{\max}\,\mathbb{E}_{\underline{x}\sim p_{\mathcal{D}}}\left[\log\,d_{\phi}\left(\underline{x}\right)\right]+\mathbb{E}_{\underline{x}\sim p_{\theta}}\left[\log\left(1-d_{\phi}\left(\underline{x}\right)\right)\right]\qquad (14)$$

where the sum of the expected log-probabilities, according to the discriminator, for examples in $\mathcal{D}$ to be drawn from $\mathcal{D}$ and for examples generated by the model not to be drawn from $\mathcal{D}$, is maximized with respect to ϕ and minimized with respect to θ.
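The value function in (14) is easily estimated from samples. A minimal sketch follows, with assumed one-dimensional Gaussian stand-ins for $p_{\mathcal{D}}$ and pθ and a fixed logistic discriminator, none of which come from [GPAM+14]:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
d_phi = lambda x: sigmoid(2.0 * x)       # assumed discriminator
x_data = rng.normal(1.0, 1.0, 10_000)    # samples standing in for p_D
x_gen = rng.normal(-1.0, 1.0, 10_000)    # samples standing in for p_theta
# Monte Carlo estimate of the objective (14), maximized over phi and
# minimized over theta during adversarial training.
value = np.mean(np.log(d_phi(x_data))) + np.mean(np.log(1 - d_phi(x_gen)))
print(value)
```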

In the following, we present two classes of generative models based on neural networks.

Deep generative models. A deep generative model defines a density pθ obtained by propagating a simple distribution through a deep neural network. It can be formalized by introducing a latent variable $\underline{z}\in {\mathbb{R}}^{N}$ and a deep neural network hθ similar to (5) with input dimension N. The generative process is then

$$\underline{z}\sim p_{z}\left(\underline{z}\right)\qquad (15)$$

$$\underline{x}\sim p_{\text{out}}\left(\cdot\,\vert\,h_{\theta}\left(\underline{z}\right)\right)\qquad (16)$$

where pz is typically a factorized distribution on ${\mathbb{R}}^{N}$ that is easy to sample from (e.g. a standard normal distribution), and ${p}_{\text{out}}\left(.\vert {h}_{\theta }\left(\underline{z}\right)\right)$ is for instance a multivariate Gaussian distribution whose mean and covariance are functions of ${h}_{\theta }\left(\underline{z}\right)$. The motivation to consider this class of models for joint distributions is three-fold. First, the class is highly expressive. Second, it follows from the intuition that data sets live on low-dimensional manifolds, which here can be spanned by varying the latent representation $\underline{z}$, whose dimension is usually much smaller than that of the input space (for further intuition, see also the reconstruction objective of the first autoencoders, e.g. chapter 14 of [GBC16]). Third, and perhaps more importantly, the class can easily be optimized over using back-propagation, unlike the restricted Boltzmann machines presented in the next paragraph, which deep generative models have now largely replaced. There are two main types of deep generative models: generative adversarial networks (GAN) [GPAM+14], trained following the adversarial objective mentioned above, and variational autoencoders (VAE) [KW14, RMW14], trained to maximize a likelihood lower bound.

Variational autoencoders. The computation of the likelihood of one training sample ${\underline{x}}_{\left(k\right)}$ for a deep generative model (15) and (16) then requires marginalizing over the latent variable $\underline{z}$,

$$p_{\theta}\left({\underline{x}}_{\left(k\right)}\right)=\int \mathrm{d}\underline{z}\;p_{\text{out}}\left({\underline{x}}_{\left(k\right)}\,\vert\,h_{\theta}\left(\underline{z}\right)\right)p_{z}\left(\underline{z}\right)\qquad (17)$$

This multidimensional integral cannot be performed analytically in the general case. It is also hard to evaluate numerically as it does not factorize over the dimensions of $\underline{z}$ which are mixed by the neural network hθ. Yet a lower bound on the log-likelihood can be defined by introducing a tractable conditional distribution $q\left(\underline{z}\vert \underline{x}\right)$ that will play the role of an approximation of the intractable posterior distribution ${p}_{\theta }\left(\underline{z}\vert \underline{x}\right)$ implicitly defined by the model:

$$\log\,p_{\theta}\left(\underline{x}\right){\geq}\mathrm{LB}\left(q,\theta,\underline{x}\right)=\mathbb{E}_{q\left(\underline{z}\vert\underline{x}\right)}\left[\log\,p_{\text{out}}\left(\underline{x}\,\vert\,h_{\theta}\left(\underline{z}\right)\right)\right]-\mathrm{KL}\left(q\left(\underline{z}\vert\underline{x}\right)\Vert p_{z}\left(\underline{z}\right)\right)\qquad (18)$$

Maximum likelihood learning is then approached through the maximization of the lower bound $\mathrm{L}\mathrm{B}\left(q,\theta ,\underline{x}\right)$, which requires in practice parametrizing the tractable posterior q = qϕ, typically with a neural network. Using the so-called re-parametrization trick [KW14, RMW14], the gradients of $\mathrm{L}\mathrm{B}\left({q}_{\phi },\theta ,\underline{x}\right)$ with respect to θ and ϕ can be approximated by Monte Carlo sampling, so that the likelihood lower bound can be optimized by gradient ascent.
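A minimal sketch of the re-parametrization trick for a Gaussian qϕ follows; the scalar latent variable and the encoder outputs mu and log_sigma below are placeholder assumptions. The point is that sampling is rewritten as a deterministic, differentiable map of parameter-free noise, so gradients with respect to ϕ can flow through Monte Carlo samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, log_sigma = 0.3, -1.0        # assumed encoder outputs for one input x
eps = rng.standard_normal(1000)  # eps ~ N(0, 1): carries no trainable parameters
z = mu + np.exp(log_sigma) * eps # samples from q_phi(z|x), differentiable in (mu, log_sigma)
# Any expectation under q_phi is now a differentiable average over eps,
# e.g. a Monte Carlo estimate of E_q[z^2]:
print((z ** 2).mean())
```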

Generative adversarial networks. The principle of adversarial training was designed directly for a deep generative model [GPAM+14]. Using a deep neural network to parametrize the discriminator dϕ(⋅) as well as the generator pθ(⋅), it leads to a remarkable quality of produced samples and is now one of the most studied generative models.

Restricted Boltzmann machines. The models described in the preceding paragraphs comprise only feed-forward neural networks, in which the state or value of successive layers is determined following the recursion (7)–(9), in one pass from inputs to outputs. Boltzmann machines instead involve undirected neural networks, which consist of stochastic neurons with symmetric interactions. The probability law associated with a neuron state is a function of the neighboring neurons, themselves reciprocally functions of the first neuron. Sampling a configuration therefore requires an equilibration in place of a simple forward pass.

A restricted Boltzmann machine (RBM) [AHS85, Smo86] with M hidden neurons in practice defines a joint distribution over an input (or visible) layer $\underline{x}\in {\left\{0,1\right\}}^{N}$ and a hidden layer $\underline{t}\in {\left\{0,1\right\}}^{M}$,

$$p_{\theta}\left(\underline{x},\underline{t}\right)=\frac{1}{\mathcal{Z}}\,{\mathrm{e}}^{{\underline{a}}^{\top}\underline{x}+{\underline{b}}^{\top}\underline{t}+{\underline{x}}^{\top}\underline{\underline{W}}\,\underline{t}},\qquad \theta=\left\{\underline{\underline{W}},\underline{a},\underline{b}\right\}\qquad (19)$$

where $\mathcal{Z}$ is the normalization factor, similar to the partition function of statistical physics. The parametric density model over inputs is then the marginal ${p}_{\theta }\left(\underline{x}\right)={\sum }_{\underline{t}\in {\left\{0,1\right\}}^{M}}{p}_{\theta }\left(\underline{x},\underline{t}\right)$. Although seemingly very similar to pairwise Ising models, the introduction of hidden units provides a greater representative power to RBMs as hidden units can mediate interactions between arbitrary groups of input units. Furthermore, they can be generalized to deep Boltzmann machines (DBMs) [SH09], where several hidden layers are stacked on top of each other.

Like VAEs, RBMs can represent sophisticated distributions at the cost of an intractable likelihood. Indeed, the summation over $2^{M+N}$ terms in the partition function cannot be simplified by an analytical trick and is only realistically doable for small models. RBMs are commonly trained through a gradient ascent of the likelihood using approximated gradients. As exact Monte Carlo evaluation is a costly operation that would need to be repeated at each parameter update of the gradient ascent, several more or less sophisticated approximations are preferred: contrastive divergence (CD) [Hin02], its persistent variant (PCD) [Tie08], or even parallel tempering [CRI10, DCB+10].
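For illustration, here is a minimal CD-1 sketch for the binary RBM (19), in which a single Gibbs step provides the negative phase of the approximate likelihood gradient. The random training data, learning rate and sizes are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
N, M, eta = 6, 4, 0.05
W, a, b = 0.01 * rng.standard_normal((N, M)), np.zeros(N), np.zeros(M)
X = rng.integers(0, 2, size=(100, N)).astype(float)  # stand-in training data

for x in X:
    p_t = sigmoid(b + x @ W)                 # positive phase: hidden means
    t = (rng.random(M) < p_t).astype(float)  # one Gibbs step: x -> t -> x' -> t'
    p_x = sigmoid(a + W @ t)
    x_neg = (rng.random(N) < p_x).astype(float)
    p_t_neg = sigmoid(b + x_neg @ W)
    W += eta * (np.outer(x, p_t) - np.outer(x_neg, p_t_neg))  # CD-1 update
    a += eta * (x - x_neg)
    b += eta * (p_t - p_t_neg)
```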

RBMs were the first effective generative models using neural networks. They found applications in various domains including dimensionality reduction [HS06], classification [LB08], collaborative filtering [SMH07], feature learning [CNL11], and topic modeling [HS09]. Used for an unsupervised pre-training of deep neural networks layer by layer [BLPL07, HOT06], they also played a crucial role in the take-off of supervised deep learning.

Open questions and challenges. Generative models involving neural networks, such as VAEs, GANs and RBMs, have great expressive power at the cost of not being amenable to exact treatment. Their training, and sometimes even their sampling, requires approximations. From a practical standpoint, whether these approximations can be made either more accurate or less costly is an open direction of research. Another important related question is the evaluation of the performance of generative models [SBL+18]. To start with, the training objective is very often itself intractable (e.g. the likelihood of a VAE or an RBM), and beyond this objective, the unsupervised setting does not define a priori a test task for the performance of the model. Additionally, unsupervised deep learning inherits some of the theoretical puzzles already discussed in the supervised learning section. In particular, assessing the difficulty of representing a distribution and selecting a sufficient minimal model and/or training data set is an ongoing research effort.

3. Statistical inference and the statistical physics approach

To tackle the open questions and challenges surrounding neural networks mentioned in the previous section, we need to manipulate high-dimensional probability distributions. The generic concept of statistical inference refers to the extraction of useful information from these complicated objects. Statistical physics, with its probabilistic interpretation of natural systems composed of many elementary components, is naturally interested in similar questions. We provide in this section a few concrete examples of inference questions arising in neural networks and make explicit how statistical physics enters the picture. In particular, the theory of disordered systems appears especially relevant here.

3.1. Statistical inference

3.1.1. Some inference questions in neural networks for machine learning

Inference in generative models. Generative models used for unsupervised learning are statistical models defining high-dimensional distributions with complex dependencies. As we have seen in section 2.2, the most common training objective in unsupervised learning is the maximization of the log-likelihood, i.e. the log of the probability assigned by the generative model to the training set ${\left\{{\underline{x}}_{\left(k\right)}\right\}}_{k=1}^{P}$. Computing the probability of observing a given sample ${\underline{x}}_{\left(k\right)}$ is an inference question. It requires marginalizing over all the hidden representations of the problem. For instance, in the RBM (19),

$$p_{\theta}\left({\underline{x}}_{\left(k\right)}\right)=\frac{\sum_{\underline{t}\in{\left\{0,1\right\}}^{M}}{\mathrm{e}}^{{\underline{a}}^{\top}{\underline{x}}_{\left(k\right)}+{\underline{b}}^{\top}\underline{t}+{\underline{x}}_{\left(k\right)}^{\top}\underline{\underline{W}}\,\underline{t}}}{\mathcal{Z}}\qquad (20)$$

While the numerator is easy to evaluate, the partition function has no analytical expression and its exact evaluation requires summing over all possible states of the network.
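To make the combinatorial obstruction concrete, here is a brute-force evaluation of the partition function of a tiny RBM, enumerating all $2^{N+M}$ states; the random parameters are placeholders, and the point is precisely that this only remains feasible at toy sizes:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, M = 4, 3
W = rng.standard_normal((N, M)) / np.sqrt(N)
a, b = rng.standard_normal(N), rng.standard_normal(M)

Z = 0.0
for x in product([0, 1], repeat=N):          # 2^N visible states
    for t in product([0, 1], repeat=M):      # 2^M hidden states
        xv, tv = np.array(x), np.array(t)
        Z += np.exp(a @ xv + b @ tv + xv @ W @ tv)
print(np.log(Z))  # log-partition function, intractable at realistic sizes
```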

Learning as statistical inference: Bayesian inference and the teacher–student scenario. The practical problem of training neural networks from data as introduced in section 2 is not in general interpreted as inference. To do so, one needs to treat the learnable parameters as random variables, which is the case in Bayesian learning. For instance in supervised learning, an underlying prior distribution pθ(θ) for the weights and biases of a neural network (5) and (6) is assumed, so that Bayes rule defines a posterior distribution given the training data $\mathcal{D}$,

$$p\left(\theta\,\vert\,\mathcal{D}\right)=\frac{p\left(\mathcal{D}\,\vert\,\theta\right)\,p_{\theta}\left(\theta\right)}{p\left(\mathcal{D}\right)}\qquad (21)$$

Compared to the single output of risk minimization, we obtain an entire distribution for the learned parameters θ, which takes into account not only the training data but also some knowledge of the structure of the parameters (e.g. sparsity) through the prior. In practice, Bayesian learning and traditional empirical risk minimization may not be so different. On the one hand, the Bayesian posterior distribution is often summarized by a point estimate such as its maximum. On the other hand, risk minimization is often biased toward desired properties of the weights through regularization techniques (e.g. promoting small norms), recalling the role of the Bayesian prior.

However, from a theoretical point of view, Bayesian learning is of particular interest in the teacher–student scenario. The idea here is to consider a toy model of the learning problem where the parameters are effectively drawn from a prior distribution. Let us use as an illustration the case of the supervised learning of the perceptron model (3). We draw a weight vector ${\underline{w}}_{0}$ from a prior distribution pw(⋅), along with a set of P inputs ${\left\{{\underline{x}}_{\left(k\right)}\right\}}_{k=1}^{P}$ i.i.d. from a data distribution px(⋅). Using this teacher perceptron model, we also draw a set of possibly noisy corresponding outputs y(k) from a teacher conditional probability $p\left(.\vert {\underline{w}}_{0}^{\top }{\underline{x}}_{\left(k\right)}\right)$. From the training set of P pairs $\mathcal{D}=\left\{{\underline{x}}_{\left(k\right)},{y}_{\left(k\right)}\right\}$, one can attempt to rediscover the teacher rule by training a student perceptron model. The problem can equivalently be phrased as a reconstruction inference question: can we recover the value of ${\underline{w}}_{0}$ from the observations in $\mathcal{D}$? The Bayesian framework yields a posterior distribution of solutions,

$$p\left(\underline{w}\,\vert\,\underline{\underline{X}},\underline{y}\right)=\frac{p_{w}\left(\underline{w}\right)\prod_{k=1}^{P}p\left(y_{\left(k\right)}\,\vert\,{\underline{w}}^{\top}{\underline{x}}_{\left(k\right)}\right)}{p\left(\underline{y}\,\vert\,\underline{\underline{X}}\right)}\qquad (22)$$

Note that the terminology of teacher–student applies for a generic inference problem of reconstruction: the statistical model used to generate the data along with the realization of the unknown signal is called the teacher; the statistical model assumed to perform the reconstruction of the signal is called the student. When the two models are identical or matched, the inference is Bayes optimal. When the teacher model is not perfectly known, the statistical models can also be different (from slightly differing prior distributions to entirely different models), in which case they are said to be mismatched, and the reconstruction is suboptimal.
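A teacher instance is easily generated. The following sketch uses an assumed Gaussian prior, Gaussian input distribution and a noiseless sign channel, all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 20, 100
w0 = rng.standard_normal(N)        # teacher weights drawn from the prior p_w
X = rng.standard_normal((N, P))    # inputs x_(k) as columns, drawn from p_x
y = np.sign(w0 @ X)                # teacher outputs through a sign channel
# The student's task is to recover w0 from (X, y), e.g. by sampling or
# maximizing the posterior (22); matched priors/channels give Bayes optimality.
print(X.shape, y.shape)
```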

Of course, in practical machine learning applications of neural networks, one only has access to an empirical distribution of the data, and it is unclear whether there should exist a formal rule underlying the input–output mapping. Yet the teacher–student setting is a modeling strategy of learning which offers interesting possibilities of analysis, and we shall refer to numerous works resorting to this setup in section 6.

3.1.2. Answering inference questions

Many inference questions along the lines of those mentioned in the previous section have no tractable exact solution. When no analytical closed form exists, computations of averages and marginals require summing over configurations, whose number typically scales exponentially with the size of the system, becoming astronomically large for high-dimensional models. Hence it is necessary to design approximate inference strategies. These may require an algorithmic implementation but must run in finite (typically polynomial) time. An important cross-fertilization between statistical physics and the information sciences has taken place over the past decades around questions of inference. Two major classes of such algorithms are Markov chain Monte Carlo (MCMC) methods and mean-field methods. The former are nicely reviewed in the context of statistical physics in [Kra06]. The latter will be the focus of this short review, in the context of deep learning.

Note that representations of joint probability distributions through probabilistic graphical models and factor graphs are crucial tools in the design of efficient inference strategies. In appendix B, we quickly introduce for the unfamiliar reader these two formalisms, which enable encoding and exploiting independencies between random variables. As examples, figure 2 presents graphical representations of the RBM measure (19) and of the posterior distribution in the Bayesian learning of the perceptron, as discussed in the previous section.


Figure 2. (a) Undirected probabilistic graphical model (left) and factor graph representation (right). (b) Left: directed graphical model of the generative model for the training data knowing the teacher weight vector ${\underline{w}}_{0}$. Right: factor graph representation of the posterior distribution for the student $p\left(\underline{w}\vert \underline{\underline{X}},\underline{y}\right)$, where the vector $\underline{y}\in {\mathbb{R}}^{P}$ gathers the outputs y(k) and the matrix $\underline{\underline{X}}\in {\mathbb{R}}^{N{\times}P}$ gathers the inputs ${\underline{x}}_{\left(k\right)}$.


3.2. Statistical physics of disordered systems, first appearance on stage

Here we briefly re-introduce fundamental concepts of statistical physics that will help in understanding the connections with inference and the origin of the methods presented in what follows.

The thermodynamic limit. The equilibrium statistics of classical physical systems are described by the Boltzmann distribution. For a system with N degrees of freedom denoted $\underline{x}\in {\mathcal{X}}^{N}$ and an energy functional $E\left(\underline{x}\right)$, we have

$$p\left(\underline{x}\right)=\frac{{\mathrm{e}}^{-\beta E\left(\underline{x}\right)}}{\mathcal{Z}_{N}},\qquad \mathcal{Z}_{N}=\sum_{\underline{x}\in{\mathcal{X}}^{N}}{\mathrm{e}}^{-\beta E\left(\underline{x}\right)}\qquad (23)$$

where we defined the partition function ${\mathcal{Z}}_{N}$ and the inverse temperature β. To characterize the macroscopic state of the system, an important functional is the free energy

$$F_{N}=-\frac{1}{\beta}\log\,\mathcal{Z}_{N}\qquad (24)$$

While the number of available configurations ${\mathcal{X}}^{N}$ grows exponentially with N, considering the thermodynamic limit N → ∞ typically simplifies computations due to concentration effects. Let eN = E/N be the energy per degree of freedom; the partition function can then be re-written as a sum over the configurations of a given energy eN,

$$\mathcal{Z}_{N}=\sum_{e_{N}}{\mathrm{e}}^{-N\beta f_{N}\left(e_{N}\right)},\qquad f_{N}\left(e_{N}\right)=e_{N}-s_{N}\left(e_{N}\right)/\beta\qquad (25)$$

where fN(eN) is the free energy density of states of energy eN and sN(eN) the corresponding entropy density. This rewriting implies that at large N the states whose energy minimizes the free energy density are exponentially more likely than any other states. Provided the following limits exist, the statistics of the system are dominated by these states, and we have the thermodynamic quantities

$$f=\underset{N\to\infty}{\lim}\frac{F_{N}}{N}=\underset{N\to\infty}{\lim}\,\underset{e_{N}}{\min}\,f_{N}\left(e_{N}\right)\qquad (26)$$

The interested reader will also find a more detailed yet friendly presentation of the thermodynamic limit in section 2.4 of [MM09].

Disordered systems. Remarkably, the statistical physics framework can be applied to inhomogeneous systems with quenched disorder. In these systems, interactions are functions of the realization of some random variables. An iconic example is the Sherrington–Kirkpatrick (SK) model [SK75], a fully connected Ising model with random Gaussian couplings $\underline{\underline{J}}=\left({J}_{ij}\right)$, that is, where the Jij are drawn independently from a Gaussian distribution. As a result, the energy functional of disordered systems is itself a function of the random variables. For instance here, the energy of a spin configuration $\underline{x}$ is $E\left(\underline{x};\underline{\underline{J}}\right)=-\frac{1}{2}{\underline{x}}^{\top }\underline{\underline{J}}\,\underline{x}$. In principle, system properties depend on the given realization of the disorder. In our example, the correlation between two spins ${\langle {x}_{i}{x}_{j}\rangle }_{J}$ certainly does. Yet some aggregated properties are expected to be self-averaging in the thermodynamic limit, meaning that they concentrate on their mean with respect to the disorder as the fluctuations are averaged out. This is the case for the free energy, which formally verifies:

$$f=\underset{N\to\infty}{\lim}\frac{F_{N}\left(\underline{\underline{J}}\right)}{N}=\underset{N\to\infty}{\lim}\frac{\mathbb{E}_{\underline{\underline{J}}}\left[F_{N}\left(\underline{\underline{J}}\right)\right]}{N}\qquad (27)$$

(see e.g. [CCFM05, MPV86] for discussions of self-averaging in spin glasses). Thus the typical behavior of complex systems is studied in the statistical physics framework by taking two important conceptual steps: averaging over the realizations of the disorder and considering the thermodynamic limit. These are starting points to design approximate inference methods. Before turning to an introduction to mean-field approximations, we stress the originality of the statistical physics approach to inference.

Statistical physics of inference problems. Statistical inference questions are mapped to statistical physics systems by interpreting general joint probability distributions as Boltzmann distributions (23). Turning back to our simple examples of section 3.1, the RBM is trivially mapped as it directly borrows its definition from statistical physics. We have

$$p_{\theta}\left(\underline{x},\underline{t}\right)=\frac{{\mathrm{e}}^{-\beta E\left(\underline{x},\underline{t}\right)}}{\mathcal{Z}},\qquad E\left(\underline{x},\underline{t}\right)=-{\underline{a}}^{\top}\underline{x}-{\underline{b}}^{\top}\underline{t}-{\underline{x}}^{\top}\underline{\underline{W}}\,\underline{t}\qquad (28)$$

The inverse temperature parameter can be considered either as equal to 1 or as a scaling factor of the weight matrix $\underline{\underline{W}}{\leftarrow}\beta \underline{\underline{W}}$ and bias vectors $\underline{a}{\leftarrow}\beta \underline{a}$ and $\underline{b}{\leftarrow}\beta \underline{b}$. The RBM parameters play the role of the disorder. Here the computational hardness in estimating the log-likelihood comes from the estimation of the log-partition function, which is precisely the free energy. In our second example, the estimation of the student perceptron weight vector, the posterior distribution is mapped to a Boltzmann distribution by setting

$$p\left(\underline{w}\,\vert\,\underline{y},\underline{\underline{X}}\right)=\frac{{\mathrm{e}}^{-\beta E\left(\underline{w};\underline{y},\underline{\underline{X}}\right)}}{\mathcal{Z}},\qquad E\left(\underline{w};\underline{y},\underline{\underline{X}}\right)=-\log\left[p\left(\underline{y}\,\vert\,\underline{\underline{X}},\underline{w}\right)\,p_{w}\left(\underline{w}\right)\right],\qquad \beta=1,\quad \mathcal{Z}=p\left(\underline{y}\,\vert\,\underline{\underline{X}}\right)\qquad (29)$$

The disorder is here materialized by the training data. The difficulty is now to compute $p\left(\underline{y}\vert \underline{\underline{X}}\right)$, which is again the partition function in the Boltzmann distribution mapping. Relying on the thermodynamic limit, mean-field methods will provide asymptotic results. Nevertheless, experience shows that the behavior of large finite-size systems is often well explained by the infinite-size limits.

Also, the application of mean-field inference requires assumptions about the distribution of the disorder, which is averaged over. Practical algorithms for arbitrary cases can be derived with ad hoc assumptions, but studying a precise toy statistical model can also bring interesting insights. The simplest model in most cases is uncorrelated disorder: in the example of the perceptron, this corresponds to random input data points with arbitrary random labels. Yet the teacher–student scenario offers many advantages for little additional difficulty. It allows the creation of data sets with structure (the underlying teacher rule). It also allows a formal analysis of the difficulty of a learning problem and of the performance in its resolution. Intuitively, the definition of a ground-truth teacher rule with a fixed number of degrees of freedom sets the minimum information necessary to extract from the observations, or training data, in order to achieve perfect reconstruction. This is an information-theoretic limit.

Furthermore, the assumption of an underlying statistical model enables the measurement of the performance of different learning algorithms over the corresponding class of problems from an average viewpoint. This is in contrast with the traditional approach of computer science of studying the difficulty of a class of problems based on the worst case. This conservative strategy yields strong guarantees of success, yet it may be overly pessimistic compared with the experience of practitioners. Considering a distribution over the possible problems (a.k.a. different realizations of the disorder), the average performance is sometimes more informative about typical instances than about the worst ones. For deep learning, this approach may prove particularly interesting, as the traditional bounds, based on the VC dimension [Vap00] and Rademacher complexity [BM02, SSBD14], appear extremely loose when compared to practical examples.

Finally, we must emphasize that the derivations presented here are not mathematically rigorous. They are based on 'correct' assumptions that allow the understanding of the problems at hand to be pushed further, while a formal proof of these assumptions is possibly much harder to obtain.

4. Selected overview of mean-field treatments: free energies and algorithms

Mean-field methods are a set of techniques for approximating marginalized quantities of joint probability distributions by exploiting knowledge of the dependencies between random variables. They are usually said to be analytical, as opposed to numerical Monte Carlo methods. In practice, they typically replace a summation exponentially large in the size of the system by an analytical formula involving a set of parameters, themselves solutions of a closed set of non-linear equations. Finding the values of these parameters usually requires only a polynomial number of operations.

In this section, we will give a selected overview of mean-field methods as they were introduced in the statistical physics and/or signal processing literature. A key takeaway of what follows is that closely related results can be obtained from different heuristics of derivation. We will start by deriving the simplest and historically first mean-field method. We will then introduce the important broad techniques of high-temperature expansions, message-passing algorithms and the replica method. In the following section 5, we will additionally cover the most recent extensions of the mean-field methods presented in the present section 4 that are relevant to the study of learning problems.

4.1. Naive mean-field

The naive mean-field method is the first and somehow simplest mean-field approximation. It was introduced by the physicists Curie [Cur95] and Weiss [Wei07] and then adopted by the different communities interested in inference [WJ08].

4.1.1. Variational derivation

The naive mean-field method consists in approximating the joint probability distribution of interest by a fully factorized distribution. Therefore, it ignores correlations between random variables. Among multiple methods of derivation, we present here the variational method: it is the best known method across fields and it readily shows that, for any joint probability distribution interpreted as a Boltzmann distribution, the rather crude naive mean-field approximation yields an upper bound on the free energy. For the purpose of demonstration we consider a Boltzmann machine without hidden units (Ising model) with variables (spins) $\underline{x}=\left({x}_{1},\dots ,\;{x}_{N}\right)\in \mathcal{X}={\left\{0,1\right\}}^{N}$, and energy function

$$E\left(\underline{x}\right)=-\sum_{\left(ij\right)}W_{ij}x_{i}x_{j}-\sum_{i=1}^{N}b_{i}x_{i}=-\frac{1}{2}{\underline{x}}^{\top}\underline{\underline{W}}\,\underline{x}-{\underline{b}}^{\top}\underline{x}\qquad (30)$$

where the notation (ij) stands for pairs of connected spin-variables, and the weight matrix $\underline{\underline{W}}$ is symmetric. The choices for {0, 1} rather than {−1, +1} for the variable values, the notations $\underline{\underline{W}}$ for weights (instead of couplings), $\underline{b}$ for biases (instead of local fields), as well as the vector notation, are leaning toward the machine learning conventions. We denote by ${q}_{\underline{m}}$ a fully factorized distribution on {0,1}N, which is a multivariate Bernoulli distribution parametrized by the mean values $\underline{m}=\left({m}_{1},\dots ,\;{m}_{N}\right)\in {\left[0,1\right]}^{N}$ of the marginals (denoted by ${q}_{{m}_{i}}$):

$$q_{\underline{m}}\left(\underline{x}\right)=\prod_{i=1}^{N}q_{m_{i}}\left(x_{i}\right)=\prod_{i=1}^{N}\left[m_{i}\,\delta\left(x_{i}-1\right)+\left(1-m_{i}\right)\delta\left(x_{i}\right)\right]\qquad (31)$$

We look for the optimal ${q}_{\underline{m}}$ distribution to approximate the Boltzmann distribution $p\left(\underline{x}\right)={\mathrm{e}}^{-\beta E\left(\underline{x}\right)}/\mathcal{Z}$ by minimizing the KL-divergence

$$\mathrm{KL}\left(q_{\underline{m}}\Vert p\right)=\sum_{\underline{x}\in{\mathcal{X}}}q_{\underline{m}}\left(\underline{x}\right)\log\frac{q_{\underline{m}}\left(\underline{x}\right)}{p\left(\underline{x}\right)}\qquad (32)$$

$$=\beta{\langle E\left(\underline{x}\right)\rangle}_{q_{\underline{m}}}-H\left(q_{\underline{m}}\right)+\log\,\mathcal{Z}=\beta\left(G\left(q_{\underline{m}}\right)-F\right)\qquad (33)$$

$$G\left(q_{\underline{m}}\right)=F+\frac{1}{\beta}\mathrm{KL}\left(q_{\underline{m}}\Vert p\right){\geq}F\qquad (34)$$

where the last inequality comes from the positivity of the KL-divergence. For a generic distribution q, G(q) is the Gibbs free energy for the energy $E\left(\underline{x}\right)$,

$$G\left(q\right)=U\left(q\right)-H\left(q\right)/\beta,\qquad U\left(q\right)={\langle E\left(\underline{x}\right)\rangle}_{q},\quad H\left(q\right)=-{\langle \log\,q\left(\underline{x}\right)\rangle}_{q}\qquad (35)$$

involving the average energy U(q) and the entropy H(q). It is greater than the true free energy F except when q = p, in which case they are equal. Note that this fact also means that the Boltzmann distribution minimizes the Gibbs free energy. Restricting to factorized ${q}_{\underline{m}}$ distributions, we obtain the naive mean-field approximations for the mean value of the variables (or magnetizations) and the free energy:

$${\underline{m}}^{\ast}=\underset{\underline{m}\in{\left[0,1\right]}^{N}}{\text{argmin}}\,G\left(q_{\underline{m}}\right)\qquad (36)$$

$$F_{\mathrm{NMF}}=G\left(q_{{\underline{m}}^{\ast}}\right){\geq}F\qquad (37)$$

The choice of a very simple family of distributions ${q}_{\underline{m}}$ limits the quality of the approximation but allows for tractable computations of observables, for instance the two-spin correlations ${\langle {x}_{i}{x}_{j}\rangle }_{{q}^{{\ast}}}={m}_{i}^{{\ast}}{m}_{j}^{{\ast}}$ or the variance of one spin ${\langle {x}_{i}^{2}\rangle }_{{q}^{{\ast}}}-{\langle {x}_{i}\rangle }_{{q}^{{\ast}}}^{2}={m}_{i}^{{\ast}}-{{m}_{i}^{{\ast}}}^{2}$.

In our example of the Boltzmann machine, it is easy to compute the Gibbs free energy for the factorized ansatz; we define the following functions of the magnetization vector:

$$U_{\mathrm{NMF}}\left(\underline{m}\right)=-\frac{1}{2}{\underline{m}}^{\top}\underline{\underline{W}}\,\underline{m}-{\underline{b}}^{\top}\underline{m}\qquad (38)$$

$$H_{\mathrm{NMF}}\left(\underline{m}\right)=-\sum_{i=1}^{N}\left[m_{i}\,\log\,m_{i}+\left(1-m_{i}\right)\log\left(1-m_{i}\right)\right]\qquad (39)$$

$$G_{\mathrm{NMF}}\left(\underline{m}\right)=U_{\mathrm{NMF}}\left(\underline{m}\right)-H_{\mathrm{NMF}}\left(\underline{m}\right)/\beta\qquad (40)$$

Looking for stationary points, we find a closed set of non-linear equations for the ${m}_{i}^{{\ast}}$,

$$m_{i}^{\ast}=\sigma\left(\beta\left(b_{i}+\sum_{j\in\partial i}W_{ij}m_{j}^{\ast}\right)\right)\qquad (41)$$

where $\sigma \left(x\right)={\left(1+{\mathrm{e}}^{-x}\right)}^{-1}$. The solutions can be computed by iterating these relations from a random initialization until a fixed point is reached.
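A minimal sketch of this fixed-point iteration for a random fully connected Boltzmann machine follows; the 1/N scaling of the couplings and the damping of the updates are assumed choices for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
N, beta, damping = 50, 1.0, 0.5
W = rng.standard_normal((N, N)) / N          # weak, infinite-range couplings
W = (W + W.T) / 2.0                          # symmetric interactions
np.fill_diagonal(W, 0.0)
b = rng.standard_normal(N)

m = rng.random(N)                            # random initialization
for _ in range(200):
    m_new = sigmoid(beta * (b + W @ m))      # naive mean-field update (41)
    m = damping * m + (1.0 - damping) * m_new
print(m[:5])                                 # approximate magnetizations
```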

To understand the implication of the restriction to factorized distributions, it is instructive to compare this naive mean-field equation with the exact identity

$${\langle x_{i}\rangle}_{p}={\left\langle \sigma\left(\beta\left(b_{i}+\sum_{j\in\partial i}W_{ij}x_{j}\right)\right)\right\rangle}_{p}\qquad (42)$$

derived in a few lines in appendix C. Under the Boltzmann distribution $p\left(\underline{x}\right)={\mathrm{e}}^{-\beta E\left(\underline{x}\right)}/\mathcal{Z}$, these averages are difficult to compute. The naive mean-field method neglects the fluctuations of the effective field felt by the variable xi, $\sum_{j\in \partial i}{W}_{ij}{x}_{j}$, keeping only its mean $\sum_{j\in \partial i}{W}_{ij}{m}_{j}$. This incidentally justifies the name of mean-field methods.

4.1.2. When does naive mean-field hold true?

The previous derivation shows that the naive mean-field approximation allows the free energy to be bounded. While this bound is expected to be rough in general, the approximation is reliable when the fluctuations of the local effective fields $\sum_{j\in \partial i}{W}_{ij}{x}_{j}$ are small. This may happen in particular in the thermodynamic limit N → ∞ in infinite-range models, that is, when weights or couplings are not only local but distributed over the entire system, or when each variable interacts directly with a non-vanishing fraction of the whole set of variables (e.g. [OS01] section 2). The influence of the rest of the system on one given variable can then be treated as an average background. Provided the couplings are weak enough, the naive mean-field method may even become asymptotically exact. This is the case of the Curie–Weiss model, which is the fully connected version of the model (30) with all Wij = 1/N (see e.g. section 2.5.2 of [MM09]). The sum of weakly dependent variables then concentrates on its mean by the central limit theorem. We stress that this means that for finite-dimensional models (more representative of a physical system, where for instance variables are assumed to be attached to the vertices of a lattice with nearest-neighbor interactions), mean-field methods are expected to be quite poor. By contrast, infinite-range models (interpreted as infinite-dimensional models by physicists) are thus traditionally called mean-field models.

In the next section we will recover the naive mean-field equations through a different method. This alternative derivation will also allow us to compute corrections to the rather crude approximation we just discussed, by taking into account some of the correlations it neglects.

4.2. Thouless Anderson and Palmer equations

The TAP mean-field equations [MH76, TAP77] were originally derived as an exact mean-field theory for the Sherrington–Kirkpatrick (SK) model [SK75]. This emblematic spin glass model, already mentioned above, corresponds to a fully connected Ising model with energy (30) and disordered couplings Wij drawn independently from a Gaussian distribution with zero mean and variance W0/N. The derivation of [TAP77] followed from arguments specific to the SK model. Later, it was shown by Plefka [Ple82] that the same approximation could be recovered from a second-order Taylor expansion at high temperature, and by Georges and Yedidia [GY99] that it could be further corrected by the systematic computation of higher orders. We will briefly present this last derivation, having again in mind the example of the generic Boltzmann machine (30).

4.2.1. Outline of the derivation

Going back to the variational formulation (34), we shall now perform a minimization in two steps. Consider first the family of distributions ${q}_{\underline{m}}$ enforcing ${\langle \underline{x}\rangle }_{{q}_{\underline{m}}}=\underline{m}$ for a fixed vector of magnetizations $\underline{m}$, but without any factorization constraint. The corresponding Gibbs free energy is

$$G\left(q_{\underline{m}}\right)={\langle E\left(\underline{x}\right)\rangle}_{q_{\underline{m}}}-H\left(q_{\underline{m}}\right)/\beta,\qquad {\langle \underline{x}\rangle}_{q_{\underline{m}}}=\underline{m}\qquad (43)$$

A first minimization at fixed $\underline{m}$ over the ${q}_{\underline{m}}$ defines another auxiliary free energy

$$G_{\mathrm{TAP}}\left(\underline{m}\right)=\underset{q_{\underline{m}}}{\min}\,G\left(q_{\underline{m}}\right)\qquad (44)$$

A second minimization over $\underline{m}$ would recover the overall unconstrained minimum of the variational problem (34) which is the exact free energy

$$F=\underset{\underline{m}}{\min}\,G_{\mathrm{TAP}}\left(\underline{m}\right)\qquad (45)$$

Yet the actual value of ${G}_{\text{TAP}}\left(\underline{m}\right)$ turns out to be as complicated to compute as F itself. Fortunately, $\beta {G}_{\text{TAP}}\left(\underline{m}\right)$ can easily be approximated by a Taylor expansion around β = 0, due to interactions vanishing at high temperature, as noticed by Plefka, Georges and Yedidia [GY99, Ple82]. After expanding, the minimization of ${G}_{\text{TAP}}\left(\underline{m}\right)$ over $\underline{m}$ yields a set of self-consistent equations on the magnetizations $\underline{m}$, called the TAP equations, reminiscent of the naive mean-field equations (41). Here again, the consistency equations are typically solved by iteration. Plugging the solutions ${\underline{m}}^{{\ast}}$ back into the expanded expression yields the TAP free energy ${F}_{\text{TAP}}={G}_{\text{TAP}}\left({\underline{m}}^{{\ast}}\right)$. Note that ultimately the approximation lies in the truncation of the expansion. At first order, the naive mean-field approximation is recovered. Historically, the expansion was first stopped at the second order. This choice was model dependent: it results from the fact that the mean-field theory is already exact at the second order for the SK model [MH76, Ple82, TAP77].

4.2.2. Illustration on binary Boltzmann machines and important remarks

For the Boltzmann machine (30), the TAP equations and TAP free energy (truncated at second order) are [TAP77],

$$m_{i}^{\ast}=\sigma\left(\beta b_{i}+\beta\sum_{j\in\partial i}W_{ij}m_{j}^{\ast}-\beta^{2}\sum_{j\in\partial i}W_{ij}^{2}\left(m_{i}^{\ast}-\frac{1}{2}\right)\left(m_{j}^{\ast}-{m_{j}^{\ast}}^{2}\right)\right)\qquad (46)$$

$$\beta F_{\mathrm{TAP}}=-\beta\sum_{\left(ij\right)}W_{ij}m_{i}^{\ast}m_{j}^{\ast}-\beta\sum_{i=1}^{N}b_{i}m_{i}^{\ast}-H_{\mathrm{NMF}}\left({\underline{m}}^{\ast}\right)-\frac{\beta^{2}}{2}\sum_{\left(ij\right)}W_{ij}^{2}\left(m_{i}^{\ast}-{m_{i}^{\ast}}^{2}\right)\left(m_{j}^{\ast}-{m_{j}^{\ast}}^{2}\right)\qquad (47)$$

where the naive mean-field entropy HNMF was defined in (39). For this model, albeit with {+1, −1} variables instead of {0, 1}, several references pedagogically present the details of the derivation sketched in the previous paragraph. The interested reader should check in particular [OS01, Zam10]. We also present a more general derivation in appendix D, see section 4.2.3.

Onsager reaction term. Compared to the naive mean-field approximation, the TAP equations include a correction to the effective field called the Onsager reaction term. The idea is that, in the effective field at variable i, we should consider corrected magnetizations of the neighboring spins j ∈ ∂i, corresponding to the absence of variable i. This intuition echoes two other derivations of the TAP approximation: the cavity method [MPV86], which will not be covered here, and the message passing, which will be discussed in the next section.

As far as the SK model is concerned, this second order correction is enough in the thermodynamic limit as the statistics of the weights imply that higher orders will typically be subleading. Yet in general, the correct TAP equations for a given model will depend on the statistics of interactions and there is no guarantee that there exists a finite order of truncation leading to an exact mean-field theory. In section 5.2 we will discuss models beyond SK where a conjectured exact TAP approximation can be derived.

Single instance. Although the selection of the correct TAP approximation relies on the statistics of the weights, the derivation of the expansion outlined above does not require averaging over them, i.e. it does not require an average over the disorder. Consequently, the approximation method is well defined for a single instance of the random disordered model, and the TAP free energy and magnetizations can be computed for a given (realization of the) set of weights ${\left\{{W}_{ij}\right\}}_{\left(ij\right)}$, as explained in the following paragraph. In other words, the approximation can be used to design practical inference algorithms for finite-sized problems, and not only for theoretical predictions on average over the disordered class of models. Crucially, these algorithms may provide approximations of disorder-dependent observables, such as correlations, and not only of self-averaging quantities.

Finding solutions. The self-consistent equations on the magnetizations (46) are usually solved by turning them into an iteration scheme and looking for fixed points. This generic recipe leaves nonetheless room for interpretation: which exact form should be iterated? How should the updates for the different equations be scheduled? Which time indexing should be used? While the following scheme may seem natural

$$m_{i}^{t+1}=\sigma\left(\beta b_{i}+\beta\sum_{j\in\partial i}W_{ij}m_{j}^{t}-\beta^{2}\sum_{j\in\partial i}W_{ij}^{2}\left(m_{i}^{t}-\frac{1}{2}\right)\left(m_{j}^{t}-{\left(m_{j}^{t}\right)}^{2}\right)\right)\qquad (48)$$

it typically has more convergence issues than the following alternative scheme including the time index t − 1

$$m_{i}^{t+1}=\sigma\left(\beta b_{i}+\beta\sum_{j\in\partial i}W_{ij}m_{j}^{t}-\beta^{2}\sum_{j\in\partial i}W_{ij}^{2}\left(m_{i}^{t-1}-\frac{1}{2}\right)\left(m_{j}^{t}-{\left(m_{j}^{t}\right)}^{2}\right)\right)\qquad (49)$$

This issue was discussed in particular in [Bol14, Kab03]. Remarkably, this last scheme, or algorithm, is actually the one obtained by the approximate message passing derivation that will be discussed in the upcoming section 4.3.
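For illustration, here is a sketch of the retarded scheme (49) for the TAP equations (46), with assumed SK-like random couplings of variance 1/N:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
N, beta = 50, 1.0
W = rng.standard_normal((N, N)) / np.sqrt(N)  # SK-like couplings
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
b = rng.standard_normal(N)
W2 = W ** 2

def onsager(m_now, m_prev):
    """Onsager reaction term of (46), with the t-1 index of scheme (49)."""
    return (W2 @ (m_now - m_now ** 2)) * (m_prev - 0.5)

m_prev = m = rng.random(N)
for _ in range(100):
    m_new = sigmoid(beta * (b + W @ m) - beta ** 2 * onsager(m, m_prev))
    m_prev, m = m, m_new                      # scheme (49) time indexing
print(m[:5])
```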

Solutions of the TAP equations. The TAP equations can admit multiple solutions with either equal or different TAP free energies. While the true free energy F corresponds to the minimum of the Gibbs free energy, reached for the Boltzmann distribution, the TAP derivation consists in performing an effectively unconstrained minimization in two steps, with an approximation through a Taylor expansion in between. The truncation of the expansion therefore breaks the correspondence between the discovered minimizer and the unique Boltzmann distribution, hence the possible multiplicity of solutions. For the SK model, for instance, the number of solutions of the TAP equations increases rapidly as β grows [MPV86]. While the different solutions can be accessed using different initializations of the iterative scheme, in phases where they are numerous it is notably hard to find all the TAP solutions exhaustively. In theory, they should be weighted according to their free energy density and averaged to recover the thermodynamics predicted by the replica computation [DY83], another mean-field approximation discussed in section 4.4.

4.2.3. Generalizing the Georges–Yedidia expansion

In the derivation outlined above for binary variables, xi = 0 or 1, the mean of each variable mi was fixed. This is enough to parametrize the corresponding marginal distribution ${q}_{{m}_{i}}\left({x}_{i}\right)$. Yet the expansion can actually be generalized to Potts variables (taking multiple discrete values) or even real-valued variables by introducing appropriate parameters for the marginals. A general derivation fixing arbitrary real-valued marginal distributions was proposed in appendix B of [LKZ17] for the problem of low-rank matrix factorization. Alternatively, another level of approximation can be introduced for real-valued variables by restricting the set of marginal distributions tested to a parametrized family of distributions. By choosing a Gaussian parametrization, one recovers TAP equations equivalent to the approximate message passing algorithm that will be discussed in the next section. In appendix D, we present a derivation for real-valued Boltzmann machines with a Gaussian parametrization, as proposed in [TGM+18].

4.3. Belief propagation and approximate message passing

Another route to rediscover the TAP equations is through the approximation of message passing algorithms. Variations of the latter were discovered multiple times in different fields. In physics they were written in a restricted version as early as 1935 by Bethe [Bet35]. In statistics, they were developed by Pearl as methods for probabilistic inference [Pea88]. In this section we will start by introducing a case-study of interest, the generalized linear model. We will then proceed by steps to outline the derivation of the approximate message passing (AMP) algorithm from the belief propagation (BP) equations.

4.3.1. Generalized linear model

Definition. We introduce the generalized linear model (GLM), a fairly simple model to illustrate message passing algorithms which is also an elementary building block for a large range of interesting inference questions on neural networks. It falls under the teacher–student set up: a student model is used to reconstruct a signal from the indirect observations produced by a teacher model. In the GLM, the product of an unknown signal ${\underline{x}}_{0}\in {\mathbb{R}}^{N}$ and a known weight matrix $\underline{\underline{W}}\in {\mathbb{R}}^{M{\times}N}$ is observed as $\underline{y}$ through a noisy channel pout,0,

Equation (50)

The probabilistic graphical model corresponding to this teacher is represented in figure 3. The prior over the signal ${p}_{{x}_{0}}$ is supposed to be factorized, and the channel pout,0 likewise. The inference problem is to produce an estimator $\hat{\underline{x}}$ for the unknown signal ${\underline{x}}_{0}$ from the observations $\underline{y}$. Given the prior px and the channel pout of the student, not necessarily matching the teacher, the posterior distribution is

Equation (51)

Equation (52)

represented as a factor graph also in figure 3. The difficulty of the reconstruction task of ${\underline{x}}_{0}$ from $\underline{y}$ is controlled by the measurement ratio α = M/N and the amplitude of the noise possibly present in the channel.
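For reference, under these factorization assumptions the posterior (51) and (52) has the schematic form (a sketch; $\mathcal{Z}$ denotes the normalizing partition function and ${\underline{w}}_{\mu}$ the μth row of $\underline{\underline{W}}$):

$$p\left(\underline{x}\,\vert\, \underline{y}\right) = \frac{1}{\mathcal{Z}}\,\prod_{i=1}^{N}{p}_{x}\left({x}_{i}\right)\,\prod_{\mu =1}^{M}{p}_{\mathrm{out}}\left({y}_{\mu }\,\vert\, {\underline{w}}_{\mu }^{\top }\underline{x}\right).$$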


Figure 3. Graphical representations of the generalized linear model. Left: probabilistic graphical model of the teacher. Middle left: factor graph representation of the posterior distribution on the signal $\underline{x}$ under the student statistical model. Middle right and right: belief propagation updates (67) and (68) for approximate inference.


Applications. The generic GLM underlies a number of applications. In the context of neural networks, of particular interest in this technical review, the channel pout generating observations $\underline{y}\in {\mathbb{R}}^{M}$ can equivalently be seen as a stochastic activation function $g\left(\cdot ;\underline{{\epsilon}}\right)$ incorporating a noise $\underline{{\epsilon}}\in {\mathbb{R}}^{M}$ component-wise into the output,

Equation (53)

The inference of the teacher signal in a GLM then has two possible interpretations. On the one hand, it can be interpreted as the reconstruction of the input $\underline{x}$ of a stochastic single-layer neural network from its output $\underline{y}$. For example, this inference problem can arise in the maximum likelihood training of a one-layer VAE (see the corresponding paragraph in section 2.2). On the other hand, the same question can also correspond to the Bayesian learning of a single-layer neural network with a single output—the perceptron—where this time $\left\{\underline{\underline{W}},\underline{y}\right\}$ are interpreted as the collection of training input–output pairs and ${\underline{x}}_{0}$ plays the role of the unknown weight vector of the teacher (as cited as an example in section 3.1.1). However, note that one of the most important applications of the GLM, compressed sensing (CS) [Don06], does not involve neural networks.

Statistical physics treatment, random weights and scaling. From the statistical physics perspective, the effective energy functional is read from the posterior (52) seen as a Boltzmann distribution with energy

Equation (54)

The inverse temperature β has here no formal equivalent and can be thought of as being equal to 1. The energy is a function of the random realizations of $\underline{\underline{W}}$ and $\underline{y}$, playing the role of the disorder. Furthermore, the validity of the approximations presented below requires additional assumptions. Crucially, the weight matrix is assumed to have i.i.d. Gaussian entries with zero mean and variance 1/N, much like in the SK model. The prior of the signal is chosen so as to ensure that the xis (and consequently the yμs) remain of order 1. Finally, the thermodynamic limit N → +∞ is taken for a fixed measurement ratio α = M/N.

4.3.2. Belief propagation

Recall that inference in high-dimensional problems consists in marginalizations over complex joint distributions, typically with a view to computing partition functions, averages or marginal probabilities for sampling. Belief propagation (BP) is an inference algorithm, sometimes exact and sometimes approximate as we will see, leveraging the known factorization of a distribution, which encodes the precious information of (in)dependencies between the random variables of the joint distribution. For a generic joint probability distribution p over $\underline{x}\in {\mathbb{R}}^{N}$ factorized as

Equation (55)

the ψμ are called potential functions, taking as arguments the variables xis involved in the factor μ, shortened as ${\underline{x}}_{\partial \mu }$.

Definition of messages. Let us first write the BP equations and then explain the origin of these definitions. The underlying factor graph of (55) has N nodes carrying the variables xis and M factors associated with the potential functions ψμs (see appendix B for a quick reminder). BP acts on message variables which are tied to the edges of the factor graph. Specifically, the sum-product version of the algorithm (as opposed to the max-sum, see e.g. [MM09]) consists in the update equations

Equation (56)

Equation (57)

where again the is index the variable nodes and the μs index the factor nodes. The notation ∂μ\i designates the set of neighbor variables of the factor μ except the variable i (and reciprocally for ∂i\μ). The partition functions ${\mathcal{Z}}_{i\to \mu }$ and ${\mathcal{Z}}_{\mu \to i}$ are normalization factors ensuring that the messages can be interpreted as probabilities.
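In standard notation, and leaving the time indexing implicit, the updates (56) and (57) read (a reminder, written for continuous variables; integrals become sums in the discrete case):

$$\tilde{m}_{\mu\to i}\left(x_i\right) = \frac{1}{\mathcal{Z}_{\mu\to i}} \int \prod_{j\in\partial\mu{\setminus}i} \mathrm{d}x_j \; \psi_{\mu}\left(\underline{x}_{\partial\mu}\right) \prod_{j\in\partial\mu{\setminus}i} m_{j\to\mu}\left(x_j\right),$$

$$m_{i\to\mu}\left(x_i\right) = \frac{1}{\mathcal{Z}_{i\to\mu}} \prod_{\nu\in\partial i{\setminus}\mu} \tilde{m}_{\nu\to i}\left(x_i\right).$$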

For acyclic (or tree-like) factor graphs, the BP updates are guaranteed to converge to a fixed point, that is a set of time-independent messages $\left\{{m}_{i\to \mu },{\tilde {m}}_{\mu \to i}\right\}$ solving the system of equations (56) and (57). Starting at a leaf of the tree, these messages communicate beliefs of a given node variable taking a given value based on the nodes and factors already visited along the tree. More precisely, ${\tilde {m}}_{\mu \to i}\left({x}_{i}\right)$ is the marginal probability of xi in the modified factor graph where all the factors in ∂i except μ have been removed, and miμ(xi) is equal to the marginal probability of xi in the modified factor graph where the factor μ has been removed, see figure 4.


Figure 4. Representations of the neighborhood of edge iμ in the factor graph and corresponding local BP updates. Factors are represented as squares and variable nodes as circles. Left: in the factor graph where all factors around xi are removed (in gray) except for the factor μ, the marginal of xi (in red) is updated from the messages incoming at factor μ (in blue) (56). Right: in the factor graph where factor μ is deleted (in gray), the marginal of xi (in blue) is updated with the incoming messages (in red) from the rest of the factors (57).


Thus, at convergence of the iterations, the marginals can be computed as

Equation (58)

which can be seen as the main output of the BP algorithm. These marginals are only exact on trees, where incoming messages, computed from different parts of the graph, are independent. Nonetheless, the algorithm (56) and (57), then occasionally called loopy BP, can sometimes be converged on graphs with cycles and will in some cases still provide high quality approximations. For instance, graphs with no short loops are locally tree-like and BP is an efficient method of approximate inference, provided correlations decay with distance (i.e. incoming messages at each node are still effectively independent). BP will also turn out to be principled for some of the infinite-range mean-field models previously discussed; an example being our case-study, the GLM, treated below. While this is the only example that will be discussed here in the interest of conciseness, getting fluent in BP generally requires working through more than one application. The interested reader can also consult [YFW02] and [MM09] section 14.1 for simple concrete examples.
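To make the message updates concrete, here is a minimal, self-contained sum-product implementation on a toy tree factor graph (a three-spin chain with a field on the first spin; the model, its parameters and all variable names are illustrative choices of ours, not taken from the text), checked against brute-force enumeration:

import numpy as np
from itertools import product

# Chain x0 - x1 - x2 of +/-1 spins: pairwise factors exp(beta*J*x_i*x_j)
# and a unary field factor exp(beta*h*x0). A chain is a tree, so BP is exact.
beta, J, h = 0.7, 1.0, 0.5
states = np.array([-1.0, 1.0])
pair = np.exp(beta * J * np.outer(states, states))   # psi(x_i, x_j), symmetric
field = np.exp(beta * h * states)                    # psi(x_0)

var_factors = {0: ['f0', 'f01'], 1: ['f01', 'f12'], 2: ['f12']}
factor_vars = {'f0': (0,), 'f01': (0, 1), 'f12': (1, 2)}

# One message per edge and direction, initialized uniform.
v2f = {(i, f): np.ones(2) / 2 for i in var_factors for f in var_factors[i]}
f2v = {k: np.ones(2) / 2 for k in v2f}

for _ in range(5):                                   # a few sweeps suffice on a tree
    for (i, f) in v2f:                               # variable-to-factor update, cf. (57)
        msg = np.ones(2)
        for g in var_factors[i]:
            if g != f:
                msg = msg * f2v[(i, g)]
        v2f[(i, f)] = msg / msg.sum()
    for (i, f) in f2v:                               # factor-to-variable update, cf. (56)
        if factor_vars[f] == (i,):                   # unary factor
            msg = field.copy()
        else:
            j = [v for v in factor_vars[f] if v != i][0]
            msg = pair @ v2f[(j, f)]
        f2v[(i, f)] = msg / msg.sum()

for i in range(3):                                   # marginals (58) vs exact enumeration
    b = np.ones(2)
    for f in var_factors[i]:
        b = b * f2v[(i, f)]
    b = b / b.sum()
    exact = np.zeros(2)
    for x in product([0, 1], repeat=3):
        exact[x[i]] += field[x[0]] * pair[x[0], x[1]] * pair[x[1], x[2]]
    exact = exact / exact.sum()
    print(f"x{i}: BP {b.round(4)}, exact {exact.round(4)}")

On this tree the two sets of marginals coincide up to numerical precision, as expected.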

The Bethe free energy. The BP algorithm can also be recovered from a variational argument. Let us consider both the single variable marginals mi(xi) and the marginals of the neighborhood of each factor ${\tilde {m}}_{\mu }\left({\underline{x}}_{\partial \mu }\right)$. On tree graphs, the joint distribution (55) can be re-expressed as

Equation (59)

where ni is the number of neighbor factors of the ith variable. Abusively, we can use this form as an ansatz for loopy graphs and plug it into the Gibbs free energy to derive an approximation of the free energy, similarly to the naive mean-field derivation of section 4.1. This time the variational parameters are the distributions mi and ${\tilde {m}}_{\mu }$ (see e.g. [MM09, YFW02] for additional details). The corresponding functional form of the Gibbs free energy is called the Bethe free energy:

Equation (60)

where H(q) is the entropy of the distribution q. Optimization of the Bethe free energy with respect to its arguments under the constraint of consistency

Equation (61)

involves Lagrange multipliers which can be shown to be related to the messages defined in (56) and (57). Eventually, one can verify that marginals defined as (58) and

Equation (62)

are stationary points of the Bethe free energy for messages that are BP solutions. In other words, the BP fixed points are consistent with the stationary points of the Bethe free energy. Using the normalizing constants of the messages, the Bethe free energy can also be re-written as

Equation (63)

with

Equation (64)

Equation (65)

Equation (66)

As for the marginals, the Bethe free energy is only exact if the underlying factor graph is a tree. Otherwise it is an approximation of the free energy which, moreover, is not generally an upper bound.
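For reference, plugging the tree ansatz (59) into the Gibbs free energy yields (60) in the common form (a reminder, at β = 1; conventions may differ):

$$F_{\mathrm{Bethe}}\left[\left\{m_i\right\},\left\{\tilde m_\mu\right\}\right] = -\sum_{\mu=1}^{M}\left(\mathbb{E}_{\tilde m_\mu}\left[\mathrm{log}\,\psi_\mu\left(\underline{x}_{\partial\mu}\right)\right] + H\left(\tilde m_\mu\right)\right) + \sum_{i=1}^{N}\left(n_i - 1\right) H\left(m_i\right),$$

which makes explicit the entropy counting correcting for the variables shared between factors.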

Belief propagation for the GLM. The writing of the BP equations for our case-study is schematized on the right of figure 3. There are 2 × N × M updates:

Equation (67)

Equation (68)

for all iμ pairs. Despite this relatively concise formulation, running BP in practice turns out to be intractable: for a signal $\underline{x}$ taking continuous values, it would entail keeping track of entire distributions over continuous variables. In this case, BP is approximated by the (G)AMP algorithm presented in the next section.

4.3.3. (Generalized) approximate message passing

The name approximate message passing (AMP) was coined by Donoho et al [DMM09], who derived the algorithm in the context of compressed sensing. Several works from statistical physics had nevertheless already proposed related algorithmic procedures and made connections with the TAP equations for different systems [Kab03, KS98, OS01]. The algorithm was derived systematically for any channel of the GLM by Rangan [Ran11] and became generalized-AMP (GAMP), yet again it seems that [KU04] proposed the first generalized derivation.

The systematic procedure to write AMP for a given joint probability distribution consists in first writing BP on the factor graph, second projecting the messages onto a parametrized family of functions to obtain the corresponding relaxed-BP, and third closing the equations on a reduced set of parameters by keeping only the leading terms in the thermodynamic limit. We will quickly review and justify these steps for the GLM. Note that here a relevant algorithm for approximate inference will be derived from message passing on a fully connected graph of interactions. As it turns out, the high connectivity limit and the introduction of short loops do not break the assumption of independence of incoming messages in this specific case, thanks to the small $O\left(1/\sqrt{N}\right)$ scale of the interactions and the independence of the weight entries. The statistics of the weights are here crucial.

Relaxed belief propagation. In the thermodynamic limit M, N → +∞, one can show that the $1/\sqrt{N}$ scaling of the Wij and the extensive connectivity of the underlying factor graph imply that messages are approximately Gaussian. Without giving all the details of the computation, which can be cumbersome, let us try to provide some intuition. We drop the time indices for simplicity and start with (67). Consider the intermediate reconstruction variable ${z}_{\mu }={\underline{w}}_{\mu }^{\top }\underline{x}={\sum }_{{i}^{\prime }\ne i}{W}_{\mu {i}^{\prime }}{x}_{{i}^{\prime }}+{W}_{\mu i}{x}_{i}$. Under the statistics of the messages mi'→μ(xi'), the xi' are independent, such that by the central limit theorem zμ − Wμixi is a Gaussian random variable with mean and variance respectively

Equation (69)

Equation (70)

where we defined the mean and the variance of the messages mi'→μ(xi'),

Equation (71)

Equation (72)

Using these new definitions, (67) can be rewritten as

Equation (73)

where the notation ∝ omits the normalization factor for distributions. Considering that Wμi is of order $1/\sqrt{N}$, the development of (73) shows that at leading order ${\tilde {m}}_{\mu \to i}\left({x}_{i}\right)$ is Gaussian:

Equation (74)

where the details of the computations yield

Equation (75)

Equation (76)

using the output update functions

Equation (77)

Equation (78)

Equation (79)
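In common conventions, these output update functions are generated by a channel partition function (a reminder; the precise normalizations hidden in (77)–(79) may differ):

$$\mathcal{Z}_{\mathrm{out}}\left(y,\omega,V\right) = \int \mathrm{d}z\; p_{\mathrm{out}}\left(y\vert z\right)\, \mathcal{N}\left(z;\omega,V\right), \qquad g_{\mathrm{out}}\left(y,\omega,V\right) = {\partial}_{\omega}\,\mathrm{log}\,\mathcal{Z}_{\mathrm{out}}\left(y,\omega,V\right),$$

with the associated variance-like quantity given by the derivative ${\partial}_{\omega}\,g_{\mathrm{out}}$.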

These arguably complicated functions, again coming out of the development of (73), can be interpreted as estimating the mean and the variance of the gap between two different estimates of zμ considered by the algorithm: the mean estimate ωμi given the incoming messages mi'→μ(xi'), and the same mean estimate updated to incorporate the information coming from the channel pout and the observation yμ. Finally, the Gaussian parametrization (74) of ${\tilde {m}}_{\mu \to i}\left({x}_{i}\right)$ serves to rewrite the other type of messages miμ(xi) (68),

Equation (80)

with

Equation (81)

Equation (82)

The set of equations can finally be closed by recalling the definitions (71) and (72):

Equation (83)

Equation (84)

with now the input update functions

Equation (85)

Equation (86)

Equation (87)

The input update functions can be interpreted as updating the estimate of the mean and variance of the signal xi based on the information carried by the incoming messages, gathered in λiμ and σiμ, combined with the prior px.
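Analogously, writing them here as $f_1^x$ and $f_2^x$ (our notation; the exact definitions sit in (85)–(87)), the input update functions derive from the partition function of the prior confronted with a Gaussian pseudo-measurement:

$$\mathcal{Z}_{x}\left(\lambda,\sigma\right) = \int \mathrm{d}x\; p_x\left(x\right)\,\mathcal{N}\left(x;\lambda,\sigma\right), \qquad f_1^x\left(\lambda,\sigma\right) = \mathbb{E}\left[x\right], \quad f_2^x\left(\lambda,\sigma\right) = \mathrm{Var}\left[x\right],$$

where the mean and variance are taken under the tilted measure $p_x\left(x\right)\mathcal{N}\left(x;\lambda,\sigma\right)/\mathcal{Z}_{x}$.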

To sum up, by considering the leading order terms in the thermodynamic limit, the BP equations can be self-consistently re-written as a closed set of equations over mean and variance variables (69), (70), (75), (76), (81)–(84). Eventually, r-BP can equivalently be thought of as the projection of BP onto the following parametrizations of the messages

Equation (88)

Equation (89)

Note that, at convergence, an approximation of the marginals is recovered from the projection on the parametrization (89) of (58),

Equation (90)

Equation (91)

Equation (92)

Nonetheless, r-BP is seldom used as such, as the computational cost can be readily reduced with little further approximation. Because the parameters in (88) and (89) take the form of messages on the edges of the factor graph, there are still O(M × N) quantities to track to solve the self-consistency equations by iteration. Yet, in the thermodynamic limit, the messages are closely related to the marginals, as the contribution of the missing message between (57) and (58) is to a certain extent negligible. Careful bookkeeping of the order of contributions of these small differences leads to a set of closed equations on the parameters of the marginals, i.e. O(N) variables, corresponding to the GAMP algorithm.

A detailed derivation and developed algorithm of r-BP for the GLM can be found for example in [ZK16] (section 6.3.1). In section 5.3 of the present paper, we also present the derivation in a slightly more general setting where the variables xi and yμ are possibly vectors instead of scalars.

Generalized approximate message passing. The GAMP algorithm with respect to marginal parameters, analogous to the message parameters introduced above (summarized in (88) and (89)), is given in algorithm 1. The origin of GAMP is again the development of the r-BP message-like equations around marginal quantities. The details of this derivation for the GLM can be found for instance in [ZK16] (section 6.3.1). For a random initialization, the algorithm can be decomposed into four steps per iteration which refine the estimates of the signal $\underline{x}$ and the intermediary variable $\underline{z}$ by incorporating the different sources of information. Steps (2) and (4) involve the update functions relative to the prior and output channel defined above. Steps (1) and (3) are generic for any GLM with a random Gaussian weight matrix, as they result from the consistency of the two alternative parametrizations introduced for the same messages in (88) and (89).

Algorithm 1. Generalized approximate message passing

Input: vector $\underline{y}\in {\mathbb{R}}^{M}$ and matrix $\underline{\underline{W}}\in {\mathbb{R}}^{M{\times}N}$:
Initialize: ${\hat{x}}_{i}$, ${C}_{i}^{x}\quad \forall i$ and ${{g}_{\mathrm{o}\mathrm{u}\mathrm{t}}}_{\mu }$, ${{\partial }_{\omega }{g}_{\mathrm{o}\mathrm{u}\mathrm{t}}}_{\mu }\quad \forall \mu $
repeat
(1) Estimate mean ${\omega }_{\mu }^{\left(t\right)}$ and variance ${V}_{\mu }^{\left(t\right)}$ of zμ given current ${\hat{\underline{x}}}^{\left(t\right)}$
Equation (93)
Equation (94)
(2) Estimate mean ${{g}_{\mathrm{o}\mathrm{u}\mathrm{t}}}_{\mu }^{\left(t\right)}$ and variance ${{\partial }_{\omega }{g}_{\mathrm{o}\mathrm{u}\mathrm{t}}}_{\mu }^{\left(t\right)}$ of the gap between optimal zμ and ${\omega }_{\mu }^{\left(t\right)}$ given yμ
Equation (95)
Equation (96)
(3) Estimate mean $\underline{\lambda }$ and variance $\underline{\sigma }$ of $\underline{x}$ given current gap between optimal $\underline{z}$ and $\underline{\omega }$
Equation (97)
Equation (98)
(4) Estimate mean ${\hat{\underline{x}}}^{\left(t+1\right)}$ and variance ${{\underline{C}}^{x}}^{\left(t+1\right)}$ of $\underline{x}$ updated with information from the prior
Equation (99)
Equation (100)
until convergence
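To illustrate algorithm 1, here is a minimal runnable sketch for the matched Gauss–Gauss setting (standard normal prior and additive Gaussian noise channel), for which the update functions have simple closed forms; the sizes, noise level and variable names are our illustrative choices, not prescriptions from the text:

import numpy as np

rng = np.random.default_rng(0)
N, alpha, Delta = 500, 2.0, 0.01              # illustrative sizes and noise level
M = int(alpha * N)
W = rng.normal(0.0, 1.0 / np.sqrt(N), (M, N)) # i.i.d. Gaussian entries, variance 1/N
x0 = rng.normal(size=N)                       # teacher signal drawn from the prior N(0,1)
y = W @ x0 + np.sqrt(Delta) * rng.normal(size=M)

# Closed-form update functions for this setting:
#   channel p_out(y|z) = N(y; z, Delta):  g_out = (y - w)/(Delta + V),
#                                         d g_out / d w = -1/(Delta + V)
#   prior  p_x = N(0, 1):                 f_1 = lam/(1 + sig), f_2 = sig/(1 + sig)
xhat, Cx, g = np.zeros(N), np.ones(N), np.zeros(M)
W2 = W ** 2
for t in range(30):
    V = W2 @ Cx                               # step (1): variance of z estimate
    w = W @ xhat - V * g                      #           mean, with Onsager term
    g = (y - w) / (Delta + V)                 # step (2): output channel update
    dg = -1.0 / (Delta + V)
    sig = 1.0 / (-(W2.T @ dg))                # step (3): pseudo-prior variance
    lam = xhat + sig * (W.T @ g)              #           pseudo-prior mean
    xhat = lam / (1.0 + sig)                  # step (4): prior update of the estimate
    Cx = sig / (1.0 + sig)
    print(t, float(np.mean((xhat - x0) ** 2)))  # reconstruction MSE along iterations

Note the time indexing: the g entering the Onsager correction in step (1) is the one computed at the previous iteration, in line with the discussion of the scheme (49).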

Relation to TAP equations. Historically, the main difference between the AMP algorithm and the TAP equations is that the latter were first derived for binary variables with two-body interactions (SK model) while the former was proposed for continuous random variables with N-body interactions (compressed sensing). The details of the derivation (described in [ZK16], or in a more general case in section 5.3) rely on the knowledge of the statistics of the disordered variable $\underline{\underline{W}}$ but, as in the Georges–Yedidia expansion yielding the TAP equations, do not require an average over the disorder. By focusing on the GLM with a random Gaussian weight matrix scaling as $O\left(1/\sqrt{N}\right)$ (similarly to the couplings of the SK model) we naturally obtained TAP equations at second order, with an Onsager term in the update (94) of ωμ. Yet an advantage of the AMP derivation from BP over the high-temperature expansion is that it explicitly provides the 'correct' time indices for the iteration scheme solving the self-consistent equations [Bol14].

Reconstruction with AMP. AMP is therefore a practical reconstruction algorithm which can be run on a single instance (the disorder is not averaged) to estimate an unknown signal ${\underline{x}}_{0}$. Note that the prior px and channel pout used in the algorithm correspond to the student statistical model and they may be different from the true underlying teacher model that generates ${\underline{x}}_{0}$ and $\underline{y}$. In other words, the AMP algorithm may be used either in the Bayes optimal or in the mismatched setting defined in section 3.1.1. Remarkably, it is also possible to consider a disorder average in the thermodynamic limit to study the average case computational hardness, here of the GLM inference problem, in either of these matched or mismatched configurations.

State evolution. The statistical analysis of the AMP equations for compressed sensing in the average case and in the thermodynamic limit N → +∞ leads to another closed set of equations, called state evolution (SE) in [DMM09]. Such an analysis can be generalized to other problems of application of approximate message passing algorithms. The derivation of SE starts from the r-BP equations and relies on the assumption of independent incoming messages to invoke the central limit theorem. It is therefore only necessary to follow the evolution of a set of means and variances parametrizing Gaussian distributions. When the different variables and factors are statistically equivalent, as is the case for the GLM, SE reduces to a few scalar equations. The interested reader should refer to appendix F for a detailed derivation in a more general setting.

Mismatched setting. In the general mismatched setting we need to carefully differentiate the teacher and the student. We note ${p}_{{x}_{0}}$ the prior used by the teacher. We also rewrite its channel ${p}_{\mathrm{o}\mathrm{u}\mathrm{t},0}\left(y\vert {\underline{w}}^{\top }\underline{x}\right)$ as the explicit function $y={g}_{0}\left({\underline{w}}^{\top }\underline{x};{\epsilon}\right)$, assuming the noise ε to be distributed according to the standard normal distribution. The tracked quantities are the overlaps,

Equation (101)

along with the auxiliary V, $\hat{q}$, $\hat{m}$ and $\hat{\chi }$:

Equation (102)

Equation (103)

Equation (104)

Equation (105)

Equation (106)

Equation (107)

where we use the notation $\mathcal{N}\left(\cdot ;\cdot ,\cdot \right)$ for the normal distribution, $\mathcal{D}\xi \;$ for the standard normal measure, and the covariance matrix ${\underline{\underline{Q}}}^{\left(t\right)}$ is given at each time step by

$${\underline{\underline{Q}}}^{\left(t\right)}=\left(\begin{matrix}{q}_{0} & {m}^{\left(t\right)}\\ {m}^{\left(t\right)} & {q}^{\left(t\right)}\end{matrix}\right).$$

Due to the self-averaging property, the performance of the reconstruction by the AMP algorithm on an instance of size N can be tracked along the iterations given

Equation (108)

with only minor differences coming from finite-size effects. State evolution also provides an efficient procedure to study from the theoretical perspective the AMP fixed points for a generic model, such as the GLM, as a function of some control parameters. It reports the average results for running the complete AMP algorithm on O(N) variables using a few scalar equations. Furthermore, the state evolution equations simplify further in the Bayes optimal setting.

Bayes optimal setting. When the prior and channel are identical for the student and the teacher, the true unknown signal ${\underline{x}}_{0}$ is in some sense statistically equivalent to the estimate $\hat{\underline{x}}$ coming from the posterior. More precisely, one can prove the Nishimori identities [Iba99, Nis01, OH91] (see [KKM+16] for a concise demonstration and discussion) implying that q = m, V = q0 − m and $\hat{q}=\hat{m}=\hat{\chi }$. Only two equations are then necessary to track the performance of the reconstruction:

Equation (109)

Equation (110)
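In common conventions these two equations take the schematic form (a sketch; the exact prefactors and scalings behind (109) and (110) depend on the definitions above):

$$\hat{q}^{(t)} = \alpha \int \mathrm{d}y \int \mathcal{D}\xi\;\, \mathcal{Z}_{\mathrm{out}}\left(y, \sqrt{q^{(t)}}\xi,\, q_0 - q^{(t)}\right)\, {g}_{\mathrm{out}}\left(y, \sqrt{q^{(t)}}\xi,\, q_0 - q^{(t)}\right)^2,$$

$$q^{(t+1)} = \int \mathrm{d}x_0\, p_x\left(x_0\right) \int \mathcal{D}\xi\;\, {f}_1^x\left(x_0 + \xi/\sqrt{\hat{q}^{(t)}},\, 1/\hat{q}^{(t)}\right)^2,$$

i.e. the high-dimensional reconstruction reduces to two effective scalar problems, one for the channel and one for the prior, coupled through the overlap q and its conjugate $\hat{q}$.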

4.4. Replica method

Another powerful technique from the statistical physics of disordered systems for examining models with infinite range interactions is the replica method. It enables an analytical computation of the quenched free energy via non-rigorous mathematical manipulations. More developed introductions to the method can be found in [CCFM05, MPV86, Nis01].

4.4.1. Steps of a replica computation

The basic idea of the replica computation is to compute the average over the disorder of $\mathrm{log}\;\mathcal{Z}$ by considering the identity $\mathrm{log}\;\mathcal{Z}={\mathrm{lim}}_{n\to 0}\left({\mathcal{Z}}^{n}-1\right)/n$. First the expectation of ${\mathcal{Z}}^{n}$ is evaluated for $n\in \mathbb{N}$, then the n → 0 limit is taken by 'analytic continuation'. Thus the method takes advantage of the fact that the average of a power of $\mathcal{Z}$ is sometimes easier to compute than the average of a logarithm. We illustrate the key steps of the calculation for the partition function of the GLM (52).

Disorder average for the replicated system: coupling of the replicas. The average of ${\mathcal{Z}}^{n}$ for $n\in \mathbb{N}$ can be seen as the partition function of a system of n + 1 non-interacting replicas of $\underline{x}$ indexed by a ∈ {0, ..., n}, where the first replica a = 0 is representative of the teacher and the n other replicas are identically distributed as the student:

Equation (111)

Equation (112)

Equation (113)

To perform the average over the disordered interactions $\underline{\underline{W}}$ we consider the statistics of ${\underline{z}}_{a}=\underline{\underline{W}}{\underline{x}}_{a}$. Recall that ${W}_{\mu i}\sim \mathcal{N}\left({W}_{\mu i};0,1/N\right)$, independently for all μ and i. Consequently, the ${\underline{z}}_{a}$ are jointly Gaussian in the thermodynamic limit with means and covariances

Equation (114)

The overlaps, that we already introduced in the SE formalism, naturally re-appear. We introduce the notation $\underline{\underline{q}}$ for the (n + 1) × (n + 1) overlap matrix. Integrating out the disorder $\underline{\underline{W}}$ shared by the n + 1 replicas will therefore leave us with an effective system of now coupled replicas:

Equation (115)
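For reference, the Gaussian statistics (114) behind this coupling follow directly from the i.i.d. weights (a reminder, in the stated scaling):

$$\mathbb{E}_{\underline{\underline{W}}}\left[{z}_{a\mu}\right] = 0, \qquad \mathbb{E}_{\underline{\underline{W}}}\left[{z}_{a\mu}{z}_{b\nu}\right] = {\delta}_{\mu\nu}\,\frac{1}{N}\sum_{i=1}^{N}{x}_{ai}{x}_{bi} \equiv {\delta}_{\mu\nu}\,{q}_{ab},$$

so that the dependence on the disorder is entirely channeled through the overlaps qab.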

Change of variable for the overlaps: decoupling of the variables. We consider the Fourier representation of the Dirac distribution fixing the consistency between overlaps and replicas,

Equation (116)

where ${\hat{q}}_{ab}$ is purely imaginary, which yields

Equation (117)

where $C\left(\underline{\underline{q}},n\right)$ is related to the normalization of the Gaussian distributions over the ${\underline{z}}_{a}$ variables, and the integrals can be factorized over the is and μs. Thus we obtain

Equation (118)

with

Equation (119)

Equation (120)

where we introduce the notation $\hat{\underline{\underline{q}}}$ for the auxiliary overlap matrix with entries ${\left(\hat{\underline{\underline{q}}}\right)}_{ab}={\hat{q}}_{ab}$, and we omitted a factor 2 which is eventually subleading as N → +∞. The decoupling of the xi and the zμ of the infinite range system yields pre-factors N and M in the exponential arguments. In the thermodynamic limit, we recall that both N and M tend to +∞ while the ratio α = M/N remains fixed. Hence, the integral for the replicated average is easily computed in this limit by the saddle point method:

Equation (121)

where we defined the replica potential ϕ.

Exchange of limits: back to the quenched average. The thermodynamic average of the log-partition is recovered through an a priori risky mathematical manipulation: (i) perform an analytical continuation from $n\in \mathbb{N}$ to n → 0

Equation (122)

and (ii) exchange limits

Equation (123)

Despite the apparent lack of rigor in taking these last steps, the replica method has been proven to yield exact predictions in the thermodynamic limit for different problems and in particular for the GLM [BKM+18, RP16].

Saddle point solution: choice of a replica ansatz. At this point, we are still left with the problem of computing the extrema of $\mathcal{\phi }\left(\underline{\underline{q}},\hat{\underline{\underline{q}}}\right)$. To solve this optimization problem over $\underline{\underline{q}}$ and $\hat{\underline{\underline{q}}}$, a natural assumption is that the replicas, which are a pure artifact of the calculation, are equivalent. This is reflected in a special structure of the overlap matrices, which then only depend on three parameters each,

Equation (124)

here given as an example for n = 3 replicas. Plugging this replica symmetric (RS) ansatz into the expression of $\mathcal{\phi }\left(\underline{\underline{q}},\hat{\underline{\underline{q}}}\right)$, taking the limit n → 0 and looking for the stationary points as a function of the parameters q, m, q12 and $\hat{m}$, $\hat{q}$, ${\hat{q}}_{12}$ recovers a set of equations equivalent to the SE equations, albeit without time indices. Hence the two a priori different heuristics of BP and the replica method are remarkably consistent under the RS assumption.
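For concreteness, such an RS overlap matrix has the structure (a sketch for n = 3; here q0 denotes the teacher self-overlap, fixed by the teacher's prior, m the teacher–student overlap, q the student self-overlap and q12 the overlap between distinct student replicas, with the diagonal conventions possibly differing from (124)):

$$\underline{\underline{q}} = \left(\begin{matrix} q_0 & m & m & m \\ m & q & q_{12} & q_{12} \\ m & q_{12} & q & q_{12} \\ m & q_{12} & q_{12} & q \end{matrix}\right).$$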

Nevertheless, the replica symmetry can be spontaneously broken in the large N limit and the dominating saddle point does not necessarily correspond to the RS overlap matrix. This replica symmetry breaking (RSB) corresponds to substantial changes in the structure of the examined Boltzmann distribution. It is among the great strengths of the replica formalism to naturally capture it. Yet for inference problems falling under the teacher–student scenario, the correct ansatz is always replica symmetric in the Bayes optimal setting [CCFM05, Nis01, ZK16], and we will not investigate this direction further here. The interested reader can refer to the classical references for an introduction to replica symmetry breaking [CCFM05, MPV86, Nis01] in the context of the theory of spin glasses.

Bayes optimal setting. As for SE, the equations simplify in the matched setting, where the first replica, corresponding to the teacher, becomes equivalent to all the others. The replica free energy of the GLM is then given as the extremum of a potential over two scalar variables:

Equation (125)

Equation (126)

Equation (127)

The saddle point equations corresponding to the extremization (125), fixing the values of q and $\hat{q}$, would again be found equivalent to the Bayes optimal SE (109) and (110). This Bayes optimal result is derived in [KMS+12] for the case of a linear channel and Gauss–Bernoulli prior, and can also be recovered as a special case of the low-rank matrix factorization formula (where the measurement matrix is in fact known) [KKM+16].

4.4.2. Assumptions and relation to other mean-field methods

A crucial point in the above derivation of the replica formula is the extensivity of the interactions of the infinite range model, which allowed the factorization of the N scaling of the argument of the exponential integrand in (118). The statistics of the disorder $\underline{\underline{W}}$, and in particular the independence of all the Wμi, were also necessary. While this is an important assumption for the technique to go through, it is possible to relax it for some types of correlation statistics, as we will see in section 5.2.

Note that the replica method directly enforces the disorder averaging and does not provide predictions at the level of single instances. Therefore it cannot be turned into a practical algorithm of reconstruction. Nonetheless, we have seen that the saddle point equations of the replica derivation, under the RS assumption, match the SE equations derived from BP. This is sufficient to theoretically study inference questions under a teacher–student scenario in the Bayes optimal setting, and in particular to predict the MSE following (108).

In the mismatched setting however, the predictions of the replica method under the RS assumption and the equivalent BP conclusions can be wrong. By introducing a symmetry breaking between replicas, the method can sometimes be corrected. It is an important endeavor of the replica formalism to grasp the importance of the overlaps and to connect the form of the replica ansatz to the properties of the joint probability distribution examined. When BP fails on loopy graphs, correlations between variables do not decay with distance, which manifests itself in an RSB phase. Note that there also exist message passing algorithms operating in this regime [AFUZ19, AKUZ19, MM09, MP01, MPZ02, SLL19].

5. Further extensions of interest for learning

In the previous section we presented the classical mean-field approximations focusing on the simple and original examples of the Boltzmann machine (a.k.a. SK model) and the GLM with Gaussian i.i.d. weight matrices. Along the way, we tried to emphasize how the procedures of approximation rely on structural (e.g. connectivity) and statistical properties of the model under scrutiny. In the present section, we will see that extensions of the message passing and replica methods have now broadened the span of applicability of mean-field approximations. We focus on a selection of recent developments of particular interest to study learning problems.

5.1. Streaming AMP for online learning

In learning applications, it is sometimes advantageous, for speed or generalization concerns, to treat only a subset of examples at a time—making for instance the SGD algorithm the most popular training algorithm in deep learning. Sometimes also, the size of current data sets may exceed the available memory. Methods implementing a step-by-step learning as the data arrive are referred to as online, streaming or mini-batch learning, as opposed to offline or batch learning.

In [MKTZ18], a mini-batch version of the AMP algorithm is proposed. It consists in a generalization of assumed density filtering [OW99a, RKI16], which is fully-online, meaning that a single example is received at once, i.e. mini-batches have size 1. The general derivation principle is the same. On the example of the GLM, one imagines receiving at each iteration a subset of the components of $\underline{y}$ from which to reconstruct $\underline{x}$. We denote by ${\underline{y}}_{\left(k\right)}$ these successive mini-batches. Bayes' formula gives the posterior distribution over $\underline{x}$ after seeing k mini-batches

Equation (128)
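Schematically, this is nothing but Bayes' rule applied recursively (a sketch in our notation, omitting the normalization; μ ∈ (k) runs over the components of the kth mini-batch):

$$p\left(\underline{x}\,\vert\,\left\{{\underline{y}}_{\left(\ell\right)}\right\}_{\ell =1}^{k}\right) \;\propto\; p\left(\underline{x}\,\vert\,\left\{{\underline{y}}_{\left(\ell\right)}\right\}_{\ell =1}^{k-1}\right) \prod_{\mu \in \left(k\right)} p_{\mathrm{out}}\left(y_{\mu}\,\vert\, {\underline{w}}_{\mu}^{\top}\underline{x}\right).$$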

This formula suggests the iterative scheme of using as a prior on $\underline{x}$ at iteration k the posterior obtained at iteration k − 1. This idea can be implemented in different approximate inference algorithms, as also noticed by [BBW+13] using a variational method. In the regular version of AMP an effective factorized posterior is given at convergence by the input update functions (85)–(87):

Equation (129)

Plugging this posterior approximation into the iterative scheme yields the mini-AMP algorithm, using the converged values of ${{\lambda }_{\left(\ell \right)}}_{i}$ and ${{\sigma }_{\left(\ell \right)}}_{i}$ at each anterior mini-batch ℓ < k to compute the prior

Equation (130)

where the ${\mathcal{Z}}_{x,i}$ normalize each marginal factor. Compared to a naive mean-field variational approximation of the posterior, AMP takes into account more correlations and is indeed found to perform better in the experiments reported in [MKTZ18]. Another advantage of the AMP-based online inference is that it is amenable to theoretical analysis by a corresponding state evolution [MKTZ18, OW99a, RKI16].

5.2. Algorithms and free energies beyond i.i.d. matrices

The derivations outlined in the previous sections of the equivalent replica, TAP and AMP equations required the weight matrices to have Gaussian i.i.d. entries. In this case, rigorous proofs of the asymptotic exactness of the mean-field solutions were found, for the SK model [Tal06] and the GLM [BKM+18, RP16]. Mean-field inference with different weight statistics is a priori feasible if one finds a way either to perform the corresponding disorder average in the replica computation, to evaluate the corresponding Onsager correction in the TAP equations, or to write a message passing scheme where messages remain uncorrelated (even in the high-connectivity limit we may be interested in).

Efforts to broaden in practice the class of matrices amenable to such mean-field treatments led to a series of works in statistical physics and signal processing with related propositions. Parisi and Potters pioneered this direction by deriving mean-field equations for orthogonal weight matrices using a high-temperature expansion [PP95]. The adaptive TAP approach proposed by Opper and Winther [OW01a, OW01b] further allowed for inference in densely connected graphical models without prior knowledge of the weight statistics. The Onsager term of these TAP equations was evaluated using the cavity method for a given weight sample. The resulting equations were then understood to be a particular case of expectation propagation (EP) [Min01]—belonging to the class of message passing algorithms for approximate inference—yet applied to densely connected models [OW05]. An associated approximation of the free energy called expectation consistency (EC) was additionally derived from the EP messages. Subsequently, Kabashima and collaborators [Kab08, SK08, SK09] focused on the perceptron and the GLM to propose TAP equations and a replica derivation of the free energy for the ensemble of orthogonally invariant random weight matrices. In the singular value decomposition of such weight matrices, $\underline{\underline{W}}=\underline{\underline{U}}\;\underline{\underline{S}}\;{\underline{\underline{V}}}^{\top }\in {\mathbb{R}}^{M{\times}N}$, the orthogonal basis matrices $\underline{\underline{U}}$ and $\underline{\underline{V}}$ are drawn uniformly at random from respectively O(M) and O(N), while the diagonal matrix of singular values $\underline{\underline{S}}$ has an arbitrary spectrum. The consistency between the EC free energy and the replica derivation for orthogonally invariant matrices was verified by [KV14] for signal recovery from linear measurements (the GLM with a purely linear channel). From the algorithmic perspective, Fletcher, Rangan and Schniter [RSF17, SRF16] applied EP to the GLM to obtain the (generalized) vector approximate message passing (G-VAMP) algorithm. Remarkably, these authors proved that the behavior of the algorithm could be characterized in the thermodynamic limit, provided the weight matrix is drawn from the orthogonally invariant ensemble, by a set of scalar state evolution equations, similarly to the AMP algorithm. These equations are again related to the saddle point equations of the replica free energy. Concurrently, Opper, Cakmak and Winther proposed an alternative procedure for solving TAP equations with orthogonally invariant weight matrices in Ising spin systems, relying on an analysis of iterative algorithms [ÇO19, OÇW16]. Finally, [MFC+19] revisits the above cited contributions and provides detailed considerations on their connections.

Below we present the aforementioned free energy as proposed by [Kab08, SK08, SK09], and the G-VAMP algorithm of [SRF16].

5.2.1. Replica free energy for the GLM in the Bayes optimal setting

Consider the ensemble of orthogonally invariant weight matrices $\underline{\underline{W}}$ whose spectral density ${\sum }_{i=1}^{N}\delta \left(\lambda -{\lambda }_{i}\right)/N$ of the 'square' $\underline{\underline{W}}{\underline{\underline{W}}}^{\top }$ converges in the thermodynamic limit N → +∞ to a given density ρλ(λ). The quenched free energy of the GLM in the Bayes optimal setting derived in [Kab08, SK09] writes

Equation (131)

Equation (132)

where ${\mathcal{I}}_{x}$ and ${\mathcal{I}}_{z}$ were defined as (126) and (127) and the spectral density ρλ(λ) appears via its mean ${\lambda }_{0}={\mathbb{E}}_{\lambda }\left[\lambda \right]$ and in the definition of

Equation (133)

Gaussian random matrices are a particular case of the considered ensemble. Their singular values are characterized asymptotically by the Marcenko–Pastur distribution [MP67]. In this case, one can check that the above expression reduces to (125). More generally, note that ${\mathcal{J}}_{z}$ generalizes ${\mathcal{I}}_{z}$.

5.2.2. Vector approximate message passing for the GLM

The VAMP algorithm consists in writing EP [Min01] with Gaussian messages on the factor graph representation of the GLM posterior distribution given in figure 5. The estimation of the signal $\underline{x}$ is decomposed onto four variables, two duplicates of $\underline{x}$ itself and two duplicates of the linear transformation $\underline{z}=\underline{\underline{W}}\underline{x}$. The potential functions ψx and ψz of the factors connecting copies of the same variable are Dirac distributions enforcing their equality. The factor node linking ${\underline{z}}^{\left(2\right)}$ and ${\underline{x}}^{\left(2\right)}$ is assumed Gaussian with variance going to zero. The procedure of derivation, equivalent to the projection of the BP algorithm onto Gaussian messages, is recalled in appendix E and leads to algorithm 2. Like for AMP, the algorithm features some auxiliary variables introduced along the derivation. At convergence, the duplicates ${\hat{\underline{x}}}_{1}$, ${\hat{\underline{x}}}_{2}$ (and ${\hat{\underline{z}}}_{1}$, ${\hat{\underline{z}}}_{2}$) are equal and either can be returned by the algorithm as an estimator. For readability, we omitted the time indices in the iterations, which here simply follow the indicated update order.


Figure 5. Factor graph representation of the GLM for the derivation of VAMP.


Algorithm 2. Vector approximate message passing

Input: vector of observations $\underline{y}\in {\mathbb{R}}^{M}$ and weight matrix $\underline{\underline{W}}\in {\mathbb{R}}^{M{\times}N}$:
Initialize: ${\underline{\underline{A}}}_{1}^{x}$, ${\underline{B}}_{1}^{x}$, ${\underline{\underline{A}}}_{1}^{z}$, ${\underline{B}}_{1}^{z}$
repeat
Equation (134)
Equation (135)
Equation (136)
Equation (137)
Equation (138)
Equation (139)
Equation (140)
Equation (141)
until convergence
Output: signal estimate ${\hat{\underline{x}}}_{1}\in {\mathbb{R}}^{N}$, and estimated covariance ${\underline{\underline{{C}^{x}}}}_{1}\in {\mathbb{R}}^{N{\times}N}$

For a given instance of the GLM inference problem, i.e. a given weight matrix $\underline{\underline{W}}$, one can always launch either the AMP algorithm or the VAMP algorithm to attempt the reconstruction. If the weight matrix has i.i.d. zero mean Gaussian entries, the two strategies are conjectured to be equivalent and GAMP is provably convergent in certain settings [RSF14]. If the weight matrix is not Gaussian but orthogonally invariant, then only VAMP is expected to always converge. More generally, even in cases where none of these assumptions is verified, VAMP has been observed to have fewer convergence issues than AMP.
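As a feel for the two-step structure of algorithm 2, here is a minimal runnable sketch of the VAMP idea in the special case of a purely linear channel and a standard normal prior, so that both the denoiser and the LMMSE step have closed forms; it follows the alternating structure of [RSF17, SRF16], but the sizes, spectrum and all variable names are illustrative choices of ours:

import numpy as np

rng = np.random.default_rng(1)
N, M, Delta = 200, 400, 0.01                   # illustrative sizes and noise level
# An orthogonally invariant W = U S V^T with an arbitrary spectrum:
U, _ = np.linalg.qr(rng.normal(size=(M, M)))
V, _ = np.linalg.qr(rng.normal(size=(N, N)))
s = rng.uniform(0.5, 1.5, size=N)
W = (U[:, :N] * s) @ V.T
x0 = rng.normal(size=N)                        # teacher signal, prior N(0,1)
y = W @ x0 + np.sqrt(Delta) * rng.normal(size=M)

# VAMP alternates a prior denoising step and an LMMSE channel step,
# exchanging Gaussian pseudo-priors N(r, 1/gamma) between the two halves.
r1, gam1 = np.zeros(N), 1e-3
WtW, Wty = W.T @ W, W.T @ y
for t in range(15):
    # Denoising with the prior N(0,1): a simple shrinkage with slope alpha1
    xhat1 = gam1 * r1 / (gam1 + 1.0)
    alpha1 = gam1 / (gam1 + 1.0)
    eta1 = gam1 / alpha1
    gam2 = eta1 - gam1
    r2 = (eta1 * xhat1 - gam1 * r1) / gam2
    # LMMSE estimation using the channel y = W x + N(0, Delta)
    Cov = np.linalg.inv(WtW / Delta + gam2 * np.eye(N))
    xhat2 = Cov @ (Wty / Delta + gam2 * r2)
    alpha2 = gam2 * np.trace(Cov) / N
    eta2 = gam2 / alpha2
    gam1 = eta2 - gam2
    r1 = (eta2 * xhat2 - gam2 * r2) / gam1
    print(t, float(np.mean((xhat2 - x0) ** 2)))  # reconstruction MSE

At convergence the two estimates xhat1 and xhat2 coincide, mirroring the equality of the duplicated variables mentioned above.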

Like for AMP, a state evolution can also be derived for VAMP (which was actually directly proposed for the multi-layer GLM [FRS18]). It rigorously characterizes the behavior of the algorithm when $\underline{\underline{W}}$ is orthogonally invariant. One can also verify that the SE fixed points can be mapped to the solutions of the saddle point equations of the replica free energy (131) (see section 1 of the supplementary material of [GML+18]); the robust algorithmic procedure can therefore advantageously be used to compute the fixed points to be plugged into the replica potential to approximate the free energy.

5.3. Multi-value AMP

A recent extension of AMP consists in treating the simultaneous reconstruction of multiple signals undergoing the same mixing step from the corresponding multiple observations. This situation is of particular interest for learning, appearing for instance in the teacher–student set-up of committee machines. The authors of [AMB+18] showed that the different weight vectors of these neural networks can be inferred from the knowledge of training input–output pairs by introducing this extended version of AMP. Here the same matrix of training input data mixes the teacher weight vectors to produce the training output data. For consistency with the examples used in the previous sections, we here formalize the algorithm for the GLM. Nevertheless, this is just a rewriting of the committee algorithm of [AMB+18].

Concretely, let us consider a GLM with P pairs of signals and observations ${\left\{{\underline{x}}_{\left(k\right)},{\underline{y}}_{\left(k\right)}\right\}}_{k=1}^{P}$, gathered in matrices $\underline{\underline{X}}\in {\mathbb{R}}^{N{\times}P}$ and $\underline{\underline{Y}}\in {\mathbb{R}}^{M{\times}P}$. We are interested in the posterior distribution

Equation (142)

Compared to the traditional GLM measure (51), scalar variables are here replaced by vectors in ${\mathbb{R}}^{P}$. In appendix F we present a derivation, starting from BP, of the corresponding AMP presented in algorithm 3. The major difference with the scalar GLM is the necessity of tracking covariance matrices between the P different variables instead of simple variances.

Algorithm 3. Approximate message passing for multi-value GLM

Input: matrix $\underline{\underline{Y}}\in {\mathbb{R}}^{M{\times}P}$ and matrix $\underline{\underline{W}}\in {\mathbb{R}}^{M{\times}N}$:
Initialize: ${\hat{\underline{x}}}_{i}$, ${\underline{\underline{{C}^{x}}}}_{i}\quad \forall i$ and ${\underline{{g}_{\mathrm{o}\mathrm{u}\mathrm{t}}}}_{\mu }$, ${\partial }_{\underline{\omega }}{\underline{\underline{{g}_{\mathrm{o}\mathrm{u}\mathrm{t}}}}}_{\mu }\quad \forall \mu $
repeat
(1) Estimate mean and variance of ${\underline{z}}_{\mu }$ given current ${\hat{\underline{x}}}_{i}$
Equation (143)
Equation (144)
(2) Estimate mean and variance of the gap between optimal ${\underline{z}}_{\mu }$ and ${\underline{\omega }}_{\mu }$ given ${\underline{y}}_{\mu }$
Equation (145)
Equation (146)
(3) Estimate mean and variance of ${\underline{x}}_{i}$ given current optimal ${\underline{z}}_{\mu }$
Equation (147)
Equation (148)
(4) Estimate mean and variance of ${\underline{x}}_{i}$ augmented with the information from the prior
Equation (149)
Equation (150)
until convergence

This AMP algorithm can also be analyzed by a state evolution. In [AMB+18], the teacher–student matched setting of the committee machine is examined through the replica approach and the Bayes optimal state evolution equations are obtained as the saddle point equations of the replica free energy. In appendix F we present the alternative derivation of the state evolution equations from the message passing, without assuming a priori matched teacher and student, as done in [GBKZ19].

5.4. Model composition and multi-layer inference

Another recent and ongoing direction of extension of mean-field methods is the combination of solutions of elementary models to tackle more sophisticated inference questions. The graphical representations of probability distributions (briefly reintroduced in appendix B) are here of great help. In a complicated joint probability distribution, it is sometimes possible to identify well-known sub-models, such as the GLM or the RBM. Understanding how and when it is justified to plug in different solutions is of course non-trivial, and this is a very promising direction of research.

A particularly relevant extension in this direction is the treatment of multi-layer GLMs, or in other words multi-layer neural networks. With depth L, hidden layers denoted ${\underline{u}}^{\ell }\in {\mathbb{R}}^{{N}_{\ell }}$, and weight matrices ${\underline{\underline{{\Phi}}}}^{\ell }\in {\mathbb{R}}^{{N}_{\ell +1}{\times}{N}_{\ell }}$, it formally corresponds to the statistical model

Equation (151)

Equation (152)

Equation (153)

In [MKMZ17] a multi-layer version of AMP is derived, assuming Gaussian i.i.d. weight matrices, along with a state evolution and a replica free energy. Remarkably, the asymptotic replica prediction was mathematically proven correct in [GML+18]. In [FRS18], the multi-layer version of the VAMP algorithm is derived with the corresponding state evolution for orthogonally invariant weight matrices. The matching free energies were obtained independently in [GML+18] by the generalization of a replica result and by [RP16] from a different argument.

In the next paragraph we sketch the derivation of the two-layer AMP presented in algorithm 4; it provides a good intuition of the composition ability of mean-field inference methods.

Algorithm 4. Generalized approximate message passing for the two-layer GLM

Input: vector $\underline{y}\in {\mathbb{R}}^{M}$ and matrices ${\underline{\underline{{\Phi}}}}^{0}\in {\mathbb{R}}^{M{\times}N}$, ${\underline{\underline{{\Phi}}}}^{1}\in {\mathbb{R}}^{Q{\times}M}$:
Initialize: $\hat{\underline{x}}\in {\mathbb{R}}^{N}$, ${\underline{C}}^{x}\in {\mathbb{R}}^{N}$, $\underline{\hat{u}}\in {\mathbb{R}}^{M}$, ${\underline{C}}^{u}\in {\mathbb{R}}^{M}$, $\underline{{g}_{\mathrm{o}\mathrm{u}\mathrm{t}}^{1}}\in {\mathbb{R}}^{M}$, ${\underline{\partial {g}_{\mathrm{o}\mathrm{u}\mathrm{t}}}}^{1}\in {\mathbb{R}}^{M}$, $\underline{{g}_{\mathrm{o}\mathrm{u}\mathrm{t}}^{2}}\in {\mathbb{R}}^{Q}$, ${\underline{\partial {g}_{\mathrm{o}\mathrm{u}\mathrm{t}}}}^{2}\in {\mathbb{R}}^{Q}$ and
t = 0.
repeat
(1) Update auxiliary variables of second layer:
Equation (158)
Equation (159)
Equation (160)
Equation (161)
Equation (162)
Equation (163)
(2) Update auxiliary variables of first layer:
Equation (164)
Equation (165)
Equation (166)
Equation (167)
Equation (168)
Equation (169)
(3) Update means and variances of variables of both layers, $\underline{x}$ and $\underline{u}$:
Equation (170)
Equation (171)
Equation (172)
Equation (173)
t = t + 1
until convergence
Output: signal estimate $\hat{\underline{x}}\in {\mathbb{R}}^{N}$, and estimated variance ${\underline{C}}^{x}\in {\mathbb{R}}^{N}$

Heuristic derivation of two-layer AMP. The derivation of the multi-layer AMP follows steps identical to the single-layer derivation presented in section 4.3, yet for a more complicated factor graph and consequently a larger collection of messages. Without conducting the lengthy procedure, one can form an intuition for the resulting algorithm starting from the single-layer AMP. The corresponding factor graph is given in figure 6. Compared to the single-layer case (see figure 3), an interface with a set of M = N1 hidden variables uμ is inserted between the N = N0 signals xi and the Q = N2 observations ya. In the neighborhood of the inputs xi the factor graph is identical to the single-layer case and the input functions can be defined from a normalization partition function identical to (85),

Equation (154)

yielding updates (170) and (171). Similarly, the neighborhood of the observations ya is also unchanged, and the updates (160) and (161) follow from the definition of

Equation (155)

identical to the single-layer one (79). At the interface however, the variables uμ play the role of outputs for the first GLM and of inputs for the second GLM, which translates into a normalization partition function of mixed form.

Updates (166) and (167) are obtained by considering that the second layer acts as an effective channel for the first layer, i.e. from the normalization interpreted as

Equation (156)

Finally, update equations (172) and (173) are in turn derived considering that the first layer defines an effective prior for the hidden variables and rewriting the normalization as

Equation (157)

The rest of the algorithm updates follow as usual from the self-consistency between the different variables introduced, as they correspond to different parametrizations of the same marginals. The schedule of updates and the time indexing reported in algorithm 4 result from the complete derivation starting from the BP messages. The generalization of the algorithm to an arbitrary number of layers is easily obtained by repeating the heuristic arguments presented here.


Figure 6. Factor graph representation of a generic two-layer GLM.


6. Some applications

6.1. A brief pre-deep learning history

The application of mean-field methods of inference to machine learning, and in particular to neural networks, already has a long history and significant contributions to its record. Here we briefly review some historical connections anterior to the deep learning revival of neural networks in the 2010s.

Statistical mechanics of learning. In the 80s and 90s, a series of works pioneered the analysis of learning with neural networks through the statistical physics lens. By focusing on simple models with simple data distributions, and relying on the mean-field method of replicas, these papers managed to predict quantitatively important properties such as capacities, the number of training data points that could be memorized by a model, or learning curves, the generalization error (or population risk) of a model as a function of the size of the training set. This effort was initiated by the study of the Hopfield model [AGS85], an undirected neural network providing associative memory [Hop82]. The analysis of feed-forward networks with simple architectures followed (among which [Gar87, Gar88, MZ95, MZ04, OH91, OW96]; see also the reviews [EV01, Opp95, Saa99, SST92, WRB93]). The dynamics of simple learning problems was also analyzed through a mean-field framework (not covered in the previous sections), initially in the simplifying case of online learning with an infinite training set [BS95, Saa99, SS95a, SS95b] but also with finite data [LM99, SB97].

Physicists, accustomed to studying natural phenomena, fruitfully brought the tradition of modeling to their investigation of learning, which translated into assumptions of random data distributions or teacher–student scenarios. Their approach was in contrast to the focus of machine learning theorists on worst-case guarantees: bounds for a hypothesis class that hold for any data distribution (e.g. Vapnik–Chervonenkis dimension and Rademacher complexity). The originality of the physicists' approach, along with the heuristic character of the derivations of mean-field approximations, may nonetheless explain the minor impact of their theoretical contributions in the machine learning community at the time.

Mean-field algorithms for practitioners. Along with these contributions to the statistical mechanics theory of learning, new practical training algorithms based on mean-field approximations were also proposed in the same period (see e.g. [OW96, Won95, Won97]). Yet, before the deep learning era, mean-field methods probably had a greater influence on the practice of unsupervised learning through density estimation, where we saw that approximate inference is almost always necessary. In particular the simplest method of naive mean-field, our first example in section 4, was easily adopted and even extended by statisticians (see e.g. [WJ08] for a recent textbook and [BKM17] for a recent review). The belief propagation algorithm is another example of a mean-field method well known to machine learners, as it was actually discovered in both communities. Yet, for both methods, early applications rarely involved neural networks and rather relied on simple probabilistic models such as mixtures of elementary distributions. They also did not take full advantage of the simultaneous developments in the statistical physics of the mean-field theory of disordered systems.

Transferring advanced mean-field methods. In this context, the inverse Ising problem has been a notable exception. The underlying question, rooted in theoretical statistical physics, is to infer the parameters of an Ising model given a set of equilibrium configurations. This is related to the unsupervised learning of the parameters of a Boltzmann machine (without hidden units) in the machine learning jargon, although it does not necessarily rely on maximum likelihood estimation using gradients. The corresponding Boltzmann distribution, with pairwise interactions, is remarkable, and not only to physicists: it is the least biased model under the assumption of fixed first and second moments, in the sense that it maximizes the entropy. For this problem, physicists proposed dedicated developments of advanced mean-field methods for applications in other fields, in particular in bio-physics (see [NZB17] for a recent review). A few works even considered the case of Boltzmann machines with hidden units, more common in the machine learning community [Gal93, PA87].

Beyond the specific case of Boltzmann machines, the language barrier between communities is undoubtedly a significant hurdle delaying the transfer of developments from one field to the other. In machine learning, the potential of the most recent progress in mean-field approximations was advocated in a pioneering workshop mixing communities in 1999 [OS01]. Yet the first widely-used application is possibly the approximate message passing (AMP) algorithm for compressed sensing in 2009 [DMM09]. Meanwhile, in the different field of constraint satisfaction problems (CSPs), there have been much tighter connections between developments in statistical physics and algorithmic solutions. There, the first popular application of advanced mean-field methods outside of physics, beyond naive mean-field and belief propagation, is probably the survey propagation algorithm [MPZ02] in 2002. It borrows from the 1RSB cavity method (not treated in the present paper) to efficiently solve certain types of CSPs.

6.2. Some current directions of research

The great leap forward in the performance of machine learning with neural networks brought by deep learning algorithms, along with the multitude of theoretical and practical challenges it has opened, has re-ignited the interest of physicists in the theory of neural networks. In this section, far from being exhaustive, we review some current directions of research leveraging mean-field approximations. Another relevant review is [CCC+19], which provides references both for machine learning research helped by physics methods and, conversely, research in physics using machine learning.

The works presented below do not necessarily implement one of the classical inference methods presented in sections 4 and 5. In some cases, the mean-field limit corresponds to some asymptotic setting where the problem simplifies: typically some correlations weaken, fluctuations are averaged out by concentration effects and, as a result, ad hoc methods of resolution can be designed. Thus, in the following contributions, different assumptions are considered to serve different objectives. For instance, some take an infinite-size limit, some assume random (instead of learned) weights or vanishing learning rates. Hence, there is no such thing as a single mean-field theory of deep neural networks. The works cited below are rather complementary pieces of a great puzzle.

6.2.1. Neural networks for unsupervised learning

Fundamental study of learning. Given their similarity with the Ising model, restricted Boltzmann machines have unsurprisingly attracted a lot of interest. Studying an ensemble of RBMs with random parameters using the replica method, Tubiana and Monasson [TM17] evidenced different regimes of typical patterns of activation in the hidden units and identified control parameters such as the sparsity of the weights, their strength (playing the role of an effective temperature) or the type of prior for the hidden layer. Their study contributes to the understanding of the conditions under which RBMs can represent high-order correlations between input units, albeit without including data and learning in their model. Barra and collaborators [BGST17, BGST18] exploited the connections between the Hopfield model and RBMs to characterize RBM learning understood as an associative memory. Relying again on replica calculations, they characterized the retrieval phase of RBMs. Mézard [Méz17] also re-examined retrieval in the Hopfield model using its RBM representation and message passing, showing in particular that the addition of correlations between memorized patterns could still allow for a mean-field treatment at the price of a supplementary hidden layer in the Boltzmann machine representation. This result remarkably draws a theoretical link between correlations in the training data and the necessity of depth in neural network models.

While the above results do not include a characterization of the learning driven by data, a few others were able to discuss the dynamics of training. Huang [Hua17] studied with the replica method and TAP equations the Bayesian learning of an RBM with a single hidden unit and binary weights. Barra and collaborators [BGST17] empirically studied a teacher–student scenario of unsupervised learning by maximum likelihood on samples of a Hopfield model, which they could compare to their theoretical characterization of the retrieval phase. Decelle and collaborators [DFF17, DFF18] introduced an ensemble of RBMs characterized by the spectral properties of the weight matrix and derived the typical dynamics of the corresponding order parameters during learning driven by data. Beyond RBMs, analyses of learning in other generative models are starting to appear [WHLP18].

Training algorithms based on mean-field methods. Beyond bringing theoretical insights, mean-field methods also prove useful for building tractable estimators of the likelihood in generative models, which in turn serve to design novel training algorithms.

For Boltzmann machines, this direction was already investigated in the 80s and 90s [Gal93, Hin89, KR98, PA87], albeit in small models with binary units and for artificial data sets very different from modern machine learning benchmarks. More recently, a deterministic training based on naive mean-field was tested on RBMs [Tie08, WH02]. On toy deep learning data sets, the algorithm was found to perform poorly compared to both CD and PCD, the commonly employed approximate Monte Carlo methods. However, going beyond naive mean-field to the second-order TAP expansion bridges the gap in efficiency [GTK15, TGM+18]. Additionally, the deterministic mean-field framework offers a tractable way of evaluating the learning success by exploiting the mean-field observables to visualize the representation learned by RBMs. Interestingly, high-temperature expansions of estimators different from the maximum likelihood have also been recently proposed as efficient inference methods for the inverse Ising problem [LVMC18].
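To give a flavor of these deterministic schemes, the fixed-point iteration at their core can be sketched as follows at first (naive mean-field) order for binary {0, 1} units; the second-order TAP versions of [GTK15, TGM+18] add an Onsager correction inside the sigmoids. Variable names are ours and damping is one simple choice to ease convergence.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def rbm_naive_mf_magnetizations(W, b, c, n_iter=200, damping=0.5):
        # W: (N, M) weights, b: (N,) visible biases, c: (M,) hidden biases.
        m_v, m_h = sigmoid(b), sigmoid(c)       # factorized initialization
        for _ in range(n_iter):
            m_h_new = sigmoid(c + W.T @ m_v)    # naive mean-field update, hidden layer
            m_v_new = sigmoid(b + W @ m_h_new)  # naive mean-field update, visible layer
            m_h = damping * m_h + (1.0 - damping) * m_h_new
            m_v = damping * m_v + (1.0 - damping) * m_v_new
        return m_v, m_h

The resulting magnetizations provide a deterministic estimate of the model expectations entering the log-likelihood gradient, in place of Monte Carlo samples.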

By construction, variational auto-encoders (VAEs) rely on a variational approximation of the likelihood. In practice, the posterior distribution of the latent representation given an input (see section 2.2) is typically approximated by a factorized Gaussian distribution with mean and variance parametrized by neural networks. The factorization assumption relates the method to a naive mean-field approximation.
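Concretely, with a standard Gaussian prior on the latent variable $\underline{z}$, a VAE maximizes the evidence lower bound (in usual notation)

$\mathrm{log}\;{p}_{\theta }\left(\underline{x}\right){\geqslant}{\mathbb{E}}_{{q}_{\phi }\left(\underline{z}\vert \underline{x}\right)}\left[\mathrm{log}\;{p}_{\theta }\left(\underline{x}\vert \underline{z}\right)\right]-\mathrm{K}\mathrm{L}\left({q}_{\phi }\left(\underline{z}\vert \underline{x}\right){\Vert}p\left(\underline{z}\right)\right),\qquad {q}_{\phi }\left(\underline{z}\vert \underline{x}\right)=\prod _{k}\mathcal{N}\left({z}_{k};{\mu }_{k}\left(\underline{x}\right),{\sigma }_{k}^{2}\left(\underline{x}\right)\right),$

where the factorization of qφ over the components of $\underline{z}$ is precisely a naive mean-field ansatz.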

Structured Bayesian priors. With the progress of unsupervised learning, the idea of using generative models as expressive priors has emerged.

For reconstruction tasks, when a set of typical signals is available a priori, the latter can serve as a training set to learn a model of the underlying data distribution with a generative model. Subsequently, in the reconstruction of a new signal of the same type, the generative model can serve as a Bayesian prior. In particular, the idea to exploit RBMs in CS applications was pioneered by [DHD12] and [TDK16], who trained binary RBMs using contrastive divergence (to locate the support of the non-zero entries of sparse signals) and combined them with an AMP reconstruction. They demonstrated drastic improvements in the reconstruction with structured learned priors compared to the usual sparse unstructured priors. The approach, which requires combining the AMP reconstruction for CS with the RBM TAP inference, was further generalized in [TGM+18, TMC+16] to real-valued distributions. In the line of these applications, several works have also investigated using feed forward generative models for inference tasks. Using this time multi-layer VAMP inference, Rangan and co-authors [PSRF19] showed that VAEs could help in-paint partially observed images. Note also that a different line of works, mainly considering GANs, examined the same type of applications without resorting to mean-field algorithms [BJPD17, HLV18, HV18, MV18]. Instead they performed the inference via gradient descent and back-propagation.
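Schematically, for a linear measurement $\underline{y}=\underline{\underline{W}}\,\underline{x}$ corrupted by Gaussian noise of variance Δ, these approaches replace the separable prior in the reconstruction posterior with the trained generative model,

$p\left(\underline{x}\vert \underline{y}\right)\propto {\mathrm{e}}^{-{{\Vert}\underline{y}-\underline{\underline{W}}\,\underline{x}{\Vert}}^{2}/2{\Delta}}\;{p}_{\text{model}}\left(\underline{x}\right),$

where pmodel is, for instance, the marginal over the visible units of a trained RBM; the TAP inference of the RBM block is then nested inside the AMP iterations for the linear channel.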

Another application of generative priors is to model synthetic data sets with structure. In [ALM+19, GMKZ19, GML+18], the authors designed learning problems amenable to a mean-field theoretical treatment by assuming the inputs to be drawn from a generative prior (albeit with untrained weights so far). This approach goes beyond the vanilla teacher–student scenario, where input data are typically unstructured with i.i.d. components. This is a crucial direction of research, as the structure of data appears to be an important ingredient in understanding the puzzling efficiency of deep learning.

6.2.2. Neural networks for supervised learning

New results in the replica analysis of learning. The classical replica analysis of learning with simple architectures, following the bases set by Gardner and Derrida 30 years ago, continues to be explored. Among the most prominent results, Kabashima and collaborators [Kab08, SK08, SK09] extended the mean-field treatment of the perceptron from data matrices with i.i.d. entries to random orthogonal matrices, a much larger class of random matrices in which entries can be correlated. More recently, a series of works explored in depth the specific case of the perceptron with binary weight values for classification on random inputs. Replica computations showed that the space of solutions is dominated in the thermodynamic limit by isolated solutions [HK14, HWK13], but also that subdominant dense clusters of solutions exist with good generalization properties in the teacher–student scenario [BBC+16, BGK+18, BIL+15]. This observation inspired a novel training algorithm [CCS+17]. The simple two-layer architecture of the committee machine was also re-examined recently [AMB+18]. In the teacher–student scenario, a computationally hard phase of learning was evidenced by comparing a message passing algorithm (believed to be optimal) with the replica prediction. In this work, the authors also proposed a strategy of proof for the replica prediction.

Signal propagation in depth. Mean-field approximations can also help understand the role and implications of depth by characterizing signal propagation in neural networks. The following papers consider the limit in which the width of each layer goes to infinity. In this limit, Sompolinsky and collaborators characterized how neural networks manage to progressively separate data manifolds fed as inputs [CCLS19, CLS18, KS16]. Another line of works focused on the initialization of neural networks (i.e. with random weights), and found an order-to-chaos transition in the signal propagation as a function of hyperparameters of training [PLR+16, SBG+17]. As a result, the authors could formulate recommendations on combinations of hyperparameters to practitioners. This type of analysis was furthermore generalized to convolutional networks [NXL+19], recurrent networks [GCC+19] and networks with batch-normalization regularization [YPR+19]. The space of functions spanned by deep random networks in the infinite-size limit was also studied by [LS18, LS20], using the different but related approach of the generating functional analysis. Yet another mean-field argument, this time relying on a replica computation, allowed the computation of the mutual information between layers of large non-linear deep neural networks with orthogonally invariant weight matrices [GML+18]. Using this method, the mutual information can be followed precisely along the learning for an appropriate teacher–student scenario. The strategy offers an experimental test bed to characterize possible links between the generalization ability of deep neural networks and information compression phases in the training (see [SBD+18, SZT17, TZ15]).
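As an indication of the flavor of these computations, the length map at the heart of [PLR+16, SBG+17] can be iterated numerically in a few lines. The sketch below evaluates the Gaussian expectation by Monte Carlo, with illustrative parameter values of our choosing.

    import numpy as np

    def length_map(q, sigma_w, sigma_b, phi=np.tanh, n_samples=200000, seed=0):
        # One layer of the mean-field recursion for the squared length of activations
        # in a wide random network: q_{l+1} = sigma_w^2 E_z[phi(sqrt(q_l) z)^2] + sigma_b^2.
        z = np.random.default_rng(seed).standard_normal(n_samples)
        return sigma_w**2 * np.mean(phi(np.sqrt(q) * z) ** 2) + sigma_b**2

    q = 1.0
    for layer in range(50):
        q = length_map(q, sigma_w=1.5, sigma_b=0.1)  # flows to a fixed point q*

A similar recursion for the correlation between two different inputs is what locates the order-to-chaos transition in the (σw, σb) plane.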

Dynamics of SGD learning in simple networks and generalization. A number of different mean-field limits have led to interesting analyses of the dynamics of gradient descent learning. In particular, the works mentioned below contribute to shed light on the generalization power of neural networks in the so-called overparametrized regimes, that is where the number of parameters largely exceeds either the number of training points or the underlying degrees of freedom of the teacher rule. In linear networks first, an exact description in the high-dimensional limit was obtained for the teacher–student setup by [AS17] using random matrix theory. The generalization was predicted to improve with the overparametrization of the student. Non-linear networks with one infinitely wide hidden layer were considered by [CB18b, MMN18, RVE18, SS18], who showed that gradient descent converges to a finite generalization error. Their results are related to others obtained in a slightly different limit of infinitely large neural networks [JGH18]. For arbitrarily deep networks, Jacot and collaborators [JGH18] showed that, in a certain setting, gradient descent effectively performs a kernel regression, with a kernel function that remains fixed throughout training as the size of the layers increases. In both related limits, the absence of divergence accounts for the fact that generalization does not deteriorate despite the explosion of the number of parameters. The relationship between the two above limits was discussed in [CB18a, GSJW19, MMM19]. Subsequent works leveraged the formalism introduced in [JGH18]. Scalings of the generalization error as a function of network size were derived by [GJS+19]. Other authors focused on the characterization of the network output function in this limit, which takes the form of a Gaussian process [LXS+19]. This fact was probably first noticed by Opper and Winther with one hidden layer [OW99b], whom it inspired to propose a TAP-based Bayesian classification method using Gaussian processes. Finally, yet another limit was analyzed by [GAS+19], considering a finite number of hidden units with an infinitely wide input. Following classical works on the mean-field analysis of online learning (not covered in the previous sections) [BS95, Saa99, SS95a, SS95b], a closed set of equations can be derived and analyzed for a collection of overlaps. Note that these are the same order parameters as in replica computations. The resulting learning curves evidence the necessity of multi-layer learning to observe the improvement of generalization with overparametrization. An interplay between optimization, architecture and data sets seems necessary to explain the phenomenon.
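For reference, the kernel of [JGH18], the neural tangent kernel, is defined from the network output function $f\left(\underline{x};\underline{\theta }\right)$ as

${\Theta}\left(\underline{x},{\underline{x}}^{\prime }\right)=\sum _{p}\frac{\partial f\left(\underline{x};\underline{\theta }\right)}{\partial {\theta }_{p}}\frac{\partial f\left({\underline{x}}^{\prime };\underline{\theta }\right)}{\partial {\theta }_{p}};$

the result of [JGH18] is that, with an appropriate parametrization, Θ concentrates at initialization and remains constant during gradient-flow training as the widths grow, so that the network output evolves as in kernel regression with kernel Θ.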

7. Conclusion

This review aimed to present in a pedagogical way a selection of inference methods coming from statistical physics. In the past and current lines of research that were also reviewed, these methods are sometimes turned into practical and efficient inference algorithms, and sometimes serve as the cornerstone of theoretical computations.

What is missing. There are more mean-field methods than were covered here. In particular the cavity method [MPV86], closely related to message passing algorithms and the replica formalism, played a crucial role in the physics of spin glasses. Note also that we assumed replica symmetry, which is only guaranteed to be correct in the Bayes optimal case. References for introductions to replica symmetry breaking are [CCFM05, MPV86], and newly proposed message passing algorithms with RSB are [AFUZ19, AKUZ19, SLL19]. The methods of analysis of online learning algorithms pioneered by [BS95, SS95a, SS95b] and reviewed in [Saa99] also deserve the name of classical mean-field analysis. They are currently actively serving research efforts in deep learning theory [GAS+19]. Another important method is the off-equilibrium mean-field theory [CHS93, CK93, CS88], recently used for example to characterize a specific type of neural networks called graph neural networks [KTO18] or to study properties of gradient flows [MKUZ19].

On the edge of validity. We have also touched upon the limitations of the mean-field approach. To start with, the thermodynamic limit ignores finite-size effects. Moreover, different ways of taking the thermodynamic limit for the same problem sometimes lead to different results. Also, the necessary assumptions of randomness for weights or data matrices are sometimes in clear contrast with real applications.

Thus, the temptation to abusively apply results from one field to the other can be a dangerous pitfall of the interdisciplinary approach. We could mention here the characterization of the dynamics of optimization. While physicists have extensively studied Langevin dynamics with Gaussian white noise, the continuous-time limit of SGD is unfortunately not equivalent to it in the general case. While some works attempt to draw insights from this analogy using strong assumptions (e.g. [CHM+15, JKA+17]), others seek precisely to understand the differences between the two dynamics in neural network optimization (e.g. [BJSG+18, SSG19]). Alternatively, another good reason to consider the power of mean-field methods lies in an observation rooted in the tradition of theoretical physics: one can learn from models a priori far from the exact neural networks of interest, provided they retain some key properties while being amenable to theoretical characterization. For example, [MKUZ19] studied a high-dimensional non-convex optimization problem inspired by the physics of spin glasses, apparently unrelated to neural networks, but gained insights on the dynamics of gradient descent (and Langevin dynamics) that are of primary interest. Another example of this surely promising approach is [WHLP18], who built and analyzed a minimal model of GANs.

Moreover, the possibility to combine well-studied simple settings to obtain a mean-field theory for more complex models, as recently demonstrated in a series of works [ALM+19, FRS18, GML+18, MKMZ17, TDK16, TGM+18, TMC+16], constitutes an exciting direction of research that should considerably broaden the range of applications of mean-field methods.

Patching the pieces together and going further. Thus the mean-field approach alone cannot to this day provide complete answers to the still numerous puzzles on the way toward a deep learning theory. Yet, by considering different limits and special cases, and by combining solutions to approach ever more complex models, it should help uncover more and more corners of the big black box. Hopefully, intuition gained at the edges will help reveal the broader picture.

Acknowledgments

This paper is based on the introductory chapters of my PhD dissertation, written under the supervision of Florent Krzakala and in collaboration with Lenka Zdeborová, to both of whom I am very grateful. I would also like to thank Benjamin Aubin, Cédric Gerbelot, Adrien Laversanne-Finot, and Guilhem Semerjian for their comments on the manuscript. I gratefully acknowledge the support of the 'Chaire de recherche sur les modèles et sciences des données' by Fondation CFM pour la Recherche-ENS and of Fondation L'Oréal For Women In Science. I also thank the Kavli Institute for Theoretical Physics, where part of this work was written.

Appendix A.: Index of notations and abbreviations

[$N$]: Set of integers from 1 to N
δ(⋅): Dirac distribution
$\sigma \left(x\right)={\left(1+{\mathrm{e}}^{-x}\right)}^{-1}$: Sigmoid
$\mathrm{r}\mathrm{e}\mathrm{l}\mathrm{u}\left(x\right)=\mathrm{m}\mathrm{a}\mathrm{x}\left(0,x\right)$: Rectified linear unit
$\underline{\underline{X}}$: Matrix
$\underline{x}$: Vector
${\underline{\underline{I}}}_{N}\in {\mathbb{R}}^{N{\times}N}$: Identity matrix
⟨⋅⟩: Average with respect to the Boltzmann distribution
$\mathrm{O}\left(N\right)\subset {\mathbb{R}}^{N{\times}N}$: Orthogonal ensemble
1RSB: 1-step replica symmetry breaking
AMP: Approximate message passing
BP: Belief propagation
cal-AMP: Calibration approximate message passing
CD: Contrastive divergence
CS: Compressed sensing
CSP: Constraint satisfaction problem
DAG: Directed acyclic graph
DBM: Deep Boltzmann machine
EC: Expectation consistency
EP: Expectation propagation
GAMP: Generalized approximate message passing
GAN: Generative adversarial network
GD: Gradient descent
GLM: Generalized linear model
G-VAMP: Generalized vector approximate message passing
i.i.d.: Independent identically distributed
PCD: Persistent contrastive divergence
r-BP: Relaxed belief propagation
RS: Replica symmetric
RSB: Replica symmetry breaking
RBM: Restricted Boltzmann machine
SE: State evolution
SGD: Stochastic gradient descent
SK: Sherrington–Kirkpatrick
TAP: Thouless–Anderson–Palmer
VAE: Variational autoencoder
VAMP: Vector approximate message passing

Appendix B.: Statistical models representations

Graphical representations have been developed to represent and efficiently exploit (in)dependencies between random variables encoded in joint probability distributions. They are useful tools to concisely present the model under scrutiny, as well as direct support for some derivations of inference procedures. Let us briefly present two types of graphical representations.

Probabilistic graphical models. Formally, a probabilistic graphical model is a graph G = (V, E) with nodes V representing random variables and edges E representing direct interactions between random variables. In many statistical models of interest, it is not necessary to keep track of all the possible combinations of realizations of the variables as the joint probability distribution can be broken up into factors involving only subsets of the variables. The structure of the connections E reflects this factorization.

There are two main types of probabilistic graphical models: directed graphical models (or Bayesian networks) and undirected graphical models (or Markov random fields). They allow one to represent different independencies and factorizations. In the next paragraphs we provide intuitions and recall some useful properties of graphical models; a good reference to understand all the facets of this powerful tool is [KF09].

Undirected graphical models. In undirected graphical models the direct interaction between a subset of variables C ⊆ V is represented by undirected edges interconnecting each pair in C. This fully connected subgraph is called a clique and is associated with a real positive potential function ψC over the variables ${\underline{x}}_{C}={\left\{{x}_{i}\right\}}_{i\in C}$ carried by C. The joint distribution over all the variables carried by V, ${\underline{x}}_{V}$, is the normalized product of all potentials:

$p\left({\underline{x}}_{V}\right)=\frac{1}{\mathcal{Z}}\prod _{C}{\psi }_{C}\left({\underline{x}}_{C}\right).$   (B.1)

Example (i): the restricted Boltzmann machine,

Equation (B.2)

with factorized px and pt is handily represented using an undirected graphical model depicted in figure 2a. The corresponding set of cliques is the set of all the pairs with one input unit (indexed by i = 1⋯N) and one hidden unit (indexed by α = 1⋯M), joined with the set of all single units. The potential functions are immediately recovered from (B.2),

Equation (B.3)

Equation (B.4)

It belongs to the subclass of pairwise undirected graphical models for which the size of the cliques is at most two.

Undirected graphical models handily encode conditional independencies. Let A, B, S ⊆ V be three disjoint subsets of nodes of G. A and B are said to be independent given S if p(A, B|S) = p(A|S)p(B|S). In the graph representation this corresponds to cases where S separates A and B: there is no path between any node in A and any node in B that does not go through S.

Example (i): In the RBM, hidden units are independent given the inputs, and conversely:

$p\left(\underline{t}\vert \underline{x}\right)=\prod _{\alpha =1}^{M}p\left({t}_{\alpha }\vert \underline{x}\right),\qquad p\left(\underline{x}\vert \underline{t}\right)=\prod _{i=1}^{N}p\left({x}_{i}\vert \underline{t}\right).$   (B.5)

This property is easily spotted by noticing that the graphical model (figure 2a) is bipartite.

Directed graphical model. A directed graphical model uses a directed acyclic graph (DAG), specifying directed edges E between the random variables V. It induces an ordering of the random variables and in particular the notion of parent nodes πi ⊆ V of any given vertex i ∈ V: the set of vertices j such that (j → i) ∈ E. The overall joint probability distribution factorizes as

$p\left({\underline{x}}_{V}\right)=\prod _{i\in V}p\left({x}_{i}\vert {\underline{x}}_{{\pi }_{i}}\right).$   (B.6)

Example (ii): The stochastic single layer feed forward network $\underline{y}=g\left(\underline{\underline{W}}\underline{x};\underline{{\epsilon}}\right)$, where $g\left(\cdot ;\underline{{\epsilon}}\right)$ is a function applied component-wise including a stochastic noise $\underline{{\epsilon}}$ that is equivalent to a conditional distribution ${p}_{\text{out}}\left(\underline{y}\vert \underline{\underline{W}}\underline{x}\right)$, and where inputs and weights are respectively drawn from distributions ${p}_{x}\left(\underline{x}\right)$ and ${p}_{W}\left(\underline{\underline{W}}\right)$, has a joint probability distribution

$p\left(\underline{y},\underline{x},\underline{\underline{W}}\right)={p}_{\text{out}}\left(\underline{y}\vert \underline{\underline{W}}\,\underline{x}\right){p}_{x}\left(\underline{x}\right){p}_{W}\left(\underline{\underline{W}}\right),$   (B.7)

precisely following such a factorization. It can be represented with a three-node DAG as in figure 7. Here we applied the definition at the level of vector/matrix valued random variables. By further assuming that pout, pW and px factorize over their components, we keep a factorization compatible with a DAG representation

Equation (B.8)

For the purpose of reasoning, it may sometimes be necessary to go to the finest level of decomposition, while at other times the coarse-grained level is sufficient.

Figure 7. Left: directed graphical model for $p\left(\underline{x},\underline{y},\underline{\underline{W}}\right)$ without assumptions of factorizations for the channel and priors. Right: directed graphical model reflecting factorization assumptions for $p\left(\underline{x},\underline{y}\vert \underline{\underline{W}}\right)$.

While a statistical physicist may have never been introduced to the formal definitions of graphical models, she inevitably already has drawn a few—for instance when considering the Ising model. She also certainly found them useful to guide physical intuitions. The following second form of graphical representation is probably newer to her.

Factor graph representations. Alternatively, high-dimensional joint distributions can be represented with factor graphs, which are undirected bipartite graphs G = (V, F, E) with two subsets of nodes: the variable nodes V representing the random variables, as in the previous section (circles in the representation), and the factor nodes F representing the interactions (squares in the representation) associated with potentials. The edge (iμ) between a variable node i and a factor node μ exists if the variable i participates in the interaction μ. We note ∂i the set of factor nodes in which variable i is involved; they are the neighbors of i in G. Equivalently we note ∂μ the neighbors of factor μ in G; they carry the arguments ${\left\{{x}_{i}\right\}}_{i\in \partial \mu }$, shortened as ${\underline{x}}_{\partial \mu }$, of the potential ψμ. The overall distribution is recovered in the factorized form:

$p\left({\underline{x}}_{V}\right)=\frac{1}{\mathcal{Z}}\prod _{\mu \in F}{\psi }_{\mu }\left({\underline{x}}_{\partial \mu }\right).$   (B.9)

Compared to an undirected graphical model, the cliques are represented by the introduction of factor nodes.
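For reference, in this notation the sum-product updates of BP (discussed in section 4.3) take the generic form, up to normalization,

${m}_{\mu \to i}\left({x}_{i}\right)\propto \int \prod _{{i}^{\prime }\in \partial \mu {\backslash}i}\mathrm{d}{x}_{{i}^{\prime }}\;{\psi }_{\mu }\left({\underline{x}}_{\partial \mu }\right)\prod _{{i}^{\prime }\in \partial \mu {\backslash}i}{\tilde {m}}_{{i}^{\prime }\to \mu }\left({x}_{{i}^{\prime }}\right),\qquad {\tilde {m}}_{i\to \mu }\left({x}_{i}\right)\propto \prod _{{\mu }^{\prime }\in \partial i{\backslash}\mu }{m}_{{\mu }^{\prime }\to i}\left({x}_{i}\right),$

with messages flowing from factors to variables and from variables to factors along the edges of the factor graph.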

Examples: the factor-graph representation of the RBM (i) is not much more informative than the pairwise undirected graph (see figure 2a). For the feed forward neural network (ii) we draw the factor graph of $p\left(\underline{y},\underline{x}\vert \underline{\underline{W}}\right)$ (see figure 2b).

Appendix C.: Mean-field identity

We derive the exact identity (42) for the fully connected Ising model with binary spins $\underline{x}\in {\left\{0,1\right\}}^{N}$,

$\langle {x}_{i}\rangle =\frac{1}{\mathcal{Z}}\sum _{{\underline{x}}_{{\backslash}i}}{\mathrm{e}}^{-\beta E\left({\underline{x}}_{{\backslash}i}\right)}\sum _{{x}_{i}\in \left\{0,1\right\}}{x}_{i}\,{\mathrm{e}}^{\beta {x}_{i}\left({b}_{i}+{\sum }_{j\ne i}{W}_{ij}{x}_{j}\right)},$   (C.1)

where ${\underline{x}}_{{\backslash}i}$ is the vector of $\underline{x}$ without its ith component and $E\left({\underline{x}}_{{\backslash}i}\right)$ collects the terms of the energy that do not involve xi. Now,

$\frac{{\sum }_{{x}_{i}\in \left\{0,1\right\}}{x}_{i}\,{\mathrm{e}}^{\beta {x}_{i}\left({b}_{i}+{\sum }_{j\ne i}{W}_{ij}{x}_{j}\right)}}{{\sum }_{{x}_{i}\in \left\{0,1\right\}}{\mathrm{e}}^{\beta {x}_{i}\left({b}_{i}+{\sum }_{j\ne i}{W}_{ij}{x}_{j}\right)}}=\sigma \left(\beta \left({b}_{i}+\sum _{j\ne i}{W}_{ij}{x}_{j}\right)\right),$   (C.2)

so that multiplying and dividing (C.1) by the denominator above we obtain the identity (42) in section 4.1

$\langle {x}_{i}\rangle =\left\langle \sigma \left(\beta \left({b}_{i}+\sum _{j\ne i}{W}_{ij}{x}_{j}\right)\right)\right\rangle .$   (C.3)
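Assuming the fluctuations of the local field around its mean can be neglected, i.e. taking the average inside the non-linearity, this exact identity yields the naive mean-field self-consistency equations of section 4.1 for the magnetizations mi = ⟨xi⟩,

${m}_{i}=\sigma \left(\beta \left({b}_{i}+\sum _{j\ne i}{W}_{ij}{m}_{j}\right)\right).$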

Appendix D.: Georges–Yedidia expansion for generalized Boltzmann machines

We here present a derivation of the Georges–Yedidia expansion for real-valued degrees of freedom on the example of a Boltzmann machine, as in [TGM+18]. Formally we consider $\underline{x}\in {\mathbb{R}}^{N}$ governed by the energy function and parametrized distribution

Equation (D.1)

where px(xi; θi) is an arbitrary prior distribution with parameter θi. For a Bernoulli prior with parameter σ(βbi) we recover the measure of binary Boltzmann machines. However we choose here a prior that does not depend on the temperature a priori. We now derive the expansion for this general case following the outline discussed in section 4.2.1, and highlighting the differences with the binary case.

Note that inference in the generalized fully connected Boltzmann machine is somewhat related to the symmetric rank-one matrix factorization problem, which also features pairwise interactions. Similarly, inference for the bipartite RBM maps to asymmetric rank-one matrix factorization. However, in contrast to Boltzmann inference, these factorizations are reconstruction problems. The mean-field techniques derived in [LKZ16, LKZ17] there allow the computation of the MMSE estimator of unknown signals from approximate marginals. Here we focus on the evaluation of the free energy.

Minimization for fixed marginals. While fixing the value of the first moment is sufficient for binary variables, more than one constraint is now needed in order to minimize the Gibbs free energy at a given value of the marginals. In the same spirit as the AMP algorithm, we assume a Gaussian parametrization of the marginals. We note $\underline{a}$ the first moment of $\underline{x}$ and $\underline{c}$ its variance. We wish to compute the constrained minimum over the distributions q on ${\mathbb{R}}^{N}$

Equation (D.2)

where the notation of squared vectors corresponds here and below to the vectors of squared entries. It is equivalent to an unconstrained problem with Lagrange multipliers $\underline{\lambda }\left(\underline{a},\underline{c},\beta \right)$ and $\underline{\xi }\left(\underline{a},\underline{c},\beta \right)$

Equation (D.3)

The terms depending on the distribution q in the functional to minimize above can be interpreted as a Gibbs free energy for the effective energy functional

Equation (D.4)

The solution of the minimization problem (D.3) is therefore the corresponding Boltzmann distribution

Equation (D.5)

and the minimum $G\left(\underline{a},\underline{c}\right)$ is

Equation (D.6)

where the Lagrange multipliers $\underline{\lambda }\left(\underline{a},\underline{c},\beta \right)$ and $\underline{\xi }\left(\underline{a},\underline{c},\beta \right)$ enforcing the constraints are still implicit. Defining a functional $\tilde {G}$ for arbitrary vectors $\tilde {\underline{\lambda }}\in {\mathbb{R}}^{N}$ and $\underline{\tilde {\xi }}\in {\mathbb{R}}^{N}$,

Equation (D.7)

we have

Equation (D.8)

Equation (D.9)

Hence the Lagrange multipliers are identified as minimizers of $-\beta \tilde {G}$ and

Equation (D.10)

The true free energy $F=-\mathrm{log}\;\mathcal{Z}/\beta $ would eventually be recovered by minimizing the constrained minimum $G\left(\underline{a},\underline{c}\right)$ with respect to its arguments. Nevertheless, the computation of G and $\tilde {G}$ involves an integration over $\underline{x}\in {\mathbb{R}}^{N}$ and remains intractable. The next step of the Georges–Yedidia derivation consists in approximating these functionals by a Taylor expansion at infinite temperature, where interactions are neutralized.

Expansion around β = 0. To perform the expansion we introduce the notation $A\left(\beta ,\underline{a},\underline{c}\right)=-\beta G\left(\underline{a},\underline{c}\right)$. We also define the auxiliary operator

Equation (D.11)

that allows us to write concisely, for any observable O, the derivative of its average with respect to β:

Equation (D.12)

To compute the derivatives of $\underline{\lambda }$ and $\underline{\xi }$ with respect to β we note that

Equation (D.13)

Equation (D.14)

where we used that $\partial \tilde {G}/\partial \tilde {{\lambda }_{i}}=0$ and $\partial \tilde {G}/\partial {\tilde {\xi }}_{i}=0$ when evaluated for $\underline{\lambda }\left(\underline{a},\underline{c},\beta \right)$ and $\underline{\xi }\left(\underline{a},\underline{c},\beta \right)$. Consequently,

Equation (D.15)

We can now proceed to compute the first terms of the expansion that will be performed for the functional A.

Zeroth order. Substituting β = 0 in the definition of A we have

Equation (D.16)

with

Equation (D.17)

Equation (D.18)

At infinite temperature the interaction terms of the energy do not contribute, so that the integral in ${\tilde {\mathcal{Z}}}^{0}$ factorizes and can be evaluated numerically in the event that it does not have a closed form.

First order. We compute the derivative of A with respect to β. We use again that $\underline{\lambda }\left(\underline{a},\underline{c},\beta \right)$ and $\underline{\xi }\left(\underline{a},\underline{c},\beta \right)$ are stationary points of $\tilde {G}$ to write

Equation (D.19)

Equation (D.20)

Equation (D.21)

At infinite temperature the average over the product of variables becomes a product of averages so that we have

Equation (D.22)

Second order. Using the first-order derivative of A we can compute the derivatives of the Lagrange parameters (D.15) and the auxiliary operator at infinite temperature.

The second order derivative is then easily computed at infinite temperature

Equation (D.23)

Equation (D.24)

TAP free energy for the generalized Boltzmann machine. Stopping at the second order of the systematic expansion, and gathering the different terms derived above we have

Equation (D.25)

where the values of the parameters $\underline{\lambda }\left(0,\underline{a},\underline{c}\right)$ and $\underline{\xi }\left(0,\underline{a},\underline{c}\right)$ are implicitly defined through the stationarity conditions (D.8) and (D.9). The TAP approximation of the free energy also requires considering the stationary points of the expanded expression as a function of $\underline{a}$ and $\underline{c}$.

This second condition yields the relations

Equation (D.26)

Equation (D.27)

where we define the new variables Ai and Bi. The extremization with respect to the Lagrange multipliers then gives

Equation (D.28)

Equation (D.29)

where we introduce the update functions ${f}_{1}^{x}$ and ${f}_{2}^{x}$ associated with the partition function

Equation (D.30)
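Up to notational details, these should be the usual scalar input functions of AMP, i.e., with conventions matching those of [TGM+18],

${\mathcal{Z}}^{x}\left(B,A;\theta \right)=\int \mathrm{d}x\;{p}_{x}\left(x;\theta \right)\,{\mathrm{e}}^{Bx-A{x}^{2}/2},\qquad {f}_{1}^{x}=\frac{\partial \;\mathrm{log}\;{\mathcal{Z}}^{x}}{\partial B},\qquad {f}_{2}^{x}=\frac{\partial {f}_{1}^{x}}{\partial B},$

that is, the mean and variance of a scalar variable with prior px reweighted by a Gaussian local field.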

Finally we can rewrite the TAP free energy as

Equation (D.31)

with the values of the parameters set by the self-consistency conditions (D.26)–(D.29), which are the TAP equations of the generalized Boltzmann machine at second order. Note that the naive mean-field equations are recovered by ignoring the second-order terms in β².

Relation to message passing. The TAP equations obtained above must correspond to the fixed points of the approximate message passing (AMP) algorithm following the derivation from belief propagation (BP) presented in section 4.3.3. In appendix B of [TGM+18] the relaxed-BP equations are derived for the generalized Boltzmann machine:

Equation (D.32)

Equation (D.33)

To recover the TAP equations from these messages we define

Equation (D.34)

Equation (D.35)

As ${B}_{i}^{\left(t\right)}={B}_{i\to j}^{\left(t\right)}+\beta {W}_{ij}{a}_{j\to i}^{\left(t\right)}$ and ${A}_{i}^{\left(t\right)}={A}_{i\to j}^{\left(t\right)}-{\beta }^{2}{W}_{ij}^{2}{c}_{j\to i}^{\left(t\right)}$, we have by developing ${f}_{2}^{x}$ that ${c}_{i}^{\left(t\right)}={c}_{i\to j}^{\left(t\right)}+O\left(\beta \right)$, so that

Equation (D.36)

By developing ${f}_{1}^{x}$ we also have

Equation (D.37)

Equation (D.38)

with $\frac{\partial {f}_{1}^{x}}{\partial {B}_{k}}\left({B}_{k}^{\left(t-1\right)},{A}_{k}^{\left(t-1\right)};{\theta }_{k}\right)={c}_{k}^{\left(t\right)}$. Finally, by replacing the messages in the definition of Bi, we obtain

Equation (D.39)

As ${a}_{i\to k}^{\left(t-1\right)}={a}_{i}^{\left(t-1\right)}+O\left(\beta \right)$ and using the definition of ${A}_{i}^{\left(t\right)}$, we finally recover

Equation (D.40)

Hence we indeed recover the TAP equations as the fixed points of the AMP in (D.35), (D.36) and (D.40). Beyond the possibility to cross-check our results, the message passing derivation also specifies a scheme of updates to solve the self-consistency equations obtained by the Georges–Yedidia expansion. In the applications we consider, we resort to this time indexing, which has good convergence properties [Bol14].
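A minimal sketch of this scheme of updates, with the time indices of (D.35)–(D.40), could look as follows; f1 and f2 stand for implementations of the update functions of (D.30), and signs and scalings should be checked against the conventions above.

    import numpy as np

    def tap_amp_boltzmann(W, theta, f1, f2, beta=1.0, n_iter=100, seed=0):
        # AMP-style time indexing for the second-order TAP equations of appendix D.
        # W: (N, N) symmetric coupling matrix; f1, f2: input update functions.
        rng = np.random.default_rng(seed)
        N = W.shape[0]
        a = rng.uniform(-0.1, 0.1, N)       # magnetizations a_i
        c = np.ones(N)                      # variances c_i
        a_old = np.zeros(N)
        for _ in range(n_iter):
            A = beta**2 * (W**2) @ c        # precision, second-order term
            B = beta * (W @ a) - A * a_old  # local field with Onsager memory term
            a_old = a
            a = f1(B, A, theta)
            c = f2(B, A, theta)
        return a, c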

Solutions of the TAP equations. As already discussed in section 4.2, the TAP equations do not necessarily admit a single solution. In practice, different fixed points are reached when considering different initializations of the iteration of the self-consistent equations.

Appendix E.: Vector approximate message passing for the GLM

We recall here a possible derivation of G-VAMP discussed in section 4 (algorithm 2). We consider a projection of the BP equations for the factor graph of figure 8.

Figure 8. Factor graph representation of the GLM for the derivation of VAMP (reproduction of figure 5 to help following the derivation described here in the appendix).

Gaussian assumptions. We start by parametrizing marginals as well as messages coming out of the Dirac factors. For a = 1, 2:

Equation (E.1)

and

Equation (E.2)

Equation (E.3)

Self-consistency of the parametrizations at Dirac factor nodes. Around ψx the message passing equations are simply

Equation (E.4)

and similarly around ψz. Moreover, considering that messages are marginals from which the contribution of the opposite message is subtracted, we have

Equation (E.5)

Equation (E.6)

Combining this observation along with (E.4) leads to updates (135) and (139). The same reasoning can be followed for the messages around ψz leading to updates (137) and (141).

Input and output update functions. The update functions of means and variances of the marginals are deduced from the parametrized message passing. For the variable ${\underline{x}}^{\left(1\right)}$ taking into account the prior px, the updates are very similar to GAMP input functions:

Equation (E.7)

Equation (E.8)

Equation (E.9)

Equation (E.10)

where ${\mathcal{Z}}_{x}^{\left(1\right)}$ is, as usual, the partition function ensuring normalization.

Similarly for the variable ${\underline{z}}^{\left(1\right)}$, the update functions are very similar to the GAMP output functions including the information coming from the observations:

Equation (E.11)

Equation (E.12)

Equation (E.13)

Equation (E.14)

Linear transformation. For the middle factor node we consider the concatenated vector variable $\overline{\underline{x}}=\left[{\underline{x}}^{\left(2\right)}\enspace {\underline{z}}^{\left(2\right)}\right]\in {\mathbb{R}}^{N+M}$. The computation of the corresponding marginal with the message passing then yields

Equation (E.15)

The means of ${\underline{x}}^{\left(2\right)}$ and ${\underline{z}}^{\left(2\right)}$ are then updated through

Equation (E.16)

at Δ → 0. At this point it is advantageous in terms of speed to consider the singular value decomposition $\underline{\underline{W}}=\underline{\underline{U}}\,\underline{\underline{S}}\,{\underline{\underline{V}}}^{\top }$ and to simplify the form of the variance matrices by taking them proportional to the identity, i.e. ${\underline{\underline{A}}}_{z}^{\left(2\right)}={A}_{z}^{\left(2\right)}{\underline{\underline{I}}}_{M}$ etc. Under this assumption the solution of the minimization problem is

Equation (E.17)

Equation (E.18)

with $\underline{\underline{D}}$ a diagonal matrix with entries ${D}_{ii}={\left({{A}_{z}^{\left(2\right)}}^{-1}{S}_{ii}^{2}+{{A}_{x}^{\left(2\right)}}^{-1}\right)}^{-1}$. The scalar variances are then updated using the traces of the Jacobians with respect to the ${\underline{B}}^{\left(2\right)}$s

Equation (E.19)

Equation (E.20)

Equation (E.21)

Equation (E.22)
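To fix ideas, the linear step solved by (E.17) and (E.18) is the minimization of a quadratic form; without the SVD speed-up it can be sketched directly as below, where the field/precision parametrization (B, A) of the messages and the function names are our assumptions.

    import numpy as np

    def vamp_linear_step(W, B_x, A_x, B_z, A_z):
        # LMMSE step at the linear factor: minimize over x, with z = W x,
        #   A_x/2 ||x||^2 - B_x.x + A_z/2 ||z||^2 - B_z.z .
        M, N = W.shape
        H = A_x * np.eye(N) + A_z * W.T @ W   # posterior precision of x
        H_inv = np.linalg.inv(H)
        x_hat = H_inv @ (B_x + W.T @ B_z)     # posterior mean of x
        z_hat = W @ x_hat                     # posterior mean of z
        C_x = np.trace(H_inv) / N             # scalar variance for x
        C_z = np.trace(W @ H_inv @ W.T) / M   # scalar variance for z
        return x_hat, C_x, z_hat, C_z

Precomputing the SVD of W turns the matrix inversion into elementwise operations on the singular values, which is the speed-up exploited in (E.17)–(E.22).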

Appendix F.: Multi-value AMP derivation for the GLM

We here present the derivation of the multi-value AMP and its SE motivated in section 5.3, focusing on the multi-value GLM. These derivations also appear in [GBKZ19].

F.1. Approximate message passing

The systematic procedure to write AMP for a given joint probability distribution consists in first writing BP on the factor graph, second projecting the messages onto a parametrized family of functions to obtain the corresponding relaxed-BP, and third closing the equations on a reduced set of parameters by keeping only the leading terms in the thermodynamic limit.

For the generic multi-value GLM the posterior measure we are interested in is

$p\left(\underline{\underline{X}}\vert \underline{\underline{Y}},\underline{\underline{W}}\right)=\frac{1}{\mathcal{Z}}\prod _{i=1}^{N}{p}_{x}\left({\underline{x}}_{i}\right)\prod _{\mu =1}^{M}{p}_{\text{out}}\left({\underline{y}}_{\mu }\;\Big\vert \;\frac{1}{\sqrt{N}}{\underline{w}}_{\mu }^{\top }\underline{\underline{X}}\right),$   (F.1)

where the known entries of the matrix $\underline{\underline{W}}$ are drawn i.i.d. from a standard normal distribution (the scaling in $1/\sqrt{N}$ is here made explicit). The corresponding factor graph is given in figure 9. We consider the simultaneous reconstruction of P signals ${\underline{x}}_{0,\left(k\right)}\in {\mathbb{R}}^{N}$ and therefore write the message passing on the variables ${\underline{x}}_{i}\in {\mathbb{R}}^{P}$. The major difference with the scalar version (P = 1) of AMP is that we will consider covariance matrices between variables coming from the P observations instead of scalar variances.

Figure 9. Factor graph of the generalized linear model (GLM) on vector variables corresponding to the joint distribution (248).

Belief propagation (BP). We start with BP on the factor graph of figure 9. For each pair of indices (iμ), we define the update equations for the message functions

${\tilde {m}}_{\mu \to i}^{\left(t\right)}\left({\underline{x}}_{i}\right)=\frac{1}{{\mathcal{Z}}_{\mu \to i}}\int \prod _{{i}^{\prime }\ne i}\mathrm{d}{\underline{x}}_{{i}^{\prime }}\;{p}_{\text{out}}\left({\underline{y}}_{\mu }\;\Big\vert \;\sum _{{i}^{\prime }}\frac{{W}_{\mu {i}^{\prime }}}{\sqrt{N}}{\underline{x}}_{{i}^{\prime }}\right)\prod _{{i}^{\prime }\ne i}{m}_{{i}^{\prime }\to \mu }^{\left(t\right)}\left({\underline{x}}_{{i}^{\prime }}\right),$   (F.2)

${m}_{i\to \mu }^{\left(t+1\right)}\left({\underline{x}}_{i}\right)=\frac{1}{{\mathcal{Z}}_{i\to \mu }}\;{p}_{x}\left({\underline{x}}_{i}\right)\prod _{{\mu }^{\prime }\ne \mu }{\tilde {m}}_{{\mu }^{\prime }\to i}^{\left(t\right)}\left({\underline{x}}_{i}\right),$   (F.3)

where ${\mathcal{Z}}_{\mu \to i}$ and ${\mathcal{Z}}_{i\to \mu }$ are normalization functions that allow the messages to be interpreted as probability distributions. To improve readability, we drop the time indices in the following derivation and only specify them in the final algorithm.

Relaxed BP (r-BP). The second step of the derivation is to develop the messages, keeping only terms up to order O(1/N), as we take the thermodynamic limit N → +∞ (at fixed α = M/N). At this order, we will find that it is consistent to consider the messages to be approximately Gaussian, i.e. characterized by their means and covariances. Thus we define

${\hat{\underline{x}}}_{i\to \mu }=\int \mathrm{d}{\underline{x}}_{i}\;{\underline{x}}_{i}\,{m}_{i\to \mu }\left({\underline{x}}_{i}\right),$   (F.4)

${\underline{\underline{{C}^{x}}}}_{i\to \mu }=\int \mathrm{d}{\underline{x}}_{i}\;{\underline{x}}_{i}{\underline{x}}_{i}^{\top }\,{m}_{i\to \mu }\left({\underline{x}}_{i}\right)-{\hat{\underline{x}}}_{i\to \mu }{\hat{\underline{x}}}_{i\to \mu }^{\top },$   (F.5)

and

${\underline{\omega }}_{\mu \to i}=\sum _{{i}^{\prime }\ne i}\frac{{W}_{\mu {i}^{\prime }}}{\sqrt{N}}\,{\hat{\underline{x}}}_{{i}^{\prime }\to \mu },$   (F.6)

${\underline{\underline{V}}}_{\mu \to i}=\sum _{{i}^{\prime }\ne i}\frac{{W}_{\mu {i}^{\prime }}^{2}}{N}\,{\underline{\underline{{C}^{x}}}}_{{i}^{\prime }\to \mu },$   (F.7)

where ${\underline{\omega }}_{\mu \to i}$ and ${\underline{\underline{V}}}_{\mu \to i}$ are related to the intermediate variable ${\underline{z}}_{\mu }={\underline{w}}_{\mu }^{\top }\underline{\underline{X}}$.

Expansion of ${\tilde {m}}_{\mu \to i}$. We define the Fourier transform ${\hat{p}}_{\text{out}}$ of ${p}_{\mathrm{o}\mathrm{u}\mathrm{t}}\left({\underline{y}}_{\mu }\vert {\underline{z}}_{\mu }\right)$ with respect to its argument ${\underline{z}}_{\mu }={\underline{w}}_{\mu }^{\top }\underline{\underline{X}}$,

Equation (F.8)

Using reciprocally the Fourier representation of ${p}_{\mathrm{o}\mathrm{u}\mathrm{t}}\left({\underline{y}}_{\mu }\vert {\underline{z}}_{\mu }\right)$,

Equation (F.9)

we decouple the integrals over the different ${\underline{x}}_{{i}^{\prime }}$ in (F.2),

Equation (F.10)

Equation (F.11)

where developing the exponentials of the product in (F.2) allows us to express the integrals over the ${\underline{x}}_{{i}^{\prime }}$ as functions of the definitions (F.6) and (F.7), before re-exponentiating to obtain the final result (F.11). Now, reversing the Fourier transform and performing the integral over $\underline{\xi }$, we can further rewrite

Equation (F.12)

Equation (F.13)

where we are led to introduce the output update functions,

Equation (F.14)

Equation (F.15)

Equation (F.16)

where $\mathcal{N}\left(\underline{z};\underline{\omega },\underline{\underline{V}}\right)$ is the multivariate Gaussian distribution of mean $\underline{\omega }$ and covariance $\underline{\underline{V}}$. Further expanding the exponential in (F.13) up to order O(1/N) leads to the Gaussian parametrization

Equation (F.17)

Equation (F.18)

with

Equation (F.19)

Equation (F.20)

Consistency with miμ. Inserting the Gaussian approximation of ${\tilde {m}}_{\mu \to i}$ in the definition of miμ, we get the parametrization

Equation (F.21)

with

Equation (F.22)

Equation (F.23)

Closing the equations. Ensuring the consistency with the definitions (F.4) and (F.5) of mean and covariance of miμ we finally close our set of equations by defining the input update functions,

Equation (F.24)

Equation (F.25)

Equation (F.26)

so that

Equation (F.27)

Equation (F.28)

The closed set of equations (F.6), (F.7), (F.18), (F.19), (F.21), (F.22), (F.26) and (F.27), with restored time indices, defines the r-BP algorithm. At convergence of the iterations, we obtain the approximated marginals

Equation (F.29)

with

Equation (F.30)

Equation (F.31)

As usual, while BP requires following iterations over M × N message distributions over vectors in ${\mathbb{R}}^{P}$, r-BP only requires tracking O(M × N × P) variables, which is a great simplification. Nonetheless, r-BP can be further reduced to the more practical GAMP algorithm, given the scaling of the weights in $O\left(1/\sqrt{N}\right)$.

Approximate message passing. We define the parameters ${\underline{\omega }}_{\mu }$, ${\underline{\underline{V}}}_{\mu }$, ${\hat{\underline{x}}}_{i}$, ${\underline{\underline{{C}^{x}}}}_{i}$, and likewise ${\underline{\lambda }}_{i}$ and ${\underline{\underline{\sigma }}}_{i}$, as above, and consider their relations to the original ${\underline{\lambda }}_{i\to \mu }$, ${\underline{\underline{\sigma }}}_{i\to \mu }$, ${\underline{\omega }}_{\mu \to i}$, ${\underline{\underline{V}}}_{\mu \to i}$, ${\hat{\underline{x}}}_{i\to \mu }$ and ${\underline{\underline{{C}^{x}}}}_{i\to \mu }$. As a result we obtain the vectorized AMP for the GLM presented in algorithm 3. Note that, similarly to GAMP, relaxing the Gaussian assumption on the weight matrix entries to any distribution with finite second moment yields the same algorithm by the central limit theorem.

F.2. State Evolution

We consider the limit N → +∞ at fixed α = M/N and a quenched average over the disorder (here the realizations of ${\underline{\underline{X}}}_{0}$, s0, $\underline{\underline{Y}}$ and $\underline{\underline{W}}$) to derive a state evolution analysis of the previously derived AMP. To this end, our starting point will be the r-BP equations.

F.2.1. State evolution derivation in mismatched prior and channel setting

Definition of the overlaps. The important quantities for following the dynamics of the iterations and the fixed points of AMP are the overlaps. Here, they are the P × P matrices

$\underline{\underline{q}}=\underset{N\to \infty }{\mathrm{lim}}\frac{1}{N}\sum _{i=1}^{N}{\hat{\underline{x}}}_{i}{\hat{\underline{x}}}_{i}^{\top },\qquad \underline{\underline{m}}=\underset{N\to \infty }{\mathrm{lim}}\frac{1}{N}\sum _{i=1}^{N}{\hat{\underline{x}}}_{i}{\underline{x}}_{0,i}^{\top },\qquad {\underline{\underline{q}}}_{0}=\underset{N\to \infty }{\mathrm{lim}}\frac{1}{N}\sum _{i=1}^{N}{\underline{x}}_{0,i}{\underline{x}}_{0,i}^{\top }.$   (F.32)

Output parameters. Under independent statistics of the entries of $\underline{\underline{W}}$ and under the assumption of independent incoming messages, the variable ${\underline{\omega }}_{\mu \to i}$ defined in (F.6) is a sum of independent variables and follows a Gaussian distribution by the central limit theorem. Its first and second moments are

Equation (F.33)

Equation (F.34)

Equation (F.35)

where we used the facts that the Wμi are independent with zero mean, and that ${\underline{B}}_{\mu \to i}$, defined in (F.18), is of order $O\left(1/\sqrt{N}\right)$. Similarly, the variable ${\underline{z}}_{\mu \to i}={\sum }_{{i}^{\prime }\ne i}\frac{{W}_{\mu {i}^{\prime }}}{\sqrt{N}}{\underline{x}}_{{i}^{\prime }}$ is Gaussian with first and second moments

Equation (F.36)

Equation (F.37)

Furthermore, their covariance is

Equation (F.38)

Equation (F.39)

Equation (F.40)

Hence we find that, for all μ and all i, ${\underline{\omega }}_{\mu \to i}$ and ${\underline{z}}_{\mu \to i}$ are approximately jointly Gaussian in the thermodynamic limit, following a unique distribution $\mathcal{N}\left({\underline{z}}_{\mu \to i},{\underline{\omega }}_{\mu \to i};\;\underline{0},\;\underline{\underline{Q}}\right)$ with the block covariance matrix

Equation (F.41)

For the variance message ${\underline{\underline{V}}}_{\mu \to i}$, defined in (F.7), we have

Equation (F.42)

Equation (F.43)

where, using the developments (F.21) and (F.22) of ${\underline{\lambda }}_{i\to \mu }$ and ${\underline{\underline{\sigma }}}_{i\to \mu }$, along with the scaling of ${\underline{B}}_{\mu \to i}$ in $O\left(1/\sqrt{N}\right)$, we replaced

Equation (F.44)

Furthermore, we can check that

Equation (F.45)

meaning that all ${\underline{\underline{V}}}_{\mu \to i}$ concentrate on their identical mean in the thermodynamic limit, which we note

Equation (F.46)

Input parameters. Here we use the reparametrization trick to express ${\underline{y}}_{\mu }$ as a function g0(⋅) taking a noise ${\underline{{\epsilon}}}_{\mu }\sim {p}_{{\epsilon}}\left({\underline{{\epsilon}}}_{\mu }\right)$ as input: ${\underline{y}}_{\mu }={g}_{0}\left({\underline{w}}_{\mu }^{\top }{\underline{\underline{X}}}_{0},{\underline{{\epsilon}}}_{\mu }\right)$. Following (F.19), (F.18) and (F.28),

Equation (F.47)

Equation (F.48)

Equation (F.49)

The first term is again a sum of independent random variables, given that the Wμi are i.i.d. with zero mean and that the messages of type μ → i are assumed independent of them. The second term has non-zero mean and can be shown to concentrate. Finally, recalling that all ${\underline{\underline{V}}}_{\mu \to i}$ also concentrate on $\underline{\underline{V}}$, we obtain the distribution

Equation (F.50)

with

Equation (F.51)

Equation (F.52)

For the inverse variance ${\underline{\underline{\sigma }}}_{i}^{-1}$ one can check again that it concentrates on its mean

Equation (F.53)

Equation (F.54)

Closing the equations. These statistics of the input parameters must consistently ensure that

Equation (F.55)

Equation (F.56)

Equation (F.57)

which gives, upon expressing the expectations,

Equation (F.58)

Equation (F.59)

Equation (F.60)

The state evolution analysis of the GLM on the vector variables finally consists in iterating alternately the equations (F.50), (F.51), (F.53) and the equations (F.57), (F.58), (F.59) until convergence.

Performance analysis. The mean squared error (MSE) on the reconstruction of $\underline{\underline{X}}$ by the AMP algorithm is then predicted by

Equation (F.61)

where the scalar values used here correspond to the (unique) value of the diagonal elements of the corresponding overlap matrices. This MSE can be computed throughout the iterations of state evolution. Remarkably, the state evolution MSEs precisely follow the MSEs of the cal-AMP predictors along the iterations of the algorithm, provided the procedures are initialized consistently. A random initialization of ${\hat{\underline{x}}}_{i}$ in cal-AMP corresponds, in the state evolution, to an initialization of zero overlap m = 0, ν = 0, with the variance of the priors q = q0.

F.2.2. Bayes optimal state evolution

The SE equations can be greatly simplified in the Bayes optimal setting, where the statistical model used by the student (priors px and ps, and channel pout) is known to match the teacher's. In this case, the true unknown signal ${\underline{\underline{X}}}_{0}$ is in some sense statistically equivalent to the estimate $\underline{\underline{\hat{X}}}$ coming from the posterior. More precisely, one can prove the Nishimori identities [Iba99, Nis01, OH91] (see [KKM+16] for a concise demonstration and discussion), implying that

$\underline{\underline{q}}=\underline{\underline{m}},\qquad \underline{\underline{V}}={\underline{\underline{q}}}_{0}-\underline{\underline{m}}.$   (F.62)

As a result the state evolution reduces to a set of two equations

Equation (F.63)

Equation (F.64)

with the block covariance matrix

Equation (F.65)