A Survey of Stochastic Simulation and Optimization Methods in Signal Processing

Modern signal processing (SP) methods rely very heavily on probability and statistics to solve challenging SP problems. SP methods are now expected to deal with ever more complex models, requiring ever more sophisticated computational inference techniques. This has driven the development of statistical SP methods based on stochastic simulation and optimization. Stochastic simulation and optimization algorithms are computationally intensive tools for performing statistical inference in models that are analytically intractable and beyond the scope of deterministic inference methods. They have been recently successfully applied to many difficult problems involving complex statistical models and sophisticated (often Bayesian) statistical inference techniques. This survey paper offers an introduction to stochastic simulation and optimization methods in signal and image processing. The paper addresses a variety of high-dimensional Markov chain Monte Carlo (MCMC) methods as well as deterministic surrogate methods, such as variational Bayes, the Bethe approach, belief and expectation propagation and approximate message passing algorithms. It also discusses a range of optimization methods that have been adopted to solve stochastic problems, as well as stochastic methods for deterministic optimization. Subsequently, areas of overlap between simulation and optimization, in particular optimization-within-MCMC and MCMC-driven optimization are discussed.


I. INTRODUCTION
Modern signal processing (SP) methods, (we use SP here to cover all relevant signal and image processing problems), rely very heavily on probabilistic and statistical tools; for example, they use stochastic models to represent the data observation process and the prior knowledge available, and they obtain solutions by performing statistical inference (e.g., using maximum likelihood or Bayesian strategies). Statistical SP methods are, in particular, routinely applied to many and varied tasks and signal modalities, ranging from resolution enhancement of medical images to hyperspectral image unmixing; from user rating prediction to change detection in social networks; and from source separation in music analysis to speech recognition.
However, the expectations and demands on the performance of such methods are constantly rising. SP methods are now expected to deal with challenging problems that require ever more complex models, and more importantly, ever more sophisticated computational inference techniques. This has driven the development of computationally intensive SP methods based on stochastic simulation and optimization. Stochastic simulation and optimization algorithms are computationally intensive tools for performing statistical inference in models that are analytically intractable and beyond the scope of deterministic inference methods. They have been recently successfully applied to many difficult SP problems involving complex statistical models and sophisticated (often Bayesian) statistical inference analyses. These problems can generally be formulated as inverse problems involving partially unknown observation processes and imprecise prior knowledge, for which they delivered accurate and insightful results. These stochastic algorithms are also closely related to the randomized, variational Bayes and message passing algorithms that are pushing forward the state of the art in approximate statistical inference for very large-scale problems. The key thread that makes stochastic simulation and optimization methods appropriate for these applications is the complexity and high dimensionality involved. For example in the case of hyperspectral imaging the data being processed can involve images of 2048 by 1024 pixels across up to hundreds or thousands of wavelengths.
This survey paper offers an introduction to stochastic simulation and optimization methods in signal and image processing. The paper addresses a variety of high-dimensional Markov chain Monte Carlo (MCMC) methods as well as deterministic surrogate methods, such as variational Bayes, the Bethe ap-proach, belief and expectation propagation and approximate message passing algorithms. It also discusses a range of stochastic optimization approaches. Subsequently, areas of overlap between simulation and optimization, in particular optimization-within-MCMC and MCMC-driven optimization are discussed. Some methods such as sequential Monte Carlo methods or methods based on importance sampling are not considered in this survey mainly due to space limitations. This paper seeks to provide a survey of a variety of the algorithmic approaches in a tutorial fashion, as well as to highlight the state of the art, relationships between the methods, and potential future directions of research. In order to set the scene and inform our notation, consider an unknown random vector of interest x = [x 1 , . . . , x N ] T and an observed data vector y = [y 1 , . . . , y M ] T , related to x by a statistical model with likelihood function p(y|x; θ) potentially parametrized by a deterministic vector of parameters θ. Following a Bayesian inference approach, we model our prior knowledge about x with a prior distribution p(x; θ), and our knowledge about x after observing y with the posterior distribution p(x|y; θ) = p(y|x; θ)p(x; θ) where the normalising constant is known as the "evidence", model likelihood, or the partition function. Although the integral in (2) suggests that all x j are continuous random variables, we allow any random variable x j to be either continuous or discrete, and replace the integral with a sum as required.
In many applications, we would like to evaluate the posterior p(x|y; θ) or some summary of it, for instance point estimates of x such as the conditional mean (i.e., MMSE estimate) E{x|y; θ}, uncertainty reports such as the conditional variance var{x|y; θ}, or expected log statistics as used in the expectation maximization (EM) algorithm [1][2][3] θ (i+1) = arg max θ E{ln p(x, y; θ)|y; θ (i) } where the expectation is taken with respect to p(x|y; θ (i) ). However, when the signal dimensionality N is large, the integral in (2), as well as those used in the posterior summaries, are often computationally intractable. Hence, the interest in computationally efficient alternatives. An alternative that has received a lot of attention in the statistical SP community is maximum-a-posteriori (MAP) estimation. Unlike other posterior summaries, MAP estimates can be computed by finding the value of x maximizing p(x|y; θ), which for many models is significantly more computationally tractable than numerical integration. In the sequel, we will suppress the dependence on θ in the notation, since it is not of primary concern. The paper is organized as follows. After this brief introduction where we have introduced the basic notation adopted, Section II discusses stochastic simulation methods, and in particular a variety of MCMC methods. In Section III we discuss deterministic surrogate methods, such as variational Bayes, the Bethe approach, belief and expectation propagation, and provide a brief summary of approximate message passing algorithms. Section IV discusses a range of optimization methods for solving stochastic problems, as well as stochastic methods for solving deterministic optimization problems. Subsequently, in Section V we discuss areas of overlap between simulation and optimization, in particular the use of optimization techniques within MCMC algorithms and MCMC-driven optimization, and suggest some interesting areas worthy of exploration. Finally, in Section VI we draw together thoughts, observations and conclusions.

II. STOCHASTIC SIMULATION METHODS
Stochastic simulation methods are sophisticated random number generators that allow samples to be drawn from a user-specified target density π(x), such as the posterior p(x|y). These samples can then be used, for example, to approximate probabilities and expectations by Monte Carlo integration [4,Ch. 3]. In this section we will focus on Markov chain Monte Carlo (MCMC) methods, an important class of stochastic simulation techniques that operate by constructing a Markov chain with stationary distribution π. In particular, we concentrate on Metropolis-Hastings algorithms for highdimensional models (see [5] for a more general recent review of MCMC methods). It is worth emphasizing, however, that we do not discuss many other important approaches to simulation that also arise often in signal processing applications, such as "particle filters" or sequential Monte Carlo methods [6,7], and approximate Bayesian computation [8].
A cornerstone of MCMC methodology is the Metropolis-Hastings (MH) algorithm [4, Ch. 7][9, 10], a universal machine for constructing Markov chains with stationary density π. Given some generic proposal distribution x * ∼ q(·|x), the generic MH algorithm proceeds as follows.

Algorithm 1 Metropolis-Hastings algorithm (generic version)
Set an initial state x (0) for t = 1 to T do Generate a candidate state x * from a proposal q(·|x (t−1) ) Compute the acceptance probability Accept the candidate and set x (t) = x * else Reject the candidate and set Under mild conditions on q, the chains generated by Algo. 1 are ergodic and converge to the stationary distribution π [11,12]. An important feature of this algorithm is that computing the acceptance probabilities ρ (t) does not require knowledge of the normalization constant of π (which is often not available in Bayesian inference problems). The intuition for the MH algorithm is that the algorithm proposes a stochastic perturbation to the state of the chain and applies a carefully defined decision rule to decide if this perturbation should be accepted or not. This decision rule, given by the random accept-reject step in Algo. 1, ensures that at equilibrium the samples of the Markov chain have π as marginal distribution.
The specific choice of q will of course determine the efficiency and the convergence properties of the algorithm. Ideally one should choose q = π to obtain a perfect sampler (i.e., with candidates accepted with probability 1); this is of course not practically feasible since the objective is to avoid the complexity of directly simulating from π. In the remainder of this section we review strategies for specifying q for high-dimensional models, and discuss relative advantages and disadvantages. In order to compare and optimize the choice of q, a performance criterion needs to be chosen. A natural criterion is the stationary integrated autocorrelation time for some relevant scalar summary statistic g : R N → R, i.e., with x (0) ∼ π, and where Cor(·, ·) denotes the correlation operator. This criterion is directly related to the effective number of independent Monte Carlo samples produced by the MH algorithm, and therefore to the mean square error of the resulting Monte Carlo estimates. Unfortunately drawing conclusions directly from (4) is generally not possible because τ g is highly dependent on the choice of g, with different choices often leading to contradictory results. Instead, MH algorithms are generally studied asymptotically in the infinitedimensional model limit. More precisely, in the limit N → ∞, the algorithms can be studied using diffusion process theory, where the dependence on g vanishes and all measures of efficiency become proportional to the diffusion speed. The "complexity" of the algorithms can then be defined as the rate at which efficiency deteriorates as N → ∞, e.g., O(N ) (see [13] for an introduction to this topic and details about the relationship between the efficiency of MH algorithms and their average acceptance probabilities or acceptance rates) 1 . Finally, it is worth mentioning that despite the generality of this approach, there are some specific models for which conventional MH sampling is not possible because the computation of ρ (t) is intractable (e.g., when π involves an intractable function of x, such as the partition function of the Potts-Markov random field). This issue has received a lot of attention in the recent MCMC literature, and there are now several variations of the MH construction for intractable models [8,[14][15][16].

A. Random walk Metropolis-Hastings algorithms
The so-called random walk Metropolis-Hastings (RWMH) algorithm is based on proposals of the form x * = x (t−1) + w, 1 Notice that this measure of complexity of MCMC algorithms does not take into account the computational costs related to generating candidate states and evaluating their Metropolis-Hastings acceptance probabilities, which typically also scale at least linearly with the problem dimension N . where typically w ∼ N (0, Σ) for some positive-definite covariance matrix Σ [4,Ch. 7.5]. This algorithm is one of the most widely used MCMC methods, perhaps because it has very robust convergence properties and a relatively low computational cost per iteration. It can be shown that the RWMH algorithm is geometrically ergodic under mild conditions on π [17]. Geometric ergodicity is important because it guarantees a central limit theorem for the chains, and therefore that the samples can be used for Monte Carlo integration. However, the myopic nature of the random walk proposal means that the algorithm often requires a large number of iterations to explore the parameter space, and will tend to generate samples that are highly correlated, particularly if the dimension N is large (the performance of this algorithm generally deteriorates at rate O(N ), which is worse than other more advanced stochastic simulation MH algorithms [18]). This drawback can be partially mitigated by adjusting the proposal matrix Σ to approximate the covariance structure of π, and some adaptive versions of RWHM perform this adaptation automatically. For sufficiently smooth target densities, performance is further optimized by scaling Σ to achieve an acceptance probability of approximately 20% − 25% [18].

B. Metropolis adjusted Langevin algorithms
The Metropolis adjusted Langevin algorithm (MALA) is an advanced MH algorithm inspired by the Langevin diffusion process on R N , defined as the solution to the stochastic differential equation [19] dX(t) = 1 2 ∇ log π (X(t)) dt + dW (t), where W is the Brownian motion process on R N and x 0 ∈ R N denotes some initial condition. Under appropriate stability conditions, X(t) converges in distribution to π as t → ∞, and is therefore potentially interesting for drawing samples from π. Since direct simulation from X(t) is only possible in very specific cases, we consider a discrete-time forward Euler approximation to (5) given by where the parameter δ controls the discrete-time increment. Under certain conditions on π and δ, (6) produces a good approximation of X(t) and converges to a stationary density which is close to π. In MALA this approximation error is corrected by introducing an MH accept-reject step that guarantees convergence to the correct target density π. The resulting algorithm is equivalent to Algo. 1 above, with proposal By analyzing the proposal (7) we notice that, in addition to the Langevin interpretation, MALA can also be interpreted as an MH algorithm that, at each iteration t, draws a candidate from a local quadratic approximation to log π around x (t−1) , with δ −1 I N as an approximation to the Hessian matrix. In addition, the MALA proposal can also be defined using a matrix-valued time step Σ. This modification is related to redefining (6) in an Euclidean space with the inner product w, Σ −1 x [20]. Again, the matrix Σ should capture the correlation structure of π to improve efficiency. For example, Σ can be the spectrally-positive version of the inverse Hessian matrix of log π [21], or the inverse Fisher information matrix of the statistical observation model [20]. Note that, in a similar fashion to preconditioning in optimization, using the exact full Hessian or Fisher information matrix is often too computationally expensive in high-dimensional settings and more efficient representations must be used instead. Alternatively, adaptive versions of MALA can learn a representation of the covariance structure online [22]. For sufficiently smooth target densities, MALA's performance can be further optimized by scaling Σ (or δ) to achieve an acceptance probability of approximately 50% − 60% [23].
Finally, there has been significant empirical evidence that MALA can be very efficient for some models, particularly in high-dimensional settings and when the cost of computing the gradient ∇ log π(x) is low. Theoretically, for sufficiently smooth π, the complexity of MALA scales at rate O(N 1/3 ) [23], comparing very favorably to the O(N ) rate of RWMH algorithms. However, the convergence properties of the conventional MALA are not as robust as those of the RWMH algorithm. In particular, MALA may fail to converge if the tails of π are super-Gaussian or heavy-tailed, or if δ is chosen too large [19]. Similarly, MALA might also perform poorly if π is not sufficiently smooth, or multi-modal. Improving MALA's convergence properties is an active research topic. Many limitations of the original MALA algorithm can now be avoided by using more advanced versions [20,[24][25][26][27].

C. Hamiltonian Monte Carlo
The Hamiltonian Monte Carlo (HMC) method is a very elegant and successful instance of an MH algorithm based on auxiliary variables [28]. Let w ∈ R N , Σ ∈ R N ×N positive definite, and consider the augmented target density π(x, w) ∝ π(x) exp(− 1 2 w T Σ −1 w), which admits the desired target density π(x) as marginal. The HMC method is based on the observation that the trajectories defined by the so-called Hamiltonian dynamics preserve the level sets of π(x, w). A point (x 0 , w 0 ) ∈ R 2N that evolves according to the differential equations (8) during some simulation time period (0, t] yields a point (x t , w t ) that verifies π(x t , w t ) = π(x 0 , w 0 ). In MCMC terms, the deterministic proposal (8) has π(x, w) as invariant distribution. Exploiting this property for stochastic simulation, the HMC algorithm combines (8) with a stochastic sampling step, w ∼ N (0, Σ), that also has invariant distribution π(x, w), and that will produce an ergodic chain. Finally, as with the Langevin diffusion (5), it is generally not possible to solve the Hamiltonian equations (8) exactly. Instead, we use a leap-frog approximation detailed in [28] w where again the parameter δ controls the discrete-time increment. The approximation error introduced by (9) is then corrected with an MH step targeting π(x, w). This algorithm is summarized in Algo. 2 below (see [28] for details about the derivation of the acceptance probability).

Algorithm 2 Hamiltonian Monte Carlo (with leap-frog)
Set an initial state x (0) , δ > 0, and L ∈ N. for t = 1 to T do Refresh the auxiliary variable w ∼ N (0, Σ). Generate a candidate (x * , w * ) by propagating the current state (x (t−1) , w) with L leap-frog steps of length δ defined in (9). Compute the acceptance probability Accept the candidate and set x (t) = x * . else Reject the candidate and set x (t) = x (t−1) . end if end for Note that to obtain samples from the marginal π(x) it is sufficient to project the augmented samples (x (t) , w (t) ) onto the original space of x (i.e., by discarding the auxiliary variables w (t) ). It is also worth mentioning that under some regularity condition on π(x), the leap-frog approximation (9) is time-reversible and volume-preserving, and that these properties are key to the validity of the HMC algorithm [28].
Finally, there has been a large body of empirical evidence supporting HMC, particularly for high-dimensional models. Unfortunately, its theoretical convergence properties are much less well understood [29]. It has been recently established that for certain target densities the complexity of HMC scales at rate O(N 1/4 ), comparing favorably with MALA's rate O(N 1/3 ) [30]. However, it has also been observed that, as with MALA, HMC may fail to converge if the tails of π are super-Gaussian or heavy-tailed, or if δ is chosen too large. HMC may also perform poorly if π is multi-modal, or not sufficiently smooth.
Of course, the practical performance of HMC also depends strongly on the algorithm parameters Σ, L and δ [29]. The covariance matrix Σ should be designed to model the correlation structure of π(x), which can be determined by performing pilot runs, or alternatively by using the strategies described in [20,21,31]. The parameters δ and L should be adjusted to obtain an average acceptance probability of approximately 60% − 70% [30]. Again, this can be achieved by performing pilot runs, or by using an adaptive HMC algorithm that adjusts these parameters automatically [32,33].

D. Gibbs sampling
The Gibbs sampler (GS) is another very widely used MH algorithm which operates by updating the elements of x individually, or by groups, using the appropriate conditional distributions [4,Ch. 10]. This divide-and-conquer strategy often leads to important efficiency gains, particularly if the conditional densities involved are "simple", in the sense that it is computationally straightforward to draw samples from them. For illustration, suppose that we split the elements of x in three groups x = (x 1 , x 2 , x 3 ), and that by doing so we obtain three conditional densities π(x 1 |x 2 , x 3 ), π(x 2 |x 1 , x 3 ), and π(x 3 |x 1 , x 2 ) that are "simple" to sample. Using this decomposition, the GS targeting π proceeds as in Algo. 3. Somewhat surprisingly, the Markov kernel resulting from concatenating the component-wise kernels admits π(x) as joint invariant distribution, and thus the chain produced by Algo. 3 has the desired target density (see [4,Ch. 10] for a review of the theory behind this algorithm). This fundamental property holds even if the simulation from the conditionals is done by using other MCMC algorithms (e.g., RWMH, MALA or HMC steps targeting the conditional densities), though this may result in a deterioration of the algorithm convergence rate. Similarly, the property also holds if the frequency and order of the updates is scheduled randomly and adaptively to improve the overall performance.

Algorithm 3 Gibbs sampler algorithm
Set an initial state

end for
As with other MH algorithms, the performance of the GS depends on the correlation structure of π. Efficient samplers seek to update simultaneously the elements of x that are highly correlated with each other, and to update "slowmoving" elements more frequently. The structure of π can be determined by pilot runs, or alternatively by using an adaptive GS that learns it online and that adapts the updating schedule accordingly as described in [34]. However, updating elements in parallel often involves simulating from more complicated conditional distributions, and thus introduces a computational overhead. Finally, it is worth noting that the GS is very useful for SP models, which typically have sparse conditional independence structures (e.g., Markovian properties) and conjugate priors and hyper-priors from the exponential family. This often leads to simple one-dimensional conditional distributions that can be updated in parallel by groups [16,35].

E. Partially collapsed Gibbs sampling
The partially collapsed Gibbs sampler (PCGS) is a recent development in MCMC theory that seeks to address some of the limitations of the conventional GS [36]. As mentioned previously, the GS performs poorly if strongly correlated elements of x are updated separately, as this leads to chains that are highly correlated and to an inefficient exploration of the parameter space. However, updating several elements together can also be computationally expensive, particularly if it requires simulating from difficult conditional distributions. In collapsed samplers, this drawback is addressed by carefully replacing one or several conditional densities by partially collapsed, or marginalized conditional distributions.
For illustration, suppose that in our previous example the subvectors x 1 and x 2 exhibit strong dependencies, and that as a result the GS of Algo. 3 performs poorly. Assume that we are able to draw samples from the marginalized conditional density π(x 1 |x 3 ) = π(x 1 , x 2 |x 3 )dx 2 , which does not depend on x 2 . This leads to the PCGS described in Algo. 4 to sample from π, which "partially collapses" Algo. 3 by replacing π(x 1 |x 2 , x 3 ) with π(x 1 |x 3 ).

Algorithm 4 Partially collapsed Gibbs sampler
Set an initial state

end for
Van Dyk and Park [36] established that the PCGS is always at least as efficient as the conventional GS, and it has been observed that that the PCGS is remarkably efficient for some statistical models [37,38]. Unfortunately, PCGSs are not as widely applicable as GSs because they require simulating exactly from the partially collapsed conditional distributions. In general, using MCMC simulation (e.g., MH steps) within a PCGS will lead to an incorrect MCMC algorithm [39]. Similarly, altering the order of the updates (e.g., by permuting the simulations of x 1 and x 2 in Algo. 4) may also alter the target density [36].

III. SURROGATES FOR STOCHASTIC SIMULATION A. Variational Bayes
In the variational Bayes (VB) approach described in [40,41], the true posterior p(x|y) is approximated by a density q (x) ∈ Q, where Q is a subset of valid densities on x. In particular, where D(q p) denotes the Kullback-Leibler (KL) divergence between p and q. As a result of the optimization in (10) over a function, this is termed "variational Bayes" because of the relation to the calculus of variations [42]. Recalling that D(q p) reaches its minimum value of zero if and only if p = q [43], we see that q (x) = p(x|y) when Q includes all valid densities on x. However, the premise is that p(x|y) is too difficult to work with, and so Q is chosen as a balance between fidelity and tractability. Note that the use of D(q p), rather than D(p q), implies a search for a q that agrees with the true posterior p(x|y) over the set of x where p(x|y) is large. We will revisit this choice when discussing expectation propagation in Section III-E.
Rather than working with the KL divergence directly, it is common to decompose it as follows is known as the Gibbs free energy or variational free energy.
Rearranging (11), we see that as a consequence of D q(x) p(x|y) ≥ 0. Thus, F (q) can be interpreted as an upper bound on the negative log partition. Also, because ln Z is invariant to q, the optimization (10) can be rewritten as which avoids the difficult integral in (2). In the sequel, we will discuss several strategies to solve the variational optimization problem (14).

B. The mean-field approximation
A common choice of Q is the set of fully factorizable densities, resulting in the mean-field approximation [44,45] Substituting (15) into (12) yields the mean-field free energy where h(q j ) − q j (x j ) ln q j (x j ) dx j denotes the differential entropy. Furthermore, for j = 1, . . . , N , equation (16) can be written as where x \j [x 1 , . . . , x j−1 , x j , . . . , x N ] T for j = 1, . . . , N . Equation (17) implies the optimality condition where (19) suggests an iterative coordinate-ascent algorithm: update each component q j (x j ) of q(x) while holding the others fixed. But this requires solving the integral in (18). A solution arises when ∀j the conditionals p(x j , y|x \j ) belong to the same exponential family of distributions [46], i.e., where the sufficient statistic t(x j ) parameterizes the family. The exponential family encompasses a broad range of distributions, notably jointly Gaussian and multinomial p(x, y). y) and (20) into (18) immediately gives where the expectation is taken over (19) reduces to the momentmatching condition where γ j, is the optimal value of γ j .

C. The Bethe approach
In many cases, the fully factored model (15) yields too gross of an approximation. As an alternative, one might try to fit a model q(x) that has a similar dependency structure as p(x|y).
In the sequel, we assume that the true posterior factors as where x α are subvectors of x (sometimes called cliques or outer clusters) and f α are non-negative potential functions. Note that the factorization (23) defines a Gibbs random field when p(x|y) > 0. When a collection of variables {x n } always appears together in a factor, we can collect them into x β , an inner cluster, although it is not necessary to do so. For simplicity we will assume that these x β are non-overlapping (i.e., x β ∩ x β = 0 ∀β = β ), so that {x β } represents a partition of x. The factorization (23) can then be drawn as a factor graph to help visualize the structure of the posterior, as in Figure 1.
x β Fig. 1. An example of a factor graph, which is a bipartite graph consisting of variable nodes, (circles/ovals), and factor nodes, (boxes). In this example, There are several choices for the inner clusters x β . One is the full factorization , and x 4 = x 4 . Another is the partial factorization which results in the "super node" in the dashed oval. Another is no factorization: in the "super node" in the dotted oval. In the latter case, we redefine each factor fα to have the full domain x (with trivial dependencies where needed).
We now seek a tractable way to build an approximation q(x) with the same dependency structure as (23). But rather than designing q(x) as a whole, we design the cluster marginals, {q α (x α )} and {q β (x β )}, which must be non-negative, normalized, and consistent where x α\β gathers the components of x that are contained in the cluster α and not in the cluster β, and N α denotes the neighborhood of the factor α (i.e., the set of inner clusters β connected to α).
In general, it is difficult to specify q(x) from its cluster marginals. However, in the special case that the factor graph has a tree structure (i.e., there is at most one path from one node in the graph to another), we have [47] q (27) where N β = |N β | is the neighborhood size of the cluster β. In this case, the free energy (12) simplifies to where F B is known as the Bethe free energy (BFE) [47].
Clearly, if the true posterior {f α } has a tree structure, and no constraints beyond (24)- (26) are placed on the cluster marginals {q α }, {q β }, then minimization of F B {q α }, {q β } will recover the cluster marginals of the true posterior. But even when {f α } is not a tree, F B {q α }, {q β } can be used as an approximation of the Gibbs free energy F (q), and minimizing F B can be interpreted as designing a q that locally matches the true posterior.
The remaining question is how to efficiently minimize F B {q α }, {q β } subject to the (linear) constraints (24)- (26). Complicating matters is the fact that F B {q α }, {q β } is the sum of convex KL divergences and concave entropies. One option is direct minimization using a "double loop" approach like the concave-convex procedure (CCCP) [48], where the outer loop linearizes the concave term about the current estimate and the inner loop solves the resulting convex optimization problem (typically with an iterative technique). Another option is belief propagation, which is described below.

D. Belief propagation
Belief propagation (BP) [49,50] is an algorithm for computing (or approximating) marginal probability density functions (pdfs) 2 like q β (x β ) and q α (x α ) by propagating messages on 2 Note that another form of BP exists to compute the maximum a posteriori (MAP) estimate arg maxx p(x|y) known as the "max-product" or "minsum" algorithm [50]. However, this approach does not address the problem of computing surrogates for stochastic methods, and so is not discussed further. a factor graph. The standard form of BP is given by the sumproduct algorithm (SPA) [51], which computes the following messages from each factor node f α to each variable (super) node x β and vice versa These messages are then used to compute the beliefs which must be normalized in accordance with (25). The messages (29)- (30) do not need to be normalized, although it is often done in practice to prevent numerical overflow. When the factor graph {f α } has a tree structure, the BPcomputed marginals coincide with the true marginals after one round of forward and backward message passes. Thus, BP on a tree-graph is sometimes referred to as the forward-backward algorithm, particularly in the context of hidden Markov models [52]. In the tree case, BP is akin to a dynamic programming algorithm that organizes the computations needed for marginal evaluation in a tractable manner.
When the factor graph {f α } has cycles or "loops," BP can still be applied by iterating the message computations (29)-(30) until convergence (not guaranteed), which is known as loopy BP (LBP). However, the corresponding beliefs (31)- (32) are in general only approximations of the true marginals. This suboptimality is expected because exact marginal computation on a loopy graph is an NP-hard problem [53]. Still, the answers computed by LBP are in many cases very accurate [54]. For example, LBP methods have been successfully applied to communication and SP problems such as: turbo decoding [55], LDPC decoding [49,56], inference on Markov random fields [57], multiuser detection [58], and compressive sensing [59,60].
Although the excellent performance of LBP was at first a mystery, it was later established that LBP minimizes the constrained BFE. More precisely, the fixed points of LBP coincide with the stationary points of F B {q α }, {q β } from (28) under the constraints (24)-(26) [47]. The link between LBP and BFE can be established through the Lagrangian formalism, which converts constrained BFE minimization to an unconstrained minimization through the incorporation of additional variables known as Lagrange multipliers [61]. By setting the derivatives of the Lagrangian to zero, one obtains a set of equations that are equivalent to the message updates (29)-(30) [47]. In particular, the stationary-point versions of the Lagrange multipliers equal the fixed-point versions of the loopy SPA log-messages.
Note that, unlike the mean-field approach (15), the clusterbased nature of LBP does not facilitate an explicit description of the joint-posterior approximation q ∈ Q from (10). The reason is that, when the factor graph is loopy, there is no straightforward relationship between the joint posterior q(x) and the cluster marginals {q α (x α )}, {q β (x β )}, as explained before (27). Instead, it is better to interpret LBP as an efficient implementation of the Bethe approach from Section III-C, which aims for a local approximation of the true posterior.
In summary, by constructing a factor graph with lowdimensional {x β } and applying BP or LBP, we trade the highdimensional integral q β (x β ) = p(x|y) dx \β for a sequence of low-dimensional message computations (29)- (30). But (29)-(30) are themselves tractable only for a few families of {f α }. Typically, {f α } are limited to members of the exponential family closed under marginalization (see [62]), so that the updates of the message pdfs (29)-(30) reduce to updates of the natural parameters (i.e., η in (20)). The two most common instances are multivariate Gaussian pdfs and multinomial probability mass functions (pmfs). For both of these cases, when LBP converges, it tends to be much faster than doubleloop algorithms like CCCP (see, e.g., [63]). However, LBP does not always converge [54].

E. Expectation propagation
Expectation propagation (EP) [64] (see also the overviews [62,65]) is an iterative method of approximate inference that is reminiscent of LBP but has much more flexibility with regards to the modeling distributions. In EP, the true posterior p(x|y), which is assumed to factorize as in (23), is approximated by where x α are the same as in (23) and m α are referred to as "site approximations." Although no constraints are imposed on the true-posterior factors {f α }, the approximation q(x) is restricted to a factorized exponential family. In particular, with some given base measure. We note that our description of EP applies to arbitrary partitions {x β }, from the trivial partition x β = x to the full partition x β = x β . The EP algorithm iterates the following updates over all factors α until convergence (not guaranteed) where Q in (38) refers to the set of q(x) obeying (34)- (35). Essentially, (36) removes the αth site approximation m α from the posterior model q, and (37) replaces it with the true factor f α . Here, q \α is know as the "cavity" distribution. The quantity q \α is then projected onto the exponential family in (38).
The site approximation is then updated in (39), and the old quantities are overwritten in (40)- (41). Note that the right side of (39) depends only on x α because q new (x) and q \α (x) differ only over x α . Note also that the KL divergence in (38) is reversed relative to (10). The EP updates (37)- (41) can be simplified by leveraging the factorized exponential family structure in (34)- (35). First, for (33) to be consistent with (34)- (35), each site approximation must factor into exponential-family terms, i.e., It can then be shown [62] that (36)-(38) reduce to for all β ∈ N α , which can be interpreted as the moment match- for all β ∈ N α . Finally, (40) and (41) reduce to γ β ← γ new β and µ α,β ← µ new α,β , respectively, for all β ∈ N α . Interestingly, in the case that the true factors {f α } are members of an exponential family closed under marginalization, the version of EP described above is equivalent to the SPA up to a change in message schedule. In particular, for each given factor node α, the SPA updates the outgoing message towards one variable node β per iteration, whereas EP simultaneously updates the outgoing messages in all directions, resulting in m new α (see, e.g., [41]). By restricting the optimization in (39) to a single factor q new β , EP can be made equivalent to the SPA. On the other hand, for generic factors {f α }, EP can be viewed as a tractable approximation of the (intractable) SPA.
Although the above form of EP iterates serially through the factor nodes α, it is also possible to perform the updates in parallel, resulting in what is known as the expectation consistent (EC) approximation algorithm [66].
EP and EC have an interesting BFE interpretation. Whereas the fixed points of LBP coincide with the stationary points of the BFE (28) subject to (24)- (25) and strong consistency (26), the fixed points of EP and EC coincide with the stationary points of the BFE (28) subject to (24)-(25) and the weak consistency (i.e., moment-matching) constraint [67] EP, like LBP, is not guaranteed to converge. Hence, provably convergence double-loop algorithms have been proposed that directly minimize the weakly constrained BFE, e.g., [67].

F. Approximate message passing
So-called approximate message passing (AMP) algorithms [59,60] have recently been developed for the separable generalized linear model where the prior p(x) is fully factorizable, as is the conditional pdf p(y|z) relating the observation vector y to the (hidden) transform output vector z Ax, where A [a 1 , ..., a M ] T ∈ R M ×N is a known linear transform. Like EP, AMP allows tractable inference under generic 3 p x and p y|z . AMP can be derived as an approximation of LBP on the factor graph constructed with inner clusters x β = x β for β = 1, . . . , N , with outer clusters x α = x for α = 1, . . . , M and x α = x α−M for α = M + 1, . . . , M + N , and with factors In the large-system limit (LSL), i.e., M, N → ∞ for fixed ratio M/N , the LBP beliefs q β (x β ) simplify to where { r β } N β=1 and τ are iteratively updated parameters. Similarly, for m = 1, . . . , M , the belief on z m , denoted by q z,m (·), simplifies to where { p m } M m=1 and ν are iteratively updated parameters. Each AMP iteration requires only one evaluation of the mean and variance of (49)-(50), one multiplication by A and A T , and relatively few iterations, making it very computationally efficient, especially when these multiplications have fast implementations (e.g., using fast Fourier transforms and discrete wavelet transforms ).
In the LSL under i.i.d sub-Gaussian A, AMP is fully characterized by a scalar state evolution (SE). When this SE has a unique fixed point, the marginal posterior approximations (49)-(50) are known to be exact [68,69].
For generic A, AMP's fixed points coincide with the stationary points of an LSL version of the BFE [70,71]. When AMP converges, its posterior approximations are often very accurate (e.g., [72]), but AMP does not always converge. In the special case of Gaussian likelihood p y|z and prior p x , AMP convergence is fully understood: convergence depends on the ratio of peak-to-average squared singular values of A, and convergence can be guaranteed for any A with appropriate damping [73]. For generic p x and p y|z , damping greatly helps convergence [74] but theoretical guarantees are lacking. A double-loop algorithm to directly minimize the LSL-BFE was recently proposed and shown to have global convergence for strictly log-concave p x and p y|z under generic A [75].

A. Optimization problem
The Monte Carlo methods described in Section II provide a general approach for estimating reliably posterior probabilities and expectations. However, their high computational cost often makes them unattractive for applications involving very high dimensionality or tight computing time constraints. One alternative strategy is to perform inference approximately 3 More precisely, the AMP algorithm [59] handles Gaussian p y|z while the generalized AMP (GAMP) algorithm [60] handles arbitrary p y|z . by using deterministic surrogates as described in Section III. Unfortunately, these faster inference methods are not as generally applicable, and because they rely on approximations, the resulting inferences can suffer from estimation bias. As already mentioned, if one focuses on the MAP estimator, efficient optimization techniques can be employed, which are often more computationally tractable than MCMC methods and, for which strong guarantees of convergence exist. In many SP applications, the computation of the MAP estimator of x can be formulated as an optimization problem having the following form minimize x∈R N ϕ(Hx, y) + g(Dx) (51) where ϕ : H ∈ R M ×N , and D ∈ R P ×N with P ∈ N * . For example, H may be a linear operator modeling a degradation of the signal of interest, y a vector of observed data, ϕ a leastsquares criterion corresponding to the negative log-likelihood associated with an additive zero-mean white Gaussian noise, g a sparsity promoting measure, e.g., an 1 norm, and D a frame analysis transform or a gradient operator. Often, ϕ is an additively separable function, i.e., where z = [z 1 , ..., z M ] T . Under this condition, the previous optimization problem becomes an instance of the more general stochastic one involving the expectation where j, y, and H are now assumed to be random variables and the expectation is computed with respect to the joint distribution of (j, y, H), with h T j the j-th line of H. More precisely, when (52) holds, (51) is then a special case of (53) with j uniformly distributed over {1, . . . , M } and (y, H) deterministic. Conversely, it is also possible to consider that j is deterministic and that for every i ∈ {2, . . . , M }, ϕ i = ϕ 1 , and (y i , h i ) 1≤i≤M are identically distributed random variables. In this second scenario, because of the separability condition (52), the optimization problem (51) can be regarded as a proxy for (53), where the expectation Φ(x) is approximated by a sample estimate (or stochastic approximation under suitable mixing assumptions). All these remarks illustrate the existing connections between problems (51) and (53). Note that the stochastic optimization problem defined in (53) has been extensively investigated in two communities: machine learning, and adaptive filtering, often under quite different practical assumptions on the forms of the functions (ϕ j ) j≥1 and g. In machine learning [76][77][78], x indeed represents the vector of parameters of a classifier which has to be learnt, whereas in adaptive filtering [79,80], it is generally the impulse response of an unknown filter which needs to be identified and possibly tracked. In order to simplify our presentation, in the rest of this section, we will assume that the functions (ϕ j ) j≥1 are convex and Lipschitz-differentiable with respect to their first argument (for example, they may be logistic functions).

B. Optimization algorithms for solving stochastic problems
The main difficulty arising in the resolution of the stochastic optimization problem (53) is that the integral involved in the expectation term often cannot be computed in practice since it is generally high-dimensional and the underlying probability measure is usually unknown. Two main computational approaches have been proposed in the literature to overcome this issue. The first idea is to approximate the expected loss function by using a finite set of observations and to minimize the associated empirical loss (51). The resulting deterministic optimization problem can then be solved by using either deterministic or stochastic algorithms, the latter being the topic of Section IV-C. Here, we focus on the second family of methods grounded in stochastic approximation techniques to handle the expectation in (54). More precisely, a sequence of identically distributed samples (y j , h j ) j≥1 is drawn, which are processed sequentially according to some update rule. The iterative algorithm aims to produce a sequence of random iterates (x j ) j≥1 converging to a solution to (53).
We begin with a group of online learning algorithms based on extensions of the well-known stochastic gradient descent (SGD) approach. Then we will turn our attention to stochastic optimization techniques developed in the context of adaptive filtering.
1) Online learning methods based on SGD: Let us assume that an estimate u j ∈ R N of the gradient of Φ at x j is available at each iteration j ≥ 1 A popular strategy for solving (53) in this context leverages the gradient estimates to derive a so-called stochastic forward-backward (SFB) scheme, (also sometimes called stochastic proximal gradient algorithm) where (γ j ) j≥1 is a sequence of positive stepsize values and (λ j ) j≥1 is a sequence of relaxation parameters in ]0, 1]. Hereabove, prox ψ (v) denotes the proximity operator at v ∈ R N of a lower-semicontinuous convex function ψ : R N →] − ∞, +∞] with nonempty domain, i.e., the unique minimizer of ψ + 1 2 · −v 2 (see [81] and the references therein), and g • D(x) = g(Dx). A convergence analysis of the SFB scheme has been conducted in [82][83][84][85], under various assumptions on the functions Φ, g, on the stepsize sequence, and on the statistical properties of (u j ) j≥1 . For example, if x 1 is set to a given (deterministic) value, the sequence (x j ) j≥1 generated by (55) is guaranteed to converge almost surely to a solution of Problem (53) under the following technical assumptions [84] (i) Φ has a β −1 -Lipschitzian gradient with β ∈]0, +∞[, g is a lower-semicontinuous convex function, and Φ + g • D is strongly convex.
When g ≡ 0, the SFB algorithm in (55) becomes equivalent to SGD [86][87][88][89]. According to the above result, the convergence of SGD is ensured as soon as j≥1 λ j γ j = +∞ and j≥1 λ j γ 2 j < +∞. In the unrelaxed case defined by λ j ≡ 1, we then retrieve a particular case of the decaying condition γ j ∝ j −1/2−δ with δ ∈]0, 1/2] usually imposed on the stepsize in the convergence studies of SGD under slightly different assumptions on the gradient estimates (u j ) j≥1 (see [90,91] for more details). Note also that better convergence properties can be obtained, if a Polyak-Ruppert averaging approach is performed, i.e., the averaged sequence (x j ) j≥1 , defined as x i for every j ≥ 1, is considered instead of (x j ) j≥1 in the convergence analysis [90,92].
We now comment on approaches related to SFB that have been proposed in the literature to solve (53). It should first be noted that a simple alternative strategy to deal with a possibly nonsmooth term g is to incorporate a subgradient step into the previously mentioned SGD algorithm [93]. However, this approach, like its deterministic version, may suffer from a slow convergence rate [94]. Another family of methods, close to SFB, adopt the regularized dual averaging (RDA) strategy, first introduced in [94]. The principal difference between SFB and RDA methods is that the latter rely on iterative averaging of the stochastic gradient estimates, which consists of replacing in the update rule (55), (u j ) j≥1 by (u j ) j≥1 where, for every j ≥ 1, u j = 1 j j i=1 u i . The advantage is that it provides convergence guarantees for nondecaying stepsize sequences. Finally, the so-called composite mirror descent methods, introduced in [95], can be viewed as extended versions of the SFB algorithm where the proximity operator is computed with respect to a non Euclidean distance (typically, a Bregman divergence).
In the last few years, a great deal of effort has been made to modify SFB when the proximity operator of g • D does not have a simple expression, but when g can be split into several terms whose proximity operators are explicit. We can mention the stochastic proximal averaging strategy from [96], the stochastic alternating direction method of mutipliers (ADMM) from [97][98][99] and the alternating block strategy from [100] suited to the case when g • D is a separable function.
Another active research area addresses the search for strategies to improve the convergence rate of SFB. Two main approaches can be distinguished in the literature. The first, adopted for example in [83,[101][102][103], relies on subspace acceleration. In such methods, usually reminiscent of Nesterov's acceleration techniques in the deterministic case, the convergence rate is improved by using information from previous iterates for the construction of the new estimate. Another efficient way to accelerate the convergence of SFB is to incorporate in the update rule second-order information one may have on the cost functions. For instance, the method described in [104] incorporates quasi-Newton metrics into the SFB and RDA algorithms, and the natural gradient method from [105] can be viewed as a preconditioned SGD algorithm. The two strategies can be combined, as for example, in [106].
2) Adaptive filtering methods: In adaptive filtering, stochastic gradient-like methods have been quite popular for a long time [107,108]. In this field, the functions (ϕ j ) j≥1 often reduce to a least squares criterion where x is the unknown impulse response. However, a specific difficulty to be addressed is that the designed algorithms must be able to deal with dynamical problems the optimal solution of which may be time-varying due to some changes in the statistics of the available data. In this context, it may be useful to adopt a multivariate formulation by imposing, at each where y j = [y j , . . . , y j−Q+1 ] T , H j = [h j , . . . , h j−Q+1 ] T , and Q ≥ 1. This technique, reminiscent of mini-batch procedures in machine learning, constitutes the principle of affine projection algorithms, the purpose of which is to accelerate the convergence speed [109]. Our focus now switches to recent work which aims to impose some sparse structure on the desired solution.
A simple method for imposing sparsity is to introduce a suitable adaptive preconditioning strategy in the stochastic gradient iteration, leading to the so-called proportionate least mean square method [110,111], which can be combined with affine projection techniques [112,113]. Similarly to the work already mentioned that has been developed in the machine learning community, a second approach proceeds by minimizing penalized criteria such as (53) where g is a sparsity measure and D = I N . In [114,115], zeroattracting algorithms are developed which are based on the stochastic subgradient method. These algorithms have been further extended to affine projection techniques in [116][117][118]. Proximal methods have also been proposed in the context of adaptive filtering, grounded on the use of the forward-backward algorithm [119], an accelerated version of it [120], or primal-dual approaches [121]. It is interesting to note that proportionate affine projection algorithms can be viewed as special cases of these methods [119]. Other types of algorithms have been proposed which provide extensions of the recursive least squares method, which is known for its fast convergence properties [106,122,123]. Instead of minimizing a sparsity promoting criterion, it is also possible to formulate the problem as a feasibility problem where, at iteration j ≥ Q, one searches for a vector x satisfying both denotes the (possibly weighted) 1 norm and (η, ρ) ∈]0, +∞[ 2 .
Over-relaxed projection algorithms allow such kind of problems to be solved efficiently [124,125].

C. Stochastic algorithms for solving deterministic optimization problems
We now consider the deterministic optimization problem defined by (51) and (52). Of particular interest is the case when the dimensions N and/or M are very large (for instance, in [126], M = 2500000 and in [127], N = 100250).
1) Incremental gradient algorithms: Let us start with incremental methods, which are dedicated to the solution of (51) when M is large, so that one prefers to exploit at each iteration a single term ϕ j , usually through its gradient, rather than the global function ϕ. There are many variants of incremental algorithms, which differ in the assumptions made on the functions involved, on the stepsize sequence, and on the way of activating the functions (ϕ i ) 1≤i≤M . This order could follow either a deterministic [128] or a randomized rule. However, it should be noted that the use of randomization in the selection of the components presents some benefits in terms of convergence rates [129] which are of particular interest in the context of machine learning [130,131], where the user can only afford few full passes over the data. Among randomized incremental methods, the SAGA algorithm [132], presented below, allows the problem defined in (51) to be solved when the function g is not necessarily smooth, by making use of the proximity operator introduced previously. The n-th iteration of SAGA reads as u n = h jn ∇ϕ jn (h T jn x n , y jn ) − h jn ∇ϕ jn (h T jn z jn,n , y jn ) where γ ∈]0, +∞[, for all i∈{1, . . . , M }, z i,1 = x 1 ∈ R N , and j n is drawn from an i.i.d. uniform distribution on {1, . . . , M }. Note that, although the storage of the variables (z i,n ) 1≤i≤M can be avoided in this method, it is necessary to store the M gradient vectors h i ∇ϕ i (h T i z i,n , y i ) 1≤i≤M . The convergence of Algorithm (58) has been analyzed in [132]. If the functions (ϕ i ) 1≤ı≤M are β −1 -Lipschitz differentiable and µ-strongly convex with (β, µ) ∈]0, +∞[ 2 and the stepsize γ equals β/(2(µβM + 1)), then (E { x n − x 2 }) n∈N goes to zero geometrically with rate 1 − γ, where x is the solution to Problem (51). When only convexity is assumed, a weaker convergence result is available.
The relationship between Algorithm (58) and other stochastic incremental methods existing in the literature is worthy of comment. The main distinction arises in the way of building the gradient estimates (u n ) n≥1 . The standard incremental gradient algorithm, analyzed for instance in [129], relies on simply defining, at iteration n, u n = h jn ∇ϕ jn (h T jn x n , y jn ). However, this approach, while leading to a smaller computational complexity per iteration and to a lower memory requirement, gives rise to suboptimal convergence rates [91,129], mainly due to the fact that its convergence requires a stepsize sequence (γ n ) n≥1 decaying to zero. Motivated by this observation, much recent work [126,[130][131][132][133][134] has been dedicated to the development of fast incremental gradient methods, which would benefit from the same convergence rates as batch optimization methods, while using a randomized incremental approach. A first class of methods relies on a variance reduction approach [130,[132][133][134] which aims at diminishing the variance in successive estimates (u n ) n≥1 . All of the aforementioned algorithms are based on iterations which are similar to (58). In the stochastic variance reduction gradient method and the semi-stochastic gradient descent method proposed in [133,134], a full gradient step is made at every K iterations, K ≥ 1, so that a single vector z n is used instead of (z i,n ) 1≤i≤M in the update rule. This so-called mini-batch strategy leads to a reduced memory requirement at the expense of more gradient evaluations. As pointed out in [132], the choice between one strategy or another may depend on the problem size and on the computer architecture. In the stochastic average gradient algorithm (SAG) from [130], a multiplicative factor 1/M is placed in front of the gradient differences, leading to a lower variance counterbalanced by a bias in the gradient estimates. It should be emphasized that the work in [130,133] is limited to the case when g ≡ 0.
A second class of methods, closely related to SAG, consists of applying the proximal step to z n − γu n , where z n is the average of the variables (z i,n ) 1≤i≤M (which thus need to be stored). This approach is retained for instance in the Finito algorithm [131] as well as in some instances of the minimization by incremental surrogate optimization (MISO) algorithm, proposed in [126]. These methods are of particular interest when the extra storage cost is negligible with respect to the high computational cost of the gradients. Note that the MISO algorithm relying on the majoration-minimization framework employs a more generic update rule than Finito and has proven convergence guarantees even when g is nonzero.
2) Block coordinate approaches: In the spirit of the Gauss-Seidel method, an efficient approach for dealing with Problem (51) when N is large consists of resorting to block coordinate alternating strategies. Sometimes, such a block alternation can be performed in a deterministic manner [135,136]. However, many optimization methods are based on fixed point algorithms, and it can be shown that with deterministic block coordinate strategies, the contraction properties which are required to guarantee the convergence of such algorithms are generally no longer satisfied. In turn, by resorting to stochastic techniques, these properties can be retrieved in some probabilistic sense [85]. In addition, using stochastic rules for activating the different blocks of variables often turns out to be more flexible.
To illustrate why there is interest in block coordinate approaches, let us split the target variable x as [x T 1 , . . . , x T K ] T , where, for every k ∈ {1, . . . , K}, x k ∈ R N k is the k-th block of variables with reduced dimension N k (with N 1 +· · ·+N K = N ). Let us further assume that the regularization function can be blockwise decomposed as where, for every k ∈ {1, . . . , K}, D k is a matrix in R P k ×N k , and g 1,k : R N k →] − ∞, +∞] and g 2,k : R P k →] − ∞, +∞] are proper lower-semicontinuous convex functions. Then, the stochastic primal-dual proximal algorithm allowing us to solve Problem (51) is given by In the algorithm above, for every i ∈ {1, . . . , M }, the scalar product h T i x has been rewritten in a blockwise manner as Under some stability conditions on the choice of the positive step sizes τ and γ, x n = [x T 1,n , . . . , x T K,n ] T converges almost surely to a solution of the minimization problem, as n → +∞ (see [137] for more technical details). It is important to note that the convergence result was established for arbitrary probabilities ε = [ε 1 , . . . , ε K ] T , provided that the block activation probabilities ε k are positive and independent of n. Note that the various blocks can also be activated in a dependent manner at a given iteration n. Like its deterministic counterparts (see [138] and the references therein), this algorithm enjoys the property of not requiring any matrix inversion, which is of paramount importance when the matrices (D k ) 1≤k≤K are of large size and do not have some simple forms.
When g 2,k ≡ 0, the random block coordinate forwardbackward algorithm is recovered as an instance of Algorithm 5 since the dual variables (v k,n ) 1≤k≤K,n∈N can be set to 0 and the constant τ becomes useless. An extensive literature exists on the latter algorithm and its variants. In particular, its almost sure convergence was established in [85] under general conditions, whereas some worst case convergence rates were derived in [139][140][141][142][143]. In addition, if g 1,k ≡ 0, the random block coordinate descent algorithm is obtained [144].
When the objective function minimized in Problem (51) is strongly convex, the random block coordinate forwardbackward algorithm can be applied to the dual problem, in a similar fashion to the dual forward-backward method used in the deterministic case [145]. This leads to so-called dual ascent strategies which have become quite popular in machine learning [146][147][148][149].
Random block coordinate versions of other proximal algorithms such as the Douglas-Rachford algorithm and ADMM have also been proposed [85,150]. Finally, it is worth emphasizing that asynchronous distributed algorithms can be deduced from various randomly activated block coordinate methods [137,151]. As well as dual ascent methods, the latter algorithms can also be viewed as incremental methods.  [4]. In this section we highlight some of the interesting new connections between modern simulation and optimization that we believe are particularly relevant for the SP community, and that we hope will stimulate further research in this community.

A. Riemannian manifold MALA and HMC
Riemannian manifold MALA and HMC exploit differential geometry for the problem of specifying an appropriate proposal covariance matrix Σ that takes into account the geometry of the target density π [20]. These new methods stem from the observation that specifying Σ is equivalent to formulating the Langevin or Hamiltonian dynamics in an Euclidean parameter space with inner product w, Σ −1 x . Riemannian methods advance this observation by considering a smoothly-varying position dependent matrix Σ(x), which arises naturally by formulating the dynamics in a Riemannian manifold. The choice of Σ(x) then becomes the more familiar problem of specifying a metric or distance for the parameter space [20]. Notice that the Riemannian and the canonical Euclidean gradients are related by∇g(x) = Σ(x)∇g(x). Therefore this problem is also closely related to gradient preconditioning in gradient descent optimization discussed in Sec IV.B. Standard choices for Σ include for example the inverse Hessian matrix [21,31], which is closely related to Newton's optimization method, and the inverse Fisher information matrix [20], which is the "natural" metric from an information geometry viewpoint and is also related to optimization by natural gradient descent [105]. These strategies have originated in the computational statistics community, and perform well in inference problems that are not too high-dimensional. Therefore, the challenge is to design new metrics that are appropriate for SP statistical models (see [152,153] for recent work in this direction).

B. Proximal MCMC algorithms
Most high-dimensional MCMC algorithms rely particularly strongly on differential analysis to navigate vast parameter spaces efficieoptimizationntly. Conversely, the potential of convex calculus for MCMC simulation remains largely unexplored. This is in sharp contrast with modern high-dimensional optimization described in Section IV, where convex calculus in general, and proximity operators [81,154] in particular, are used extensively. This raises the question as to whether convex calculus and proximity operators can also be useful for stochastic simulation, especially for high-dimensional target densities that are log-concave, and possibly not continuously differentiable.
This question was studied recently in [24] in the context of Langevin algorithms. As explained in Section II.B, Langevin MCMC algorithms are derived from discrete-time approximations of the time-continuous Langevin diffusion process (5). Of course, the stability and accuracy of the discrete approximations determine the theoretical and practical convergence properties of the MCMC algorithms they underpin. The approximations commonly used in the literature are generally well-behaved and lead to powerful MCMC methods. However, they can perform poorly if π is not sufficiently regular, for example if π is not continuously differentiable, if it is heavytailed, or if it has lighter tails than a Gaussian distribution. This drawback limits the application of MCMC approaches to many SP problems, which rely increasingly on models that are not continuously differentiable or that involve constraints.
Using proximity operators, the following proximal approximation for the Langevin diffusion process (5) was recently proposed in [24] X (t+1) ∼ N prox −δ 2 log π X (t) , δI N as an alternative to the standard forward Euler approximation X (t+1) ∼ N X (t) + δ 2 ∇ log π X (t) , δI n used in MALA 4 . Similarly to MALA, the time step δ can be adjusted online to achieve an acceptance probability of approximately 50%. It was established in [24] that when π is log-concave, (60) defines a remarkably stable discretization of (5) with optimal theoretical convergence properties. Moreover, the "proximal" MALA resulting from combining (60) with an MH step has very good geometric ergodicity properties. In [24], the algorithm efficiency was demonstrated empirically on challenging models that are not well addressed by other MALA or HMC methodologies, including an image resolution enhancement model with a total-variation prior. Further practical assessments of proximal MALA algorithms would therefore be a welcome area of research.
Proximity operators have also been used recently in [155] for HMC sampling from log-concave densities that are not continuously differentiable. The experiments reported in [155] show that this approach can be very efficient, in particular for SP models involving high-dimensionality and non-smooth priors. Unfortunately, theoretically analyzing HMC methods is difficult, and the precise theoretical convergence properties of this algorithm are not yet fully understood. We hope future work will focus on this topic.

C. optimization-driven Gaussian simulation
The standard approach for simulating from a multivariate Gaussian distribution with precision matrix Q ∈ R n×n is to perform a Cholesky factorization Q = L T L, generate an auxiliary Gaussian vector w ∼ N (0, I N ), and then obtain the desired sample x by solving the linear system Lx = w [156]. The computational complexity of this approach generally scales at a prohibitive rate O(N 3 ) with the model dimension N , making it impractical for large problems, (note however that there are specific cases with lower complexity, for instance when Q is Toeplitz [157], circulant [158] or sparse [156]).
optimization-driven Gaussian simulators arise from the observation that the samples can also be obtained by minimizing a carefully designed stochastic cost function [159,160]. For illustration, consider a Bayesian model with Gaussian likelihood y|x ∼ N (Hx, Σ y ) and Gaussian prior x ∼ N (x 0 , Σ x ), for some linear observation operator H ∈ R N ×M , prior mean x 0 ∈ R N , and positive definite covariance matrices Σ x ∈ R N ×N and Σ y ∈ R M ×M . The posterior distribution p(x|y) is Gaussian with mean µ ∈ R N and precision matrix Q ∈ R N ×N given by Simulating samples x|y ∼ N (µ, Q −1 ) by Cholesky factorization of Q can be computationally expensive when N is large. Instead, optimization-driven simulators generate samples by solving the following "random" minimization problem with random vectors w 1 ∼ N (y, Σ y ) and w 2 ∼ N (x 0 , Σ x ).
It is easy to check that if (61) is solved exactly, then x is a sample from the desired posterior distribution p(x|y). From a computational viewpoint, however, it is significantly more efficient to solve (61) approximately, for example by using a few linear conjugate gradient iterations [160]. The approximation error can then be corrected by using an MH step [161], at the expense of introducing some correlation between the samples and therefore reducing the total effective sample size. Fortunately, there is an elegant strategy to determine automatically the optimal number of conjugate gradient iterations that maximizes the overall efficiency of the algorithm [161].

VI. CONCLUSIONS AND OBSERVATIONS
In writing this paper we have sought to provide an introduction to stochastic simulation and optimization methods in a tutorial format, but which also raised some interesting topics for future research. We have addressed a variety of MCMC methods and discussed surrogate methods, such as variational Bayes, the Bethe approach, belief and expectation propagation, and approximate message passing. We also discussed a range of recent advances in optimization methods that have been proposed to solve stochastic problems, as well as stochastic methods for deterministic optimization. Subsequently, we highlighted new methods that combine simulation and optimization, such as proximal MCMC algorithms and optimization-driven Gaussian simulators. Our expectation is that future methodologies will become more flexible. Our community has successfully applied computational inference methods, as we have described, to a plethora of challenges across an enormous range of application domains. Each problem offers different challenges, ranging from model dimensionality and complexity, data (too much or too little), inferences, accuracy and computation times. Consequently, it seems not unreasonable to speculate that the different computational methodologies discussed in this paper will evolve to become more adaptable, with their boundaries becoming less well defined, and with the development of algorithms that make use of simulation, variational approximations and optimization simultaneously. Such an approach is more likely to be able to handle an even wider range of models, datasets, inferences, accuracies and computing times in a computationally efficient way.