Expectation Propagation for Poisson Data

The Poisson distribution arises naturally when dealing with data involving counts, and it has found many applications in inverse problems and imaging. In this work, we develop an approximate Bayesian inference technique based on expectation propagation for approximating the posterior distribution formed from the Poisson likelihood function and a Laplace type prior distribution, e.g., the anisotropic total variation prior. The approach iteratively yields a Gaussian approximation, and at each iteration, it updates the Gaussian approximation to one factor of the posterior distribution by moment matching. We derive explicit update formulas in terms of one-dimensional integrals, and also discuss stable and efficient quadrature rules for evaluating these integrals. The method is showcased on two-dimensional PET images.


Introduction
The Poisson distribution is widely employed to describe inverse and imaging problems involving count data, e.g., emission computed tomography [43,39], including positron emission tomography and single photon emission computed tomography. The corresponding likelihood function is a Poisson distribution with its parameter given by an affine transform (followed by a suitable link function). Over the past few decades, the mathematical theory and numerical algorithms for image reconstruction with Poisson data have witnessed impressive progresses. We refer interested readers to [21] for a comprehensive overview on variational regularization techniques for Poisson data and [3] for mathematical modeling and numerical methods for Poisson data.
To cope with the inherent ill-posed nature of the imaging problem, regularization plays an important role. This can be achieved implicitly via early stopping during an iterative reconstruction procedure (e.g., EM algorithm or Richardson-Lucy iterations) or explicitly via suitable penalties, e.g., Sobolev penalty, sparsity and total variation. The penalized maximum likelihood (or equivalently maximum a posteriori (MAP)) is currently the most popular way for image reconstruction with Poisson models [11,40]. However, these approaches can only provide point estimates, and the important issue of uncertainty quantification, which provides crucial reliability assessment on point estimates, is not fully addressed. The Bayesian approach provides a principled yet very flexible framework for uncertainty quantification of inverse and imaging problems [26,41]. The prior distribution acts as a regularizer, and the ill-posedness of the imaging problem is naturally dealt with. Due to the imprecise prior knowledge of the solution and the presence of the data noise, the posterior distribution contains an ensemble of inverse solutions consistent with the observed data, which can be used to quantify the uncertainties associated with a point estimator, via, e.g., credible interval or highest probability density regions.
For imaging problems with Poisson data, a full Bayesian treatment is challenging, due to the nonnegativity constraint and high-dimensionality. There are several possible strategies. One idea is to use general-purposed sampling methods to explore the posterior state space, predominantly Markov chain Monte Carlo (MCMC) methods [31,36]. Then the constraints on the signal can be incorporated directly by discarding samples violating the constraint. However, in order to obtain accurate statistical estimates, sampling methods generally require many samples and thus tend to suffer from high computational cost, due to the high problem dimensionality. Further, the MCMC convergence is challenging to diagnose. These observations have motivated intensive research works on developing approximate inference techniques (AITs). In the machine learning literature, a large number of AITs have been proposed, e.g., variational inference [25,5,8,23,2], expectation propagation [33,32] and more recently Bayesian (deep) neural network [16]. In all AITs, one aims at finding an optimal approximate yet tractable distribution within a family of parametric/nonparametric probability distributions (e.g., Gaussian), by minimizing the error in a certain probability metric, prominently the Kullback-Leibler divergence. Empirically they can often produce reasonable approximations but at a much reduced computational cost than MCMC. However, there seem no systematic strategies for handling constraints in these approaches. For example, a straightforward truncation of the distribution due to the constraint often leads to elaborated distributions, e.g., truncated normal distribution, which tends to make the computation tedious or even completely intractable in variational Bayesian inference.
In this work, we develop a computational strategy for exploring the posterior distribution for Poisson data (with two popular nonnegativity constraints) with a Laplace type prior based on expectation propagation [33,32], in order to deliver a Gaussian approximation. Laplace prior promotes the sparsity of the image in a transformed domain, which is a valid assumption on most natural images. The main contributions of the work are as follows. First, we derive explicit update formulas in terms of one-dimensional integrals. It essentially exploits the rank-one projection form of the factors to reduce the intractable high-dimensional integrals to tractable one-dimensional ones. In this way, we arrive at two approximate inference algorithms, parameterized by either moment or natural parameters. Second, we derive stable and efficient quadrature rules for evaluating the resulting one-dimensional integrals, i.e., a recursive scheme for Poisson sites with large counts and an approximate expansion for Laplace sites, and discuss different schemes for the recursion, dependent of the integration interval, in order to achieve good numerical stability. Last, we illustrate the approach with comprehensive numerical experiments with the posterior distribution formed by Poisson likelihood and an anisotropic total variation prior, clearly showcasing the feasibility of the approach.
Last, we put the work in the context of Bayesian analysis of Poisson data. The predominant body of literature in statistics employs a log link function, commonly known as Poisson regression in statistics and machine learning (see, e.g., [7,2]). This differs substantially from the one frequently arising in medical imaging, e.g., positron emission tomography, and in particular the crucial nonnegativity constraint becomes vacuous. The only directly relevant work we are aware of is the recent work [27]. The work [27] discussed a full Bayesian exploration with EP, by modifying the posterior distributions using a rectified linear function on the transformed domain of the signal, which induces singular measures on the region violating the constraint. However, the work [27] does not consider the background.
The rest of the paper is organized as follows. In Section 2 we describe the posterior distribution for the Poisson likelihood function and a Laplace type prior. Then we give explicit expressions of the integrals involved in EP update and describe two algorithms in Section 3. In Section 4 we present stable and efficient numerical methods for evaluating one-dimensional integrals. Last, in Section 5 we present numerical results for two-dimensional inverse problems. In the appendices, we describe two useful parameterizations of a Gaussian distribution, Laplace approximation and additional comparative numerical results for a one-dimensional problem with MCMC and Laplace approximation.

Problem formulation
In this part, we give the Bayesian formulation for Poisson data, i.e., the likelihood function p(y|x) and prior distribution p(x), and discuss the nonnegativity constraint.
Let x ∈ R n be the (unknown) signal/image of interest, y ∈ R m1 + be the observed Poisson data, be the forward map, where the superscript t denotes matrix / vector transpose. The entries of the matrix A are assumed to be nonnegative. For example, in emission computed tomography, it can be a discrete analogue of Radon transform, or probabilistically, the entry a ij of the matrix A denotes the probability that the ith sensor pair records the photon emitted at the jth site.
The conditional probability density p(y i |x) of observing y i ∈ N given the signal x is given by + is the background. That is, the entry y i follows a Poisson distribution with a parameter a t i x + r i . The Poisson model of this form is popular in the statistical modeling of inverse and imaging problems involving counts, e.g., positron emission tomography [43]. If the entries of y are independent and identically distributed (i.i.d.), then the likelihood function p(y|x) is given by Note that the likelihood function p(y|x) is not well-defined for all x ∈ R n , and suitable constraints on x are needed in order to ensure the well-definedness of the factors p(y i |x)'s. In the literature, there are three popular constraints: Since the entries of A are nonnegative, there holds C 1 ⊂ C 2 ⊂ C 3 . In practice, the first assumption is most consistent with the physics in that it reflects the physical constraint that emission counts are non-negative. The last two assumptions were proposed to reduce positive bias in the cold region [29], i.e., the region that has zero count. In this work, we shall focus on the last two constraints.
The constraints C 2 and C 3 can be unified, which is useful for the discussions below.
Definition 2.1. For each likelihood factor p(y i |x) with the constraint C 2 , let For each likelihood factor p(y i |x) with the constraint C 3 , let Then the constraints C 2 and C 3 are both given by V + = ∩ i V + i and V − = R n \V + . With the indicator function 1 V + (x) of the set V + , we modify the likelihood function p(y|x) by This extends the domain of p(y|x) from V + to R n , and it facilitates a full Bayesian treatment. Since the indicator function 1 V + (x) admits a separable form, i.e., To fully specify the Bayesian model, we have to stipulate the prior p(x). We focus on a Laplace type prior. Let L ∈ R m2×n and L t i ∈ R n×1 be the ith row of L. Then a Laplace type prior p(x) is given by The parameter α > 0 determines the strength of the prior, playing the crucial role of a regularization parameter in variational regularization [22]. The choice of the hyperparameter α in the prior p(x) is notoriously challenging [22]. One may apply hierarchical Bayesian modeling in order to estimate it from the data simultaneously with q(x) [45,24,2]. The prior p(x) is commonly known as a sparsity prior (in the transformed domain), which favors a candidate with many small elements and few large elements in the vector Lx. The canonical total variation prior is recovered when the matrix L computes the discrete gradient. It is well known that the total variation penalty can preserve well edges in the image/signals, and hence it has been very popular for various imaging tasks [37,9]. By Bayes' formula, we obtain the Bayesian solution to the Poisson inverse problem, i.e., the posterior probability density function: The computation of Z is generally intractable for high-dimensional problems, and p(x|y) has to be approximated.

Approximate inference by expectation propagation
In this section, we describe the basic concepts and algorithms of expectation propagation (EP), for exploring the posterior distribution (2.1). EP due to Minka [33,32] is a popular variational type approximate inference method in the machine learning literature. It is especially suitable for approximating a distribution formed by a product of functions, with each factor being of projection form. Since its first appearance in 2001, EP has found many successful applications in practice, and it is reported to be very accurate, e.g., for Gaussian processes [35], and electrical impedance tomography with sparsity prior [18]. However, the theoretical understanding of EP remains quite limited [13,12].
EP looks for an approximate Gaussian distribution q(x) to a target distribution by means of an iterative algorithm. It relies on the following factorization of the posterior distribution p(x|y) (with m = m 1 + m 2 being the total number of factors): Note that each factor t i (x) is a function defined on the whole space R n . Likewise, we denote the Gaussian approximation q(x) to the posterior distribution p(x|y) by with each factort i (x) being a Gaussian distribution N (x|µ i , C i ), andZ is the corresponding normalizing constant. Below we use two different parameterizations of a Gaussian distribution, i.e., moment parameters (mean and covariance) (µ, C) and natural parameters (h, Λ); see Appendix A. Both parameterizations have their pros and cons: the moment one does not require solving linear systems, and the natural one allows singular Gaussians fort i (x).

Reduction to one-dimensional integrals
There are two main steps of one EP iteration: (a) forming a tilted distributionq i (x), and (b) updating the Gaussian approximation q(x) by matching its moments with that ofq i (x). The moment matching step can be interpreted as minimizing Kullback-Leilber divergence KL(q i ||q) [33,32,18]. Recall that the Kullback-Leibler divergence from one probability distribution p(x) to another q(x) is defined by [28] By Jensen's inequality, the divergence D KL (p||q) is always nonnegative, and it vanishes if and only if p(x) = q(x) almost everywhere. The task at step (a) is to construct the ith tilted distributionq i (x). Let q \i (x) be the ith cavity distribution, i.e., the product of all but the ith factor, and defined by , whose moment and natural parameters are denoted by (µ \i , C \i ) and (h \i , Λ \i ), respectively. Then the ith tilted distributionq i (x) of the approximation q(x) is given byq dx is the corresponding normalizing constant. With the exclusioninclusion step, one replaces the ith factort i (x) in the approximation q with the exact one t i (x).
The task at step (b) is to compute moments of the ith tilde distributionq i (x), which are then used to update the approximation q(x). This requires integration over R n , which is generally numerically intractable, ifq i (x) were arbitrary. Fortunately, each factor t i (x) in (3.1) is of projection form and depends only on the scalar u t x, with the vector u ∈ R n being either a i or L i . This is the key fact to render relevant high-dimensional integrals numerically tractable. Below we write the factor t i (x) as t i (u t i x) and accordingly, the ith cavity functionq i (x) aŝ upon replacing j =it i (x) with its normalized version N (x|µ \i , C \i ), and accordingly the normalizing constantẐ i . Since a Gaussian is determined by its mean and covariance, it suffices to evaluate the 0th to 2nd moments ofq i (x). The projection form of the factor t i allows reducing the moment evaluation ofq i (x) to 1D integrals. Theorem 3.1 gives the update scheme for q(x) fromq i (x).
Then with the auxiliary variabless ∈ R and C s defined bȳ are given respectively by Similarly, the precision mean hq i and precision Λq i are given respectively by Proof. The expressions forẐ i , µ and C were given in [18,Section 3]. Thus it suffices to derive the formulas for (h, Λ). Recall the Sherman-Morrison formula [20, p. 65]: for any invertible B ∈ R n×n , u, v ∈ R n , there holds Then the precision matrix Λ is given by Similarly, the precision mean h := Λµ is given by This completes the proof of the theorem.
In both approaches, the 1D integrals (Z s ,s, C s ) are needed, which depend on u t i µ \i and u t i C \i u i . A direct approach is first to downdate (the Cholesky factor of) Λ and then to solve a linear system. In practice, this can be expensive and the cost can be mitigated. Indeed, they can be computed without the downdating step; see Lemma 3.1 below. Below we use the super-or subscript n and o to denote a variable updated at current iteration from that of the last iteration.
be the natural parameter of q(x) and (λ 1,i , λ 2,i ) be defined in Theorem 3.1. Then the mean u t i µ \i and variance u t i C \i u i of the Gaussian distribution N (s|u t i µ \i , u t i C \i u i ) are respectively given by Proof. We suppress the sub/superscript o. By the definition of u t i C \i u i and the Sherman-Morrison formula (3.3), we have and similarly, we have This completes the proof of the lemma.
Since the quantities for the 1D integrals can be calculated from variables updated in the last iteration, it is unnecessary to form cavity distributions. Indeed, the cavity precision is formed by Λ \i = Λ o −λ o 2,i u i u t i , and the updated precision is given by Λ n = Λ \i + λ n 2,i u i u t i ; and similarly for h. Thus, we can update Λ directly with (λ o 2,i , λ n 2,i ) and h with (λ o 1,i , λ n 1,i ); this is summarized in the next remark.
Remark 3.1. The differences λ n k,i − λ o k,i , k = 1, 2, can be used to update the natural parameter (h, Λ): Moreover, the sign of λ n 2,i − λ o 2,i determines whether to update or downdate the Cholesky factor of Λ.

Update schemes and algorithms
Now we state the direct update scheme, i.e. without explicitly constructing the intermediate cavity distribution q \i (x), for both natural and moment parameterizations.
Let (h, Λ) and (µ, C) be the natural and moment parameters of the Gaussian approximation q(x), respectively. The following update schemes hold.
(i) The precision mean h and precision Λ can be updated by (ii) The mean µ and covariance C can be updated by Proof. The first assertion is direct from Theorem 3.1 and Remark 3.1, and it can be rewritten as By Sherman-Morrison formula (3.3), the covariance C n = Λ −1 n is given by where the scalar η 2 := −( where the second identity follows from Remark 3.1. Similarly, the mean µ n := C n h n is given by where, in view of Remark 3.1, This completes the proof of the theorem. All matrix operations in Theorem 3.2 are of rank one type, which can be implemented stably and efficiently with the Cholesky factors and their update / downdate; see Section 3.3 for details. Thus, in practice, we employ Cholesky factors of the precision Λ and covariance C, denoted by Λ chol and C chol , respectively, instead of Λ and C. Further, we also use the auxiliary variables (λ 1,i , λ 2,i ) defined in Theorem 3.1, and stack {(λ 1,i , λ 2,i )} m1+m2 i=1 into two vectors which are initialized to zeros. Thus, we obtain two inference procedures for Poisson data with a Laplace type prior in Algorithms 1 and 2.
The rigorous convergence analysis of EP is outstanding. Nonetheless, empirically, it often converges very fast, which is also observed in our numerical experiments in Section 5. In practice, one can terminate the iteration by monitoring the relative change of the parameters or fixing the maximum number K of iterations. The important task of computing 1D integrals will be discussed in Section 4 below. Randomly choose an index i to update;

5:
Compute the mean and variance for 1D Gaussian integral by Lemma 3.1;

9:
Check the stopping criterion. Randomly choose an index i to update;

5:
Compute the mean and variance for 1D Gaussian integral by Lemma 3.1;

Efficient implementation and complexity estimate
The rank-one matrix update A ± βuu t , for A ∈ R n×n , u ∈ R n and β > 0, can be stably and efficiently updated / downdated with the Cholesky factor of A with √ βu. The update step of A can be viewed as an iteration from A k to A k+1 . Let the upper triangular matrices R k and R k+1 be the Cholesky factors of A k and A k+1 respectively, i.e., A k = R t k R k and A k+1 = R t k+1 R k+1 . There are two possible cases: The update/downdate is available in several packages. For example, in MATLAB, the function cholupdate implements the update/downdate of Cholesky factors, based on LAPACK subroutines ZCHUD and ZCHDD. Next, we discuss the computational complexity per inner iteration. The first step picks one index i, which is of constant complexity. For the second step, i.e., computing the mean and variance for 1D integrals, the dominant part is linear solve involving upper triangular matrices and matrix-vector product for natural and moment parameters. For either parameterization, it incurs O(n 2 ) operations. The third step computess and C s from the one dimensional integrals. For Poisson site, the complexity is O(y i ), and for Laplace site, it is O(1). Last, the fourth step is dominated by Cholesky factor modifications, and its complexity is O(n 2 ). Overall, the computational complexity per inner iteration is O(n 2 + y i ). In a large data setting, y i n, and thus the complexity is about O(n 2 ). In passing, we note that in practice, the covariance / precision matrix may admit additional structures, e.g., sparsity, which translate into structures on Cholesky factors. For the general sparsity assumption, it seems unclear how to effectively exploit it for Cholesky update/downdate for enhanced efficiency, except the diagonal case, which can be incorporated into the algorithm straightforwardly.

Stable evaluation of 1d integrals
Now we develop a stable implementation for the three 1D integrals: Z s ,s and C s in Theorem 3.1. These integrals form the basic components of Algorithms 1 and 2, and their stable, accurate and efficient evaluation is crucial to the performance of the algorithms. By suppressing the subscript i, we can write the integrals in a unified way: where the factor t(s) is either Poisson likelihood or Laplace prior. Then we can expresss and C s in terms of J j bys Note that the normalizing constants in J j cancel out ins and C s , and thus they can be ignored when evaluating the integrals. In essence, the computation boils down to stable evaluation of moments of a (truncated) Gaussian distribution. This task was studied in several works [10,38]: [10] focuses on Gaussian moments, and [38] discusses also evaluating the integrals involving Laplace distributions. Below we derive the formulas for the (constrained) Poisson likelihood and Laplace prior separately.

Poisson likelihood
Throughout, we suppress the subscript i, write V + etc in place of V + i etc and introduce the scaler variable s = a t x. Then the constraint on x transfers to that on s: a t x > 0 corresponds to s > 0 and a t x + r > 0 to s > −r, respectively. We shall slightly abuse the notation and use 1 V+ (s) as the indicator for the constraint on s. Then the Poisson likelihood t(x) can be equivalently written in either x or s as t(x) = (a t x + r) y e −(a t x+r) y! 1 V+ (x) and t(s) = (s + r) y e −(s+r) y! 1 V+ (s).
Note that the factorial y! cancels out when computings and C s , so it is omitted in the derivation below. For a fixed N (s|m, σ 2 ), the integrals J y,j depend on the observed count data y and moment order j: where the lower integral bound b = 0 or b = −r, which is evident from the context. Note that the terms e −(s+r) and N (s|m, σ 2 ) in J y,j together give an unormalized Gaussian density. This allows us to reduce the integrals J y,j into (truncated) Gaussian moment evaluations of the type: and accordinglys and C s . This is given in the next result.
The desired identities follow from the definitions and the recursions in (4.1) bȳ This completes the proof.
However, directly evaluating I y can still be numerically unstable for large y. To avoid the potential instability, we develop a stable recursive scheme on I y .
Lemma 4.1. For y ≥ 2, the following recursion holds . The definition of I y implies Next we employ the trivial identity d ds f (s) = − s−c d f (s) and apply integration by parts to the first term Collecting the terms shows the desired recursion on the integral I y .
Lemma 4.1 uses a two-term linear recurrence relation for I y 's. The coefficients of I y−1 and I y−2 are raised by power when expanding I y in terms of I 0 and I 1 , and thus the computation of I y is susceptible to the evaluation errors of I 0 and I 1 for large y. This motivates a reciprocal recursive scheme by introducing a ratio sequence {L y } y defined by L y = yIy−1 Iy , for r = 0 or b = −r, in order to restore the numerical stability. Note that L y also admits a recursive scheme L y = y (m−σ 2 +r)+σ 2 Ly−1 , and further I y can be recovered from {L y } by ln I y = ln y! + ln I 0 − y i=1 L i . We can computes and C s directly from L y . The identities follow from straightforward computation. Last, we discuss the computation of the first three integrals I 0 , I 1 and I 2 , which are needed for the recursion. We employ three different forms according to the integration range with respect to the auxiliary variable The formulas are listed in Table 1, where erf and erfc denote the error function and complementary error function, respectively, and erfcx(η) = e η 2 (1 − erf(η)). Since the value of 1 − erf(η) is vanishingly small for large η value, we use Scheme 2 to avoid underflow. Scheme 3 is useful when the η value is large, since both 1 − erf(η) and erfc(η) suffer from numerical underflow. Note that when η is small, Scheme 3 is not as accurate as Scheme 2, so we use Scheme 2 in the intermediate range.

Laplace potential
Now we derive the formulas for evaluating the 1D integrals for the Laplace potential t(x) = α 2 e −α| t x| . For any fixed ∈ R n , we divide the whole space R n into two disjoint half-spaces V + and V − , i.e., R n = V + ∪V − , with V + = {x| t x > 0} and V − = {x| t x ≤ 0}. Then we split the Laplace potential t(x) into The integrals involving t(x)N (s|µ, σ 2 ) (slightly abusing µ) can be divided into two parts: By the change of variable t = s−µ±ασ 2 σ for I ± i respectively, we have These integrals can be expressed using the cumulative distribution function Φ of the standard Gaussian distribution. We shall view I ± i as functions of µ and let I i = I + i (µ) + (−1) i I + i (−µ). Then we havē s = I 1 I 0 and C s = To avoid the potential underflow of direct evaluation of Φ, we use the following well known (divergent) asymptotic expansion [1, item 7.1.23] . This formula follows by integration by parts, and allows accurate evaluation for large positive η. It was shown in [17] that the error of evaluating 1 − Φ(η) with a truncation of the asymptotic expansion is less than 10 −11 for η > 5 with more than 8 terms in the summation of g(η). For η ≤ 5, 1 − Φ(η) can be accurately evaluated directly. Then we introduce a ratio .
With the ratio β, the two fractions I1 I0 and I2 I0 can be evaluated by To avoid potential numerical instability of the first term in I2 I0 , we use the identity .

Numerical experiments
Now we numerically illustrate the EP algorithm on realistic images. In the implementation, we employ the natural parameter parameterization, i.e., Algorithm 1, which appears to be numerically more robust. We measure the accuracy of a reconstruction x * relative to the ground truth x † by the standard L 2 -error ||x * − x † || 2 , the structural similarity (SSIM) index (by MATLAB built-in ssim), and peak signal-to-noise ratio (PSNR) (by MATLAB built-in psnr with peak value 1). For comparison, we also present MAP, computed by a limited-memory BFGS algorithm [30] with constraint C 1 . The hyperparameter α in the prior p(x) is determined in a trial-and-error manner.  Tables 2 and 3. The EP mean is mostly comparable with MAP in all three metrics, and the reconstruction quality improves steadily as the number of projection angles increases. These results show clearly the feasibility of EP for images of realistic size. Interestingly, the shape of the EP variance resembles closely that of the phantom. This might indicate that the algorithm is rather certain in the cold regions where the error is close to zero and more uncertain about the region where the error is potentially larger.
To further illustrate the approximation, we plot the cross-sections in Fig. 5 and 95% highest posterior denstiy (HDP) region estimated from the EP covariance. The EP mean is close to MAP, and thus also suffers slightly from a reduced magnitude, as is typical of the total variation penalty in variational regularization. This also concurs with the error metrics in Tables 2 and 3. The thrust of EP is that it can also provide uncertainty estimates via covariance, which is directly unavailable from MAP. In sharp contrast, the popular Laplace approximation (see Appendix B) can fail to yield a reasonable   approximation for nonsmooth priors, whereas MCMC is prohibitively expensive for large images, though being asympotically exact; see Appendix C for further numerical results. So overall, EP represents a computationally feasible approach to deliver uncertainty estimates. Next, we present an experimental evaluation of the convergence of the EP algorithm, which is a long outstanding theoretical issue, on the following setup: Shepp-Logan phantom and Radon matrix A ∈ R 4255×16384 (i.e., 185 projections per angle and [0 : 8 : 179]). We denote the mean and covariance after k outer iterations by µ k and C k , respectively, and the converged iterate tuple by (µ * , C * ). The EP mean µ k converges rapidly, and visually it reaches convergence after five iterations since thereafter the    Fig. 7 shows the errors of the iterate tuple (µ k , C k ) with respect to (µ * , C * ), where the errors δµ = µ k − µ * and δC = C k − C * are measured by the L 2 -norm and spectral norm, respectively. This phenomenon is also observed for all other experiments, although not presented. Hence, both mean and covariance converge rapidly, showing the steady and fast convergence of EP. Last, we illustrate the inference procedure with data taken from Michigan Image Reconstruction  These numerical results with different experimental settings show clearly that EP can provide comparable point estimates with MAP as well as uncertainty information by means of the variance estimate.

Conclusion
In this work, we have developed inference procedures for the constrained Poisson likelihood arising in emission tomography. They are based on expectation propagation developed in the machine learning community. The detailed derivation of the algorithms, complexity and their stable implementation are given for a Laplace type prior. Extensive numerical experiments show that the EP algorithm (with natural parameters) converges rapidly and can deliver an approximate posterior distribution with the approximate mean comparable with MAP, together with uncertainty estimate, and can scale to realistic real images. Thus, the approach can be viewed as a feasible fast alternative to the general-purposed but expensive MCMC for rapid uncertainty quantification with Poisson data. There are several avenues for future works. First, it is of enormous interest to analyze the convergence rate and accuracy of EP, and more general approximate inference techniques, e.g., variational Bayes, which have all achieved great practical successes but largely defied theoretical analysis. Second, it is important to further extend the flexibility of EP algorithms to more complex posterior distributions, e.g., lack of projection form. One notable example is isotropic total variation prior that appears frequently in practical imaging algorithms. This may require introducing an additional layer of approximation in the spirit of iteratively reweighed least-squares or Monte Carlo computation of low-dimensional integrals. Third, many experimental studies show that EP converges very fast, with convergence reached within five outer iterations for the Poisson model under considerations. However, the overall O(mn 2 ) computational complexity is still very high for all current implementations [19], and not scalable well to truly large images. Hence, it is of great interest to accelerate the algorithms, e.g., low-rank structure of the map A and diagonal dominance of the posterior covariance.

A Parameterizing Gaussian distributions
For a Gaussian N (x|µ, C) with mean µ ∈ R n and covariance C ∈ S n + , the density π(x|µ, C) is given by where the parameters Λ ∈ S n + , h ∈ R n and ζ ∈ R are respectively given by Λ = C −1 , h = Λµ, and ζ = − 1 2 (n log 2π + log |Λ| + µ t Λµ). Thus, the density π(x|µ, C) is also uniquely defined by Λ and h. In the literature, Λ is often referred to as the precision matrix and h as the precision mean. and the pair (h, Λ) is called the natural parameter of a Gaussian distribution.
It is easy to check that the product of k Gaussians {N (x|µ k , C k )} m k=1 is also a Gaussian N (x|µ, C) after normalization, and the mean µ and covariance C of the product are given by

B Laplace approximation
In the engineering community, one popular approach to approximate the posterior distribution p(x|y) is Laplace approximation [42,4]. It constructs a Gaussian approximation by the second-order Taylor expansion of the negative log-posterior − log p(x|y) around MAPx. Upon ignoring the unimportant constant and smoothing the Laplace potential, the negative log-posterior J(x) is given by where > 0 is a small smoothing parameter to restore the differentiability. The gradient ∇J(x) and Hessian ∇ 2 J(x) are given respectively by Since ∇J(x) = 0, the Taylor expansion reads and ∇ 2 J(x) approximates the precision matrix. When |L t ix |, the second term in ∇ 2 J(x) can be negligible and thus the Hessian of the negative log-likelihood is dominating; whereas for |L t ix |, the second term is dominating. In either case, the approximation is problematic. In practice, it is also popular to combine smoothing with an iterative weighted approximation (e.g., lagged diffusivity approximation [44]) by fixing ((L t i x) 2 + 2 ) 1/2 in ∇J(x) at ((L t ix ) 2 + 2 ) 1/2 , which leads to a modified Hessian: The Hessians ∇ 2 J(x) and ∇ 2 J(x) will be close to each other, if |L t ix | are all small, which is expected to hold for truly sparse signals, i.e., L t i x ≈ 0 for i = 1, . . . , m 2 . One undesirable feature of Laplace approximation is that the precision approximation depends crucially on the smoothing parameter .

C Comparison with MCMC and Laplace approximation
Numerically, the accuracy of EP has found to be excellent in several studies [35,18], although there is still no rigorous justification. We provide an experimental evaluation of its accuracy with Markov chain Monte Carlo (MCMC) and Laplace approximation. The true posterior distribution p(x|y) can be explored by MCMC [31,36]. However, usually a large number of samples are required to obtain reliable statistics. Thus, to obtain further insights, we consider a one-dimensional problem, i.e., a Fredholm integral equation of the first kind [34] over the interval [−6, 6] with the kernel K(s, t) = φ(s − t) and exact solution x(t) = φ(t), where φ(s) = 10 + 10 cos π 3 sχ [− 3,3] . It is discretized by a standard piecewise constant Galerkin method, and the resulting problem is of size 100, i.e., x ∈ R 100 and A ∈ R 100×100 . We implement a random walk Metropolis-Hastings sampler with Gaussian proposals, and optimize the step size so that the acceptance ratio is close to 0.23 in order to ensure good convergence [6]. The hyperparameter α in the prior distribution is set to 1. The chain is run for a length of 2 × 10 7 , and the last 10 7 samples are used for computing the mean and covariance.
To compare the Gaussian approximation by EP and MCMC results, we present the mean, MAP, covariance and 95% HPD region. Both approximations concentrate in the same region, and the shape and magnitude of 95% HPD / covariance are mostly comparable; see Figs. 10 and 11, showing the validity of EP. However, there are noticeable differences in the recovered mean: the EP mean is nearly piecewise constant, which differs from that by MCMC. So EP gives an intermediate approximation between the MAP and posterior mean. In comparison with MAP, EP provides not only a point estimate, but also the associated uncertainty, i.e., covariance. Interestingly, the covariance is clearly diagonal dominant, which suggests the use of a banded covariance or its Cholesky factor for speeding up the algorithm. The Laplace approximation described in Appendix B depends heavily on the smoothing parameter , and clearly there is a tradeoff between accuracy of MAP and the variance approximation; see Fig. 12 for the numerical results corresponding to four different smooth parameters , based on the approximation (B.1). This tradeoff is largely attributed to the nonsmooth Laplace type prior, which pose significant challenges for constructing the approximation. Thus, it fails to yield a reasonable approximation to the target posterior distribution. In contrast, the EP algorithm only involves integrals, which are more amenable to non-differentiability, and thus can handle nonsmooth priors naturally.
In passing, we note that the uncertainty estimate from the posterior distribution differs greatly from the concept of noise variance [15], which is mainly concerned with the sensitivity of the reconstruction with respect to the noise in the input data y. It is derived using chain rule and implicit function theorem, under the assumptions of good smoothness and local strong convexity of the associated functional [15]. In contrast, the uncertainty in the Bayesian framework stems from imprecise knowledge of the inverse solution encoded in the prior and the statistics of the data. Thus, the results of these two approaches are not directly compared.