A variational inference framework for inverse problems

A framework is presented for fitting inverse problem models via variational Bayes approximations. This methodology provides flexibility in statistical model specification for a broad range of applications, good accuracy and reduced model fitting times. The message passing and factor graph fragment approach to variational Bayes that we describe facilitates streamlined implementation of approximate inference algorithms and allows for flexible inclusion of numerous response distributions and penalizations into the inverse problem model. Models for one- and two-dimensional response variables are examined and an infrastructure is laid down where efficient algorithm updates based on nullifying weak interactions between variables can also be derived for inverse problems in higher dimensions. An image processing application and a simulation exercise motivated by biomedical problems reveal the computational advantage offered by efficient implementation of variational Bayes over Markov chain Monte Carlo.


Introduction
Inverse problems are essentially statistical regression problems where a response depending on a number of parameters is measured and the goal is to interpret the parameter estimates, rather than predict the outcome. Stable fitting of inverse problems is crucial, but it is generally hindered by a large number of parameters and the presence of highly correlated predictors.
Let y denote a vector of data and suppose these data are related to a vector of unknown parameters x by a linear regression problem E(y) = Kx, where K is given, or a nonlinear one such that E(y) = g(x), where g is a known function. From a Bayesian perspective, model fitting can be performed by placing a prior on x and then finding the maximum a posteriori estimate x̂ = argmax_x p(x|y). This appears straightforward in principle; however, in typical applications, inverse problems may be ill-posed in the sense that the solution does not exist, is not unique or does not depend smoothly on the data, since small noise variations can produce significantly different estimates (Hadamard, 1902). A remedy is to introduce a penalization in the model formulation and use Bayesian hierarchical models, but these can be slow to fit via standard Markov chain Monte Carlo methods. To overcome this issue, we propose and study variational Bayes methods. The direct and message passing approaches to variational Bayes we examine facilitate inverse problem fitting in Bayesian settings with reduced computational times.
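To make the instability concrete, the following R sketch (ours, not from the paper; the kernel and penalty constant are purely illustrative) contrasts a naive solve of y = Kx with a ridge-type penalized solve when two datasets differ only by tiny noise.

# A minimal R sketch (ours, with illustrative kernel and penalty constant)
# of ill-posedness: a naive solve of y = K x is unstable under tiny noise
# perturbations, while a ridge-type penalized solve is not.
set.seed(1)
m <- 50
K <- outer(1:m, 1:m, function(i, j) exp(-(i - j)^2 / 8))
K <- K / rowSums(K)                       # a smoothing (blurring) operator
x.true <- sin(seq(0, 2 * pi, length.out = m))
y1 <- K %*% x.true + rnorm(m, sd = 1e-3)  # two data sets differing only
y2 <- K %*% x.true + rnorm(m, sd = 1e-3)  # by tiny noise variations
range(solve(K, y1) - solve(K, y2))        # wildly different estimates
lambda <- 1e-2
pen <- function(y) solve(crossprod(K) + lambda * diag(m), crossprod(K, y))
range(pen(y1) - pen(y2))                  # nearly identical estimates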
The use of variational Bayesian methods for inverse problems has been demonstrated in the literature on neural source reconstruction, including Sato et al. (2004), Kiebel et al. (2008), Wipf and Nagarajan (2009) and Nathoo et al. (2014). Approximate inference methods motivated by a broader class of inverse problem applications are in their infancy. A small but growing literature includes McGrory et al. (2009), Gehre and Jin (2014) and Guha et al. (2015). Arridge et al. (2018) and Zhang et al. (2019) respectively study the use of Gaussian variational approximations and expectation propagation to fit inverse problem models with Poisson responses. Tonolini et al. (2020) propose a framework to train variational inference for imaging inverse problems exploiting existing image data. Agrawal et al. (2022) study variational inference for inverse problems with gamma hyperpriors. Povala et al. (2022) present a stochastic variational Bayes approach based on sparse precision matrices.
The state of the art in approximate inference for inverse problems is to derive and code algorithm updates from scratch each time a model is modified. The message passing on factor graph fragment approach to variational Bayes we suggest in this work overcomes this issue. Wand (2017) spearheaded the adoption of this approach to fast approximate inference in regression-type models via variational message passing (VMP). In the same spirit, we lay down similar infrastructure for inverse problems and propose VMP as an alternative to the more common mean field variational Bayes (MFVB). We show how to perform approximate inference by combining algorithms for single factor graph components, or fragments, that arise from inverse problem models. The resultant factor graph fragments facilitate streamlined implementation of fast approximate algorithms and form the basis for software development for use in applications. The factor graph fragment paradigm allows for easy incorporation of different penalization structures in the model or changes to the distribution of the outcome variable. In fact, VMP on factor graph fragments is such that the corresponding algorithms only need to be derived once for a particular fragment and can then be used for any arbitrarily complex model that includes that fragment, dramatically reducing set-up overheads as well as providing fast implementation.
Motivated by a real biomedical problem, we identify a base inverse problem model and describe how to efficiently perform MFVB and VMP. The application we present concerns medical positron emission tomography imaging, where the raw data are processed for image enhancement. The data were collected to illustrate a small animal imaging system which can be used in biotechnology and pre-clinical medical research to help detect tumors or organ dysfunctions. An application to two-dimensional deconvolution problems motivated by an archaeological exploration is also built upon the base framework by varying the response and penalization distributional assumptions. This is illustrated in the supplementary material.

Overview of the article
Section 2 defines a reference inverse problem model for illustrating the methodology in use and our computational developments. The variational approximation engine for inverse problem fitting and inference is introduced in Section 3. Section 4 examines strategies to streamline variational inference algorithms. An application to real biomedical data is treated in Section 5. The same section reports results from a simulation study resembling the analyzed biomedical dataset. The supplementary material provides an illustration concerning archaeological data performed via VMP, where the Normal response and Laplace penalization of the base model are replaced by a Skew Normal distribution for the outcome variable and a Horseshoe penalization. Concluding remarks and extensions are discussed in Section 6.
Before setting up our reference linear inverse problem model and presenting variational algorithms for approximate model fitting we introduce some useful notation.

Useful notation
For a matrix A of size d_1 × d_2, vec(A) is the d_1 d_2 × 1 vector obtained by stacking the columns of A underneath each other in order from left to right. If a is a d_1 d_2 × 1 vector then vec^{-1}_{d_1 × d_2}(a) is the d_1 × d_2 matrix such that vec{vec^{-1}_{d_1 × d_2}(a)} = a; when the inverse of the vec operator produces a square matrix the subscript is omitted. Vectors of d zeros or ones are respectively denoted by 0_d and 1_d.

Base inverse problem model
We consider linear inverse problems having the following formulation:

y = Kx + ε,    (1)

where y is an m × 1 vector of observed data, K is a matrix acting as a linear operator of size m × m, x is an m × 1 vector of unknown parameters and ε is a Normal error vector of length m. For ease of illustration, we focus on the case where the vectors y and x have equal length m and therefore K is a square matrix.
Nevertheless, the methodology presented here can be adapted to the situation in which y has length n different from, and typically smaller than, the length m of x. Motivated by our biomedical application, we focus on a particular type of forward problem where K is a kernel matrix. Another formulation of K is discussed in the application to archaeological data treated in the supplementary material.
We assume that the vector of observations y has a one-to-one correspondence with the vector of parameters x = (x_1, ..., x_m) to be estimated. For simplicity, here we only model first nearest neighbor interactions, or differences, between elements of x. If these elements are identified by a system of coordinates in two dimensions, then first nearest neighbor interactions are the differences between one parameter and those in adjacent locations along the horizontal and vertical coordinates.
Suppose the aim is to study a linear inverse problem in a Bayesian framework according to the model

y | x, σ_ε² ~ N(Kx, σ_ε² I_m),    x_Δ | b, σ_x² ~ N(0_d, σ_x² diag(b)^{-1}),
b_j ~ ind. Inverse-χ²(2, 1), j = 1, ..., d,
σ_ε² | a_ε ~ Inverse-χ²(1, 1/a_ε),    a_ε ~ Inverse-χ²(1, 1/A_ε²),    (2)
σ_x² | a_x ~ Inverse-χ²(1, 1/a_x),    a_x ~ Inverse-χ²(1, 1/A_x²),

where K is a matrix of size m × m, x_Δ is a vector of d differences between pairs of elements in x and A_ε, A_x > 0 are user-specified hyperparameters. The auxiliary variables a_ε and a_x generate Half-Cauchy(A_ε) and Half-Cauchy(A_x) priors on the scale parameters σ_ε and σ_x, respectively. Specifically, the density function of a random variable σ > 0 having a Half-Cauchy(A) distribution is p(σ) = 2/[Aπ{1 + (σ/A)²}], with A > 0. For problems that only contemplate first nearest neighbor differences, the scalar d coincides with the number of unique (up to sign) differences between pairs of elements of x coming from adjacent locations.
In the one-dimensional case, x can be interpreted as a vector matching m spatial locations on a line and the number of differences between adjacent locations is d = m − 1. Model (2) also encompasses higher-dimensional problems. In bidimensional settings, x can be conveniently expressed as the vectorization of a grid, or matrix, of pixels X by setting x = vec(X). If X has size m_1 × m_2, the number of first nearest neighbor differences is d = m_1(m_2 − 1) + m_2(m_1 − 1). In the simple example of Figure 1, where X is of size 3 × 4, the numbers of horizontal and vertical differences are respectively 9 and 8, giving d = 17 differences in total.
In a similar vein, the model can be applied to three-dimensional problems by letting x be the vectorization of voxel-type data.
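As a quick illustrative check in R (our own snippet) of the difference counts just described, the 3 × 4 grid of Figure 1 gives d = 17:

# A quick check (ours) of the difference count for the 3 x 4 grid of Figure 1.
m1 <- 3; m2 <- 4
d.horizontal <- m1 * (m2 - 1)    # 9 horizontal differences
d.vertical <- m2 * (m1 - 1)      # 8 vertical differences
d.horizontal + d.vertical        # d = 17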
The distributional assumption on x_Δ in model (2) can be conveniently re-expressed as

p(x | b, σ_x²) ∝ (σ_x²)^{-d/2} { ∏_{j=1}^d b_j^{1/2} } exp{ −x^T L^T diag(b) L x / (2σ_x²) },    (3)

where L is a d × m contrast matrix such that x_Δ ≡ Lx and b = (b_1, ..., b_d). For instance, in one-dimensional problems contemplating first nearest neighbor interactions, the contrast matrix can be defined as the (m − 1) × m matrix L_1D whose jth row has −1 in position j, 1 in position j + 1 and zeros elsewhere,    (4)

so that (L_1D x)_j = x_{j+1} − x_j, where m − 1 is the number of unique (up to sign) differences between adjacent elements of x. In practice there is no need to compute a contrast matrix, although for deriving variational algorithms it is useful to carry L around. As for the matrix defined in (4), it is convenient to design L as a matrix whose numbers of rows and columns are respectively equal to the number of differences d and the length m of x, also in higher-dimensional problems. Model (2) also incorporates a Laplace penalization that originates from the auxiliary variables b_j > 0, j = 1, ..., d, and the following result.
Result 1. Let x and b be random variables such that x | b ~ N(0, σ²/b) and b ~ Inverse-χ²(2, 1). Then x ~ Laplace(0, σ).
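Result 1 can be checked by simulation. The R sketch below (ours) draws b from the Inverse-χ²(2, 1) distribution, using the fact that 1/b is then Exponential with rate 1/2, draws x | b as in the representation reconstructed above, and compares empirical quantiles with Laplace(0, σ) quantiles.

# A Monte Carlo check (ours) of Result 1 as reconstructed above:
# b ~ Inverse-chi-squared(2, 1) means 1/b ~ Exponential(rate 1/2),
# and x | b ~ N(0, sigma^2 / b) should then be Laplace(0, sigma).
set.seed(1)
n <- 1e6; sigma <- 2
b <- 1 / rexp(n, rate = 1/2)
x <- rnorm(n, mean = 0, sd = sigma / sqrt(b))
p <- c(0.1, 0.25, 0.75, 0.9)
q.laplace <- -sigma * sign(p - 0.5) * log(1 - 2 * abs(p - 0.5))
rbind(empirical = quantile(x, p), laplace = q.laplace)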
The choice of imposing priors on the differences x_Δ is mainly motivated by the two-dimensional applications to biomedical and archaeological imaging considered in this work. Such a choice gives adequate smoothing for flat regions (i.e., it increases image deblurring), but it may oversmooth discontinuities (see Section 3.2 of Aykroyd et al., 2001, for a detailed discussion of this and remedies). We focus on first-neighbor differences to illustrate efficient computation of the variational algorithm updates through removal of the contrast matrix L as per Section 4. Another widely used neighborhood system is the second order one, which is based on the eight nearest neighbors (e.g., Green, 1990) and also provides ground for efficient algorithm implementations.
If the dimension of x increases, the number of first neighbor differences increases and computational efficiency is reduced. In practice this issue is often addressed by partitioning the surface where the observations are collected into smaller regions. Sometimes this is also done to facilitate the application of the inverse problem to irregular (e.g., non-rectangular) surfaces. For example, when large archaeological fields are explored it is standard practice to divide the area into grids and examine each grid as soon as it has been surveyed; the inverse problem reconstruction of each grid is typically extended to part of the neighboring grids in each direction to obtain a smoother reconstruction (Aykroyd et al., 2001, Section 5).
Figure 1: Example of a 3 × 4 grid of pixels with 9 horizontal differences marked by black arrows and 8 vertical differences in grey.The total number of first order nearest neighbor differences is 17.
The design of the matrix K varies according to the inverse problem characteristics. As an illustration, and for later use on real biomedical data, we consider a Gaussian kernel matrix K and show its application to a simple unidimensional problem. If y is a vector of m recordings from a unidimensional space having one-to-one correspondence with x, the (i, j)th entry of a Gaussian kernel matrix K is given as follows:

K_ij ∝ exp{ −(i − j)² / (2δ²) },    (5)

where δ > 0 is a parameter that governs the amount of blur. Here δ is assumed to be fixed, but it can be estimated within the variational framework by imposing a prior and selected using, for example, a posterior mean estimate. Since this parameter is required to be positive, sensible choices of prior are gamma and inverse gamma distributions (Weir, 1997; Aykroyd et al., 2001). Alternatively, to limit computational costs, one could first obtain a discrete approximation to the posterior of δ over a discrete grid of δ values.
To illustrate the effect of variations in δ, consider the Blocks test function (Donoho and Johnstone, 1994; Nason, 2008) from the wavethresh package (Nason, 2022) available in R (R Core Team, 2023). The piecewise constant nature of this function makes estimation a very challenging problem, especially when tackled as an inverse problem, but it is well motivated by stratigraphy problems in archaeology (Allum et al., 1999; Aykroyd et al., 2001). Figure 2 shows three examples produced through (1) and the Gaussian kernel (5), with m = 100, σ = 1 and δ = 0, 2, 5. In each plot the red dashed line shows the true Blocks function to estimate. In the plot corresponding to δ = 0 blurring is not added to the generation process and the data points are randomly scattered around the true function. Blurring is introduced when δ > 0 and points scatter around a rounded solid line. The cases δ = 2 and δ = 5 respectively correspond to moderate and large blurring of the underlying true function, and hence moderate and difficult inverse problems. In two dimensions, assuming the observed data are stored in a matrix Y, each element of a Gaussian kernel matrix K links an element of y = vec(Y) to one of x = vec(X). The entry of K corresponding to a pair (Y_ij, X_i′j′) is given by

exp[ −{(i − i′)² + (j − j′)²} / (2δ²) ],    (6)

up to a normalizing factor. The effect of δ on blurring is analogous to the one described for unidimensional problems.
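The following R sketch shows how data such as those in Figure 2 can be generated; the row normalization of the kernel and the use of DJ.EX from wavethresh to obtain the Blocks function are our own choices, not necessarily the paper's.

# An R sketch generating data as in Figure 2, under the kernel form (5);
# the row normalization of K and obtaining Blocks via DJ.EX are assumptions.
library(wavethresh)
m <- 100
x.true <- DJ.EX(n = m, plotfn = FALSE)$blocks
make.K <- function(m, delta) {
  if (delta == 0) return(diag(m))           # no blur
  K <- outer(1:m, 1:m, function(i, j) exp(-(i - j)^2 / (2 * delta^2)))
  K / rowSums(K)                            # assumed normalization
}
set.seed(1)
data.sets <- lapply(c(0, 2, 5),
  function(delta) make.K(m, delta) %*% x.true + rnorm(m, sd = 1))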

Model fitting via variational methods
In this section, we study variational Bayes approximations, specifically MFVB and VMP, for fitting the base model (2). Emphasis is placed on the message passing on factor graph fragment prescription, which provides the infrastructure for compartmentalized and scalable algorithm implementations.
For studying MFVB and VMP, note that the joint density function of all the random variables and random vectors in model (2) admits the following factorization:

p(y, x, b, σ_ε², σ_x², a_ε, a_x) = p(y | x, σ_ε²) p(x | b, σ_x²) p(b) p(σ_ε² | a_ε) p(a_ε) p(σ_x² | a_x) p(a_x).    (7)

Both variational inference procedures are based upon approximating the joint posterior density function through a product of approximating density functions. A possible mean field restriction on the joint posterior density function of all parameters in (2) is

p(x, b, σ_ε², σ_x², a_ε, a_x | y) ≈ q(x) q(b) q(σ_ε²) q(σ_x²) q(a_ε) q(a_x).    (8)

Typically, more restrictive density products facilitate the derivation of a variational algorithm but also negatively impact the quality of the approximation. On the other hand, less severe restrictions may increase the algebraic and computational complexity of the variational algorithms. The restriction used above provides a good trade-off between tractability and accuracy of the approximation.
The scope of MFVB and VMP is to provide expressions for the optimal q-densities that minimize the Kullback-Leibler divergence between the approximating densities and the left-hand side of (8). The former is based upon a directed acyclic graph interpretation of the model, whereas the latter benefits from a factor graph representation and its subsequent division into factor graph fragments. While the two variational inference procedures lead to ostensibly different iterative algorithms, they converge to identical posterior density function approximations when the inputs of the algorithms are the same and the parameters are updated in the same sequence, since they are both founded upon the same optimization problem (Wand, 2017). Details for fitting model (2) under restriction (8) via MFVB and VMP are given in Sections 3.1 and 3.2.

Mean field variational Bayes
Mean field variational Bayes is a well established approximate Bayesian inference technique where tractability is achieved through factorization of the approximating density. Here we provide a short introduction to MFVB and refer the reader to Section 3 of Wand et al. (2011) for fuller details on its derivation.
Let z be a vector of observed data and θ ∈ Θ represent all model parameters. The logarithm of the model marginal likelihood, log p(z), satisfies

log p(z) ≥ log p(z; q) ≡ ∫ q(θ) log{ p(z, θ) / q(θ) } dθ,

where p(z; q) is a lower bound depending on a density function q defined over Θ and the joint density function p(z, θ). It can be shown that maximizing the above lower bound is equivalent to minimizing the Kullback-Leibler divergence between q(θ) and the joint posterior density function p(θ|z),

KL( q(θ) ∥ p(θ|z) ) = ∫ q(θ) log{ q(θ) / p(θ|z) } dθ.
If the approximating density is factorized according to a partition (θ_1, ..., θ_s) of θ such that q(θ) = ∏_{k=1}^s q(θ_k), as on the right side of (8), then the optimal approximating densities satisfy

q*(θ_k) ∝ exp[ E_{q(θ\θ_k)} { log p(z, θ) } ], k = 1, ..., s,    (9)

where E_{q(θ\θ_k)} denotes the expectation with respect to all the approximating densities except q(θ_k) and θ\θ_k represents the entries of θ with θ_k omitted. Under mild regularity conditions it can be shown that optimization of the lower bound can be performed via a coordinate ascent scheme converging to a local maximizer.
Figure 3 is a directed acyclic graph representation of model (2), which forms the basis for deriving an MFVB algorithm. The shaded circle corresponds to the observed data, the empty circles correspond to model parameters and auxiliary variables, and the small solid circles are used for hyperparameters. According to (9), the full conditional density functions of the nodes in the directed acyclic graph provide expressions for the optimal approximating densities. For instance, the full conditional density of x arising from model (2) and (3) satisfies

p(x | y, b, σ_ε², σ_x²) ∝ exp[ −{1/(2σ_ε²)} ‖y − Kx‖² − {1/(2σ_x²)} x^T L^T diag(b) L x ].

From application of (9) it follows that

q*(x) is a N(μ_q(x), Σ_q(x)) density function,    (10)

whose parameters depend on μ_q(1/σ_ε²), μ_q(1/σ_x²) and μ_q(b), which respectively denote the expectations of 1/σ_ε², 1/σ_x² and b computed with respect to the optimal approximating densities q*(σ_ε²), q*(σ_x²) and q*(b) = ∏_{j=1}^d q*(b_j) arising from the mean field restriction (8). Referring to restriction (8), the other MFVB approximations to the posterior density functions of the parameters and auxiliary variables depicted in the directed acyclic graph have the following optimal forms:

q*(b_j) is an Inverse-Gaussian(μ_q(b_j), λ_q(b_j)) density function, for j = 1, ..., d,    (11)
q*(σ_ε²) is an Inverse-χ²(κ_q(σ_ε²), λ_q(σ_ε²)) density function,    (12)
q*(σ_x²) is an Inverse-χ²(κ_q(σ_x²), λ_q(σ_x²)) density function,    (13)
q*(a_ε) is an Inverse-χ²(κ_q(a_ε), λ_q(a_ε)) density function    (14)
and q*(a_x) is an Inverse-χ²(κ_q(a_x), λ_q(a_x)) density function.    (15)
Details about the optimal density function parameters are given in the supplement. Such parameters can be obtained through the MFVB iterative scheme listed as Algorithm 1.
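As a flavor of the updates in Algorithm 1, the following R sketch implements the q*(x) step under the reconstructed form (10); the function and variable names are ours, and a practical implementation would avoid forming L and K explicitly, as discussed in Section 4.

# A minimal sketch (ours) of the q*(x) update, under the reconstructed (10).
update.q.x <- function(y, K, L, mu.q.recip.sig2.eps, mu.q.recip.sig2.x, mu.q.b) {
  Omega.q.x <- mu.q.recip.sig2.eps * crossprod(K) +
    mu.q.recip.sig2.x * crossprod(L, mu.q.b * L)   # L^T diag(mu.q.b) L term
  Sigma.q.x <- solve(Omega.q.x)
  mu.q.x <- mu.q.recip.sig2.eps * (Sigma.q.x %*% crossprod(K, y))
  list(mu.q.x = mu.q.x, Sigma.q.x = Sigma.q.x)
}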

Variational message passing
The idea behind variational message passing as presented in Wand (2017) is that the same approximations as in (9) can be achieved by exploiting a convenient factor graph representation of the model. A detailed description of VMP as a method for fitting statistical models that have a factor graph representation is provided in Sections 2-4 of Wand (2017) and briefly summarized here. The same notational conventions of Wand (2017) concerning message passing are used in this work.
A factor graph is an ensemble of factors connected to stochastic nodes by edges. Let f_h, h = 1, ..., r, denote a generic factor, which is a function of one or more stochastic nodes, and let θ_k, k = 1, ..., s, be a generic stochastic variable represented by a node. If θ_k is a neighbor of f_h in the factor graph, then the messages passed from f_h to θ_k and from θ_k to f_h are functions of θ_k and are denoted by m_{f_h → θ_k}(θ_k) and m_{θ_k → f_h}(θ_k), respectively.
Algorithm 1 Mean field variational Bayes scheme for fitting the inverse problem model (2), using product density restriction (8).
With a view to deriving an algorithm, the VMP stochastic node to factor message updates are given by

m_{θ_k → f_h}(θ_k) ←∝ ∏_{h′ ≠ h: θ_k is a neighbor of f_{h′}} m_{f_{h′} → θ_k}(θ_k)    (16)

and the factor to stochastic node message updates are

m_{f_h → θ_k}(θ_k) ←∝ exp[ E_{f_h → θ_k} { log f_h(θ_{neighbors(h)}) } ],    (17)

where E_{f_h → θ_k} denotes expectation with respect to the density function

∝ ∏_{k′ ∈ neighbors(h)\{k}} m_{f_h → θ_{k′}}(θ_{k′}) m_{θ_{k′} → f_h}(θ_{k′}),    (18)

neighbors(h) ≡ {k = 1, ..., s : θ_k is a neighbor of f_h} and the ←∝ symbol means that the function of θ_k on the left-hand side is updated according to the expression on the right-hand side, although multiplicative factors not depending on θ_k can be ignored. Upon convergence of the messages, the optimal q-densities are obtained via

q*(θ_k) ∝ ∏_{h: θ_k is a neighbor of f_h} m_{f_h → θ_k}(θ_k).    (19)

The messages above are typically proportional to an exponential family density function and so are such that

m_{f_h → θ_k}(θ_k) ∝ exp{ T(θ_k)^T η_{f_h → θ_k} } and m_{θ_k → f_h}(θ_k) ∝ exp{ T(θ_k)^T η_{θ_k → f_h} },    (20)

where T(θ_k) is a sufficient statistic vector, and η_{f_h → θ_k} and η_{θ_k → f_h} are the message natural parameter vectors. Then for each parameter θ_k, the optimal approximating density q*(θ_k) belongs to an exponential family with natural parameter vector

η_{q(θ_k)} = Σ_{h: θ_k is a neighbor of f_h} η_{f_h → θ_k},    (21)

which can be computed at convergence of a VMP algorithm.
The notion of a factor graph fragment, or simply fragment, allows for compartmentalization of algebra and computer code and can be exploited to fit inverse problems via VMP without resorting to calculations involving (16) and (17) each time a VMP algorithm is derived for a new model. The factor graph representation of (7) given the density product restriction (8) appears in Figure 4, where the square nodes correspond to the density functions, or factors, on the right-hand side of (7) and the circular nodes correspond to stochastic nodes of the q-density factorization in (8). Colors mark different fragment types, in accordance with the nomenclature presented in Wand (2017) for variational message passing on factor graph fragments, and numbers label six factor graph fragments.

Figure 4: Factor graph representation of (7) given the density product restriction (8). The square nodes correspond to the factors on the right-hand side of (7) and the circular nodes to the stochastic nodes of the q-density factorization in (8). The fragments are numbered 1 to 6, whereas colors identify different fragment types.

Some of these fragments have been studied in previous works. Those numbered 1, 2, 5 and 6 are already catalogued in Maestrini and Wand (2021) as Inverse G-Wishart prior fragments (1 and 6) and iterated Inverse G-Wishart fragments (2 and 5). Fragment number 4 corresponds to the Gaussian likelihood fragment treated in Section 4.1.5 of Wand (2017), whose notation can be aligned with that of model (2) by setting A, θ_1 and θ_2 equal to the current K, x and σ_ε², respectively. From a VMP perspective, the updates for fragment 4, involving the factor p(y | x, σ_ε²) and the stochastic nodes x and σ_ε², can simply be read off equations (38) and (39) of Wand (2017), which themselves originate from (17), with the inner product notation defined in Section 2.7 of Wand (2017). New calculations are needed only for fragment 3 to compose a VMP algorithm for the whole model. In fragment 3, the factor p(b) symbolizes the specification

b_j ~ ind. Inverse-χ²(2, 1), j = 1, ..., d,    (22)

but may also represent other penalizations. Examples of alternative shrinkage distributions are displayed in Table 1 and discussed in Section 3.3. The factor p(x | b, σ_x²) of fragment 3 corresponds to the possibly improper specification (3). The logarithm of this likelihood factor is

log p(x | b, σ_x²) = −(d/2) log(2πσ_x²) + (1/2) Σ_{j=1}^d log b_j − {1/(2σ_x²)} x^T L^T diag(b) L x + const.    (23)

Combining (23) with the auxiliary variable distributional assumption in (22), we obtain the VMP scheme listed as Algorithm 2 for fitting the penalization part of model (2). This originates from first noticing that, as a function of x, (23) is within the Multivariate Normal family, which in light of (17) and (20) determines the message sent from the factor p(x | b, σ_x²) to x; here, following (18), E_⊠ denotes expectation with respect to the normalization of the product of messages passed between the factor and σ_x², and E_⊕ denotes expectation with respect to the normalization of the product of messages passed between the factor and b. Rewriting (23) as a function of σ_x² and b and applying similar considerations leads to Algorithm 2 (see the supplement for full derivations).
The combination of Algorithm 2 with the Inverse G-Wishart fragment algorithms of Maestrini and Wand (2021) and the Gaussian likelihood fragment algorithm of Wand (2017) gives rise to a full VMP procedure for fitting and approximate inference on model (2). Complete details about the implementation of VMP for fitting the inverse problem model (2) are provided in the supplement. At convergence of the VMP procedure, (19) and (21) can be applied to obtain the approximating densities in explicit form. For instance, the natural parameter vector of q*(x) is obtained by summing the natural parameter vectors of the messages passed from neighboring factors to x, and this can be rewritten in terms of the common parameters of the approximating density (10) of MFVB.

Algorithm 2: Variational message passing inputs, updates and outputs of the penalization likelihood fragment given by (3) and (22), corresponding to factor graph fragment 3 of Figure 4.

Both MFVB and VMP can achieve the same approximations (10)-(15) since it is possible to establish a direct correspondence between the initialisation values of the two variational procedures. For example, using the hyperparameter input A_x in Algorithm 1 for MFVB corresponds to a call of VMP to the Inverse G-Wishart Prior Fragment (Maestrini and Wand, 2021, Algorithm 1), with graph input G_diag corresponding to a diagonal matrix, to initialise the natural parameter vector η_{p(a_x) → a_x} of the message sent from factor p(a_x) to stochastic node a_x. Further details about the initialisation of VMP are provided in the supplement.
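For concreteness, the sketch below (ours) converts converged message natural parameters into common parameters, under one standard exponential family convention that may differ from the vech-based convention of Wand (2017).

# A sketch (ours) of natural-to-common parameter conversions for (10)-(13).
natural.to.normal <- function(eta1, Omega) {
  # assumes q(x) proportional to exp{x' eta1 - (1/2) x' Omega x}
  Sigma.q <- solve(Omega)
  list(mu.q = as.vector(Sigma.q %*% eta1), Sigma.q = Sigma.q)
}
natural.to.invchisq <- function(eta) {
  # sufficient statistic (log x, 1/x): kappa = -2(eta_1 + 1), lambda = -2 eta_2
  c(kappa = -2 * (eta[1] + 1), lambda = -2 * eta[2])
}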

Alternative response and penalization distributions
The message passing on factor graph fragments paradigm allows for flexible imposition of non-Normal response distributions. For instance, fragment 4 of Figure 4 can be replaced by one of the likelihood fragments identified in Nolan and Wand (2017), Maestrini and Wand (2018) or McLean and Wand (2019) to accommodate a variety of response distributions such as, for instance, binary-logistic, Poisson, Negative Binomial, t, Asymmetric Laplace, Skew Normal, Skew t and Finite Normal Mixture responses.
Other penalization structures can be easily incorporated by varying the distributional assumption on the vector of auxiliary variables b. Neville et al. (2014) studied MFVB inference for three continuous sparse signal shrinkage distributions, namely the Horseshoe, Negative-Exponential-Gamma and Generalized Double Pareto distributions, which can replace the Laplace penalization employed in model (2). References for the development of these sparse shrinkage priors are respectively Carvalho et al. (2010), Griffin and Brown (2011) and Armagan et al. (2013).
The derivation of variational algorithms for models containing these shrinkage distributions can be quite challenging. In fact, the algebraic complexity of the variational algorithm and its inferential performance rely upon a careful choice of auxiliary variable representations. Neville et al. (2014) propose two alternative auxiliary variable representations for each of the aforementioned shrinkage distributions, making use of either one or two sets of auxiliary variables. Their empirical and theoretical evidence shows the superiority of representations based on a single set of auxiliary variables in terms of posterior density approximation and computational complexity for all three cases. If this auxiliary variable representation is chosen, then the three penalizations can easily be imposed in model (2) as a replacement for the Laplace distribution by simply modifying the distributional assumption on the auxiliary vector b. Algorithms 1 and 2 can still be used by simply replacing the update for μ_q(b) with one of those listed in the last column of Table 1. Some expressions in Table 1 related to the Negative-Exponential-Gamma and Generalized Double Pareto cases depend on the parabolic cylinder function of order ν, D_ν(x); see Neville et al. (2014).

Table 1: Distributions of the auxiliary variables b_j > 0, j = 1, ..., d, that produce the penalizations analyzed in Neville et al. (2014) when introduced in model (2) in lieu of the Laplace penalization. Here the shape parameter λ > 0 is fixed. For each distribution, the MFVB or VMP update for μ_q(b) is displayed in the last column. When Algorithm 1 is used to perform MFVB, ζ ≡ τ²; when Algorithm 2 is used for running VMP, ζ ≡ ω_6.
Several other penalizations can be imposed on the base inverse problem model. The penalization in model (2) is a particular case of the Bayesian lasso of Park and Casella (2008), which makes use of a Gamma prior on the Laplace squared scale parameter. Tung et al. (2019) show the use of MFVB for variable selection in generalized linear mixed models via the Bayesian adaptive lasso. Ormerod et al. (2017) develop an MFVB approximation to a linear model with a spike-and-slab prior on the regression coefficients. A detailed discussion of variable selection priors and variational inference fitting goes beyond the scope of this article. However, in the analysis of archaeological data provided in the supplement we show how the Laplace penalization can easily be replaced by a Horseshoe prior without deriving a VMP algorithm from scratch. In the same real data analysis we replace the Normal response assumption with a Skew Normal one.

Streamlined variational algorithm updates
In typical inverse problem applications the number of observations can be very high and a naïve implementation of Algorithms 1 and 2 may lead to a bottleneck due to operations involving a large contrast matrix L and big matrix inversion steps related to K. However, the structure of the matrices L and K is such that computationally expensive algorithm updates can be performed efficiently. In this section we propose solutions to simplify algorithm updates and reduce their computational complexity. The results shown here are designed for one- and two-dimensional problems but are extendible to higher dimensions.

Removal of the contrast matrix
The contrast matrix L is a potentially massive sparse matrix that one does not want to compute or store. The number of rows of L is given by the number d of differences between the elements of x being considered, whereas the number of columns is given by the length m of x. For one-dimensional problems with a first nearest neighbor structure only the two longest diagonals of L have non-zero elements, as shown in (4). The contrast matrix of two-dimensional problems under the same assumptions is sparse and has a number of non-zero elements equal to twice the number of differences between elements of x, that is 2d.
The updates of the variational Algorithms 1 and 2 that make use of matrix L have the following forms:

Lv, where v is a vector;    (24)
diagonal(L M L^T), where M is a symmetric positive definite matrix;    (25)
L^T diag(w) L, where w is a vector.    (26)

The results presented in the supplementary material allow efficient computation of the MFVB and VMP updates that utilise matrix L in the forms described in (24)-(26). One- and two-dimensional problems are considered and the results can potentially be extended to higher dimensions, e.g. for voxel-type data in three dimensions.

Sparsification of K
The matrix K is a potentially big matrix whose size depends on the number of observations and the length of x. It is easy to notice from the MFVB scheme presented as Algorithm 1 that the algebraic operations in which matrix K appears have a small number of generic forms, given in (27)-(30); the first of these, (27), involves the inversion of a matrix of the type s_1 K^T K + s_2 L^T diag(w) L, where s_1 and s_2 are scalar positive numbers. Concerning VMP, the expression in (27) appears in the update for Ω_1 of Algorithm 2, whereas those in (28)-(30) arise in the Gaussian likelihood fragment, numbered as fragment 4 in the Figure 4 factor graph. Visual inspection of (27)-(30) suggests it is worth studying the structure of K and K^T K for a computationally efficient implementation of the variational algorithm updates. The focus of this section is on two-dimensional inverse problems. Again, we restrict our attention to the case where X and Y are both m_1 × m_2 matrices and each element of X has a one-to-one correspondence with the element having the same position in Y. Under these conditions K is a square matrix of size m × m, with m = m_1 m_2. If model (2) is used, setting x = vec(X) and y = vec(Y), the matrix K is a symmetric block-Toeplitz matrix with m_2 unique sub-blocks, each being an m_1 × m_1 symmetric Toeplitz matrix. For simple unidimensional problems K is a symmetric Toeplitz matrix.
Both the MFVB and VMP updates in Algorithm 1 and (31) require the inversion of a matrix of size m × m. From (31) it is easy to notice that the structure of the matrix being inverted is influenced by K through K^T K. A possible way to reduce the computational burden induced by these updates is to sparsify K in such a way that K^T K, and therefore the final matrix to invert, is also sparse. Since K linearly links elements in X with those in Y, and given the one-to-one correspondence between X and Y, it is reasonable to set to zero the elements of K which correspond to interactions between pairs of locations whose distance exceeds a certain truncation value ℓ ∈ N, with ℓ < min(m_1, m_2). This strategy can find application, for instance, both in the context of the biomedical imaging discussed in Section 5 and the archaeological field survey treated in the supplementary material. More formally, this consists in setting to zero the entries of K that model dependence between pairs of elements of X and Y that are more than ℓ rows or columns apart. The resultant K is an ℓ-block-banded matrix whose sub-blocks are ℓ-banded matrices. The product K^T K may also result in a sparse matrix for particular choices of ℓ, as stated in the following result.
Result 2. Let A be an ℓ-block-banded matrix of size m × m with ℓ-banded sub-blocks and such that 0 < ℓ < (m − 1)/2. Then A^T A is a symmetric 2ℓ-block-banded matrix whose sub-blocks are 2ℓ-banded matrices.
Figure 5 depicts the K matrix and K^T K for an inverse problem where X and Y have size 7 × 10 and K is sparsified by applying a truncation value ℓ = 2. The blue color indicates non-zero matrix entries. In this case, matrix K is a 2-block-banded matrix with 2-banded sub-blocks, whereas K^T K is a 4-block-banded matrix with 4-banded sub-blocks.
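A quick numerical check in R (our snippet) of the banded, non-block analogue of Result 2:

# If A is l-banded then A^T A is 2l-banded (one-dimensional analogue of Result 2).
set.seed(1)
bandwidth <- function(A) max(abs(row(A) - col(A))[A != 0])
m <- 30; ell <- 2
A <- outer(1:m, 1:m, function(i, j) (abs(i - j) <= ell) * runif(m^2))
bandwidth(A)              # returns 2, i.e. l
bandwidth(crossprod(A))   # returns 4, i.e. 2l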
It is easy to check through Lemma 6 that applying such a sparsification strategy to K imposes the same sparsity structure on the matrices being inverted in updates (31) and (32). Hence, for appropriate choices of ℓ, the updates involve the inversion of a sparse matrix having a block-banded structure, with banded matrices in the main block diagonals. The suggested sparsification strategy has a physical interpretation. In the two-dimensional problems under examination each element of Y linearly depends on a subset of the elements of X through K. If the elements of K are set to zero according to a truncation number ℓ, then such a subset is given by those elements of X that fall inside a circle of diameter 2ℓ + 1 around the element of X having one-to-one correspondence with the Y entry. Sparsifying the expression to be inverted in (31) means setting to zero some elements of the precision matrix Ω_q(x) = Σ_q(x)^{-1}, where Σ_q(x) is the covariance matrix of the optimal approximating density function q*(x) in (10). Since q*(x) is a Multivariate Normal density function, (Ω_q(x))_ij = 0, for i, j = 1, ..., m, if and only if x_i and x_j are conditionally independent given all the other elements of x.
Note that, differently from the algebraic results proposed for the removal of L, the sparsification applied to K comes from nullifying interactions between elements of X and Y, and therefore it introduces another level of approximation into the variational fitting procedure.

Block-banded matrix algebra

Asif and Moura (2005) propose two algorithms to invert block-banded matrices whose inverses are positive definite matrices: one resorting to Cholesky factors and an alternative implementation that avoids Cholesky factorizations. These algorithms require the inversion of smaller matrices having the size of the block-banded matrix sub-blocks. In the two-dimensional inverse problems under examination these sub-blocks have a banded structure. An approach for inverting banded matrices is described in Kılıç and Stanica (2013). These algebraic approaches for handling block-banded and banded matrices may allow for stable computation of variational algorithm steps involving sparse matrix inversions such as (27). However, the simplest way to perform efficient sparse matrix inversion and computations is to employ software for sparse matrix algebra. The functions contained in the R package spam (Furrer and Sain, 2010) allow efficient management of sparse matrices and implement matrix operations. This package can be used in combination with the package spam64 (Gerber et al., 2017) to speed up such functions on 64 bit machines. Well established software is also available for lower-level languages, such as the linear algebra libraries Armadillo and Eigen for C++ coding.
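As an illustration of the sparse matrix route (a sketch; we believe these functions exist as used, but the spam documentation should be consulted):

# Sparse storage and solve with spam for a truncated kernel matrix.
library(spam)
m <- 1000; ell <- 5
K.dense <- outer(1:m, 1:m,
  function(i, j) exp(-(i - j)^2 / 2) * (abs(i - j) <= ell))
K <- as.spam(K.dense)                  # store only the non-zero entries
A <- t(K) %*% K + diag.spam(0.1, m)    # sparse positive definite matrix
x <- solve(A, rnorm(m))                # sparse solve, no dense inversion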
In general, matrices having banded or block-banded inverses are full matrices. Nonetheless, the inverse of a banded matrix may be referred to as a band-dominated matrix (Bickel and Lindner, 2012), since the entries of the inverse decay exponentially with the distance from the main diagonal (Demko et al., 1984, Theorem 2.4). This property can be generalized to block-banded matrices, and the inverse of a block-banded matrix can be approximated by a block-banded matrix with the same sparsity structure, i.e. with zero blocks off the main block band (Wijewardhana and Codreanu, 2016). The blocks outside the ℓ-block band of a symmetric positive definite matrix having an ℓ-block-banded inverse are called nonsignificant blocks and those in the ℓ-block band are called significant blocks. Theorem 3 of Asif and Moura (2005) states that the nonsignificant blocks can be obtained from the significant ones. A possible way to further speed up the algebra involving these matrices and reduce memory usage is then to impose a block-banded structure on the precision matrix Ω_q(x) and approximate Σ_q(x) with a block-banded matrix having the same structure as Ω_q(x). In this case only the significant blocks of the covariance matrix Σ_q(x), which depend solely on the significant blocks of Ω_q(x), need to be computed.

Biomedical data study
This section demonstrates the use of variational inference for two-dimensional inverse problems motivated by a biomedical application. Illustrations on a real dataset and on simulations that mimic the real data are provided.
We assess the performance of variational inference through comparison with MCMC and computation of accuracy. For a generic univariate parameter θ, the accuracy of the approximation of a density q(θ) to a posterior density p(θ|y) is measured through

accuracy = 100 [ 1 − (1/2) ∫ |q(θ) − p(θ|y)| dθ ] %,    (33)

so that 0% ≤ accuracy ≤ 100%, with 100% indicating perfect matching between the approximating and posterior density functions. We compute accuracy using Markov chain Monte Carlo as a benchmark. A standard Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) is used to produce approximate samples from the posterior distribution. The Markov chains are started at feasible points in the parameter space and the retained samples are used to approximate the corresponding posterior density functions via kernel density estimation. Accuracy is then obtained from (33) with p(θ|y) replaced by MCMC density estimates of the posterior density functions. Variational inference is performed removing the contrast matrix L through Lemmas 4-6 and sparsifying the matrix K via truncation of interactions, as explained in Section 4. Our MCMC implementation also avoids direct use of the matrices L and K to reduce the computational burden. The simulation study was run on a personal computer with a 64 bit Windows 10 operating system, an Intel i7-7500U central processing unit at 2.7 gigahertz and 16 gigabytes of random access memory. Variational inference was fully performed in R, whereas MCMC was run in R with subroutines replacing the L and K matrix operations implemented in C++.
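A sketch (ours) of how (33) can be computed in practice, with the posterior density estimated from MCMC draws via kernel density estimation; function and variable names are ours.

# Discretized version of the accuracy measure (33) on an equispaced grid.
accuracy <- function(theta.grid, q.dens, mcmc.draws) {
  p.dens <- approx(density(mcmc.draws), xout = theta.grid)$y
  p.dens[is.na(p.dens)] <- 0              # zero outside the KDE support
  h <- theta.grid[2] - theta.grid[1]
  100 * (1 - 0.5 * sum(abs(q.dens - p.dens)) * h)
}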

Real biomedical data
We test the performance of our variational inference approach on a real biomedical application from the realm of tomographic data.Tomography aims to display cross-sections through human and animal bodies, or other solid objects, using data collected around the body.
The data, kindly provided by BioEmission Technology Solutions, Athens, Greece, were collected to illustrate a small animal imaging system, gamma-eye, which can be used in biotechnology and pre-clinical medical research (Georgiou et al., 2017). A technetium radioisotope (99mTc-MIBI) was injected via the tail vein of a mouse and mainly absorbed by organs such as the heart, liver and kidneys, and then excreted. In humans such techniques are used to monitor heart function, and mice are often used in pre-clinical studies. A single plane gamma-camera image of an adult mouse was collected with the camera at a distance of 5mm from the nearest point of the mouse and 35mm from the support bed. The 29 × 58 pixels of the data image are 1.7mm apart, giving a field of view of about 5cm by 10cm. The mouse was anesthetized, so this corresponds to the "at rest" part of a human scan, which would also involve a "stress test". The total data recording time was 3 hours.
The objective is to reconstruct an image by removing blur from the observed scan of the mouse shown and denoted as Y in Figure 6. We adopt model (2), setting y = vec(Y) and using a Gaussian kernel matrix K obtained through (6) with δ = 0.7. Hence, y and x are vectors of length equal to the number of pixels, 1,682, and K has size 1,682 × 1,682. We set to zero the elements of K expressing interactions between locations that are 10 or more pixels apart, using ℓ = 10. The random walk Metropolis-Hastings algorithm is used to produce approximate samples from the posterior distribution by simulating a Markov chain with a burn-in of 1,000 iterations followed by 100,000 further iterations, which are then thinned by a factor of 20. Hyperparameters are set to values that give rise to diffuse priors; specifically, A_ε = A_x = 10^5. Estimates of X obtained via variational inference and MCMC are included in Figure 6. Here the term variational inference refers to both MFVB and VMP, as they provide the same results by construction. The estimate of X obtained via variational inference corresponds to the inverse vectorization of μ_q(x) from (10), whereas that of MCMC is given by the mean of the sampled chains.
Figure 6 also displays approximate posterior density plots for a selection of six representative pixels. Five of the six selected pixels correspond to targeted organs of the mouse, namely the thyroid, liver, kidneys and bladder, while the remaining pixel is located outside but near the mouse body. Overall, the variational approximations provide good image reconstruction and facilitate visualisation of the mouse body shape and organs. The posterior density approximations are also satisfactory in terms of accuracy, that is, in terms of area overlap with the MCMC posterior density estimates. As is typical of variational approximations based on mean field restrictions, variational inference underestimates the variance of the approximate posterior densities. Plots of some representative bivariate posterior densities are also provided in the supplement. These show that in all cases the variational approximation covers a smaller area, indicating that it underestimates uncertainty relative to MCMC.

Simulated biomedical data
We employ the real biomedical data image processed through MCMC, the one corresponding to the lower-right panel of Figure 6, to simulate datasets and study the performance of variational inference in comparison with MCMC. In this simulation study we also keep track of computational times and calculate percentages of coverage. For a given parameter, the percentage of coverage corresponds to the proportion of simulations in which the true parameter falls inside its 95% credible interval obtained through variational inference. Let X_MCMC be the inverse vectorization of the estimate of x obtained as the sample mean of the corresponding MCMC chains. We simulate data through

y = K vec(X_MCMC) + ε,    ε ~ N(0_m, σ_ε² I_m),

with σ_ε = 50 and K generated according to (6), using δ values from the set (0.7, 0.8, 0.9) and without truncating interactions (ℓ = ∞). The example plots provided in the supplement show how the blur increases for higher δ. For each δ value we generate 100 datasets and fit model (2) via both MFVB and MCMC. For fitting we apply the same K matrix used to generate the datasets, i.e. without truncation of interactions, but we also fit the model again setting to zero the elements of K associated with pairs of observations whose distance exceeds a truncation value ℓ = 5. The MFVB algorithm is stopped when the relative difference between estimates of X from two successive iterations falls below 10^{-2}. We generate Markov chains of length 6,000 and retain 5,000 of them after discarding 1,000 warm-up samples. Table 2 summarizes the results of the simulation study, including: the accuracy of the variational inference estimates of X versus MCMC; the variational inference percentage of coverage for X, i.e. the proportion of times the entries of X_MCMC fall inside their 95% variational inference credible intervals; and the variational inference and MCMC computational times. Averages and standard deviations (in brackets) of each indicator are displayed for each combination of δ and ℓ values.
For each pixel of each simulated dataset we compute the accuracy of the variational approximation using (33) and then average over the 100 replicates. We then calculate the average and standard deviation over all the entries of X and display these in Table 2. We repeat the same procedure for measuring the percentage of coverage. The mean accuracy values range between 87.10 and 88.07, whereas the mean percentages of coverage lie between 92.75 and 95.09 and are therefore close to the nominal 95% level. Both accuracy and coverage performances slightly degrade for higher values of δ and blur.
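For reference, a sketch (ours) of the pixel-wise coverage computation, assuming Normal variational marginals with means mu.q.x and standard deviations sd.q.x:

# Percentage of entries of the reference image inside 95% credible intervals.
coverage.pct <- function(x.ref, mu.q.x, sd.q.x) {
  lower <- mu.q.x - qnorm(0.975) * sd.q.x
  upper <- mu.q.x + qnorm(0.975) * sd.q.x
  100 * mean(x.ref >= lower & x.ref <= upper)
}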
The variational inference and MCMC computational times are displayed in minutes and show that variational inference is around 100 times faster for the δ = 0.7 setting. Imposing a truncation ℓ = 5 on K reduces the MCMC computational times by about six to seven minutes on average for the three δ values under examination, as opposed to not applying truncation. The time performances of variational inference are not particularly affected by truncation, as the R package spam efficiently manages the algorithm updates involving K in both cases, with and without truncation. Once again, variational inference was fully performed in R, while the main MCMC function was implemented in R with C++ subroutines that replace the K and L matrix operations. Therefore, further computational advantages may be achieved by implementing variational inference in C++.

Table 2: Results of the simulation study based on 100 replicates per δ value generated using the kernel matrix K defined in (6) without truncation. Fitting is performed via variational inference and MCMC using the same K matrix employed to generate the data (no truncation) and also imposing a truncation threshold ℓ = 5 on the interactions expressed by K. Higher δ values correspond to more blur. The smaller ℓ is, the sparser the K matrix used to fit the model.

Discussion
We have laid down the infrastructure for performing variational approximate inference on applications that can be studied through statistical inverse problems models.Our variational algorithms allow fast approximate fitting for these models, with satisfactory accuracy compared with standard MCMC.
The run-time of MCMC estimation for inverse problems is usually excessive on a standard personal computer. This means that parameter tuning, model diagnostics and sensitivity analyses are rarely performed. The use of variational inference methods, which are quick by comparison, means that multiple parameter settings and multiple models can be considered in a reasonable length of time. This opens up the possibility of model diagnostics, such as influence and leverage, becoming a routine part of applied inverse problem solving. Further, there is, as always, a subjective element to the choice of model components and in particular the hierarchical prior components. It would be a great advance for application areas, such as medicine and archaeology, if rogue measurements could be identified and their influence on estimation could be quantified. Also, if any arbitrary parts of the model could be shown to have an insignificant effect on results, this would lead to far greater confidence and hence a wider acceptability of advanced statistical modelling approaches.
Hence, the implications of our work are not limited to the numerical results presented: we provide a framework for other researchers to develop a richer set of model exploration methods for inverse problems. The flexibility of our approach is such that non-Normal likelihood distributions and other penalizations can be incorporated at will. Furthermore, this research sets the basis for several future directions to explore, such as the study of settings with more than two dimensions, a number of observations not matching the number of data recording locations, or more complex neighbor dependence structures.

Supplement for:
A variational inference framework for inverse problems
BY L. MAESTRINI (1), R.G. AYKROYD (2) AND M.P. WAND (3)
(1) Australian National University, (2) University of Leeds and (3) University of Technology Sydney

S.1. Algorithm derivations
This section provides full justification of Algorithms 1 and 2.

S.1.1. Derivation of Algorithm 1
The optimal approximating densities satisfy expression (9). The full conditional density functions of each hidden node in the directed acyclic graph of Figure 3 allow one to derive expressions for the optimal approximating densities.

S.1.2. Derivation of Algorithm 2
Consider expression (23) for log p(x | b, σ_x²) as a function of the single components x, σ_x² and b. As a function of x, (23) is within the Multivariate Normal family, where E_⊠ denotes expectation with respect to the normalization of the product of messages passed between the factor p(x | b, σ_x²) and σ_x², and E_⊕ denotes expectation with respect to the normalization of the product of messages passed between the factor and b. As a function of σ_x², (23) is within the Inverse Chi-Squared family, with E_⊗ denoting expectation with respect to the normalization of the product of messages passed between the factor and x. As a function of b, it follows from (16) that the message sent from the factor to b is a product of Inverse Gaussian density functions with the natural parameter vector given in Algorithm 2.
It follows from Table S.1 of the supplementary material of Wand (2017) that the relevant expectations have closed forms. Again from Table S.1 of Wand (2017), since E_⊠ corresponds to the expectation of an Inverse Chi-Squared distribution with the natural parameter vector of the corresponding normalized message product, the expressions for ω_3 and ω_4 provided in Algorithm 2 follow. The expressions for η_{p(x|b,σ_x²) → x} and η_{p(x|b,σ_x²) → σ_x²} outputted by Algorithm 2 follow similarly.

S.1.3. Introducing alternative penalizations
One of the continuous distributions for sparse signal shrinkage of Neville et al. (2014) can be used in lieu of the Laplace penalization. This entails the replacement of (S.1) for MFVB and (S.2) for VMP with the corresponding alternative auxiliary variable distributions. The adjustments in Algorithms 1 and 2 that allow for the inclusion of such penalizations are summarized in Table 1.

S.2. Implementation of variational message passing
This section demonstrates how to fully implement VMP on the base model, making use of Algorithm 2 and other relevant VMP algorithms for fragments that have already been studied in previous works.
The VMP approach to fitting model (2) under restriction (8) takes a response vector y of length m and a K matrix of size m × m as data inputs, and A_ε, A_x > 0 as hyperparameter inputs. At convergence, VMP provides the optimal posterior density function approximations (10)-(15).

S.2.1. Initialization
The message natural parameters arising from the factor graph in Figure 4 have to be initialised at feasible points in the parameter space.
The natural parameter vector η_{p(a_x) → a_x} can be initialized through the Inverse G-Wishart Prior Fragment (Maestrini and Wand, 2021, Algorithm 1). The algorithm also provides the graph G_{p(a_x) → a_x} as an output. An analogous call to the same algorithm provides η_{p(a_ε) → a_ε} and G_{p(a_ε) → a_ε}. The remaining factor to stochastic node message natural parameters, as well as the stochastic node to factor message natural parameters, can be initialized at any feasible values in the corresponding natural parameter spaces.

S.2.2. Variational message passing iterations
Once the natural parameter vector initializations are carried out, the stochastic node to factor and factor to stochastic node message parameters are updated in cycles until convergence. A possible way to assess convergence is to monitor the relative difference between parameter estimates from subsequent iterations. The updates for the factor to stochastic node message parameters are performed via Algorithm 2 and other VMP schemes proposed in the existing literature.

S.2.2.1. Stochastic node to factor message parameter updates
The stochastic node to factor message updates for the factor graph of Figure 4 follow from (16). Note that the updates for η_{a_ε → p(σ_ε²|a_ε)} and η_{a_x → p(σ_x²|a_x)} remain constant throughout the iterations.

S.2.2.2. Factor to stochastic node message parameter updates
The updates for the parameters of the factor to stochastic node messages require the use of the VMP algorithms described in Subsection 3.2. Their usage to obtain the remaining updates is as follows:

- use Algorithm 2 of Maestrini and Wand (2021) for the iterated Inverse G-Wishart fragment, with shape parameter input 1 and the appropriate message graph and natural parameter inputs, once for the messages involving σ_ε² and a_ε and once for those involving σ_x² and a_x;
- use Algorithm 2 of the current work for the penalization likelihood fragment, with the appropriate natural parameter inputs;
- use the algorithm of Section 4.1.5 of Wand (2017) for the Gaussian likelihood fragment, with data inputs y and K and the appropriate natural parameter inputs.

S.2.3. Approximating density functions
After reaching convergence, the remaining task is to derive the optimal approximating densities. The densities of main interest are q*(x), q*(σ_ε²) and q*(σ_x²), and they have the forms expressed in (10), (12) and (13), which come from (19). The q-density common parameters are obtained from the sums of the natural parameter vectors of the messages passed to each stochastic node. Expressions for the other optimal approximating densities can be obtained in a similar manner.

S.3. Removing L from the variational algorithms
We provide results that allow simplification of the variational algorithm updates involving the contrast matrix L by means of simpler computational steps. The results are presented separately for one- and two-dimensional problems and make use of the notation summarized in Table S.1.

S.3.1. One-dimensional case
Consider a one-dimensional inverse problem analyzed via model (2) and suppose x has one-to-one correspondence with a sequence of hidden and equispaced locations on a line. Assume first nearest neighbor differences are modeled via the contrast matrix L_1D defined in (4). Then matrix L can be removed from (24) by making use of the following lemma.
Lemma 1. Let v be a vector of length d_v and L_1D a (d_v − 1) × d_v matrix having form (4). Then L_1D v = (v_2 − v_1, ..., v_{d_v} − v_{d_v−1})^T, the vector of first differences of v.

We can get rid of matrix L in expressions having form (25) through the next lemma.

Lemma 2. Let M be a symmetric positive definite matrix of size d_M × d_M and L_1D a (d_M − 1) × d_M matrix having form (4). Then diagonal(L_1D M L_1D^T) has jth entry M_{jj} − 2 M_{j,j+1} + M_{j+1,j+1}, j = 1, ..., d_M − 1, which is a vector of length d_M − 1.

The following lemma simplifies the computation of (26).

Table S.1: Notation used for two-dimensional inverse problems with first nearest neighbor differences and one-to-one correspondence between X and Y (e.g., d denotes the total number of differences in X).
Lemma 3. Let w be a vector of length d_w and L_1D a d_w × (d_w + 1) matrix having form (4). Then L_1D^T diag(w) L_1D is the tridiagonal matrix of size (d_w + 1) × (d_w + 1) whose ith diagonal entry is w_{i−1} + w_i (with the conventions w_0 = w_{d_w+1} = 0) and whose (i, i + 1)th and (i + 1, i)th entries are −w_i.
In the R computing environment, Lemmas 1-3 can be implemented with standard base functions. In particular, the function diff(v) automatically produces the result stated in Lemma 1. The expressions originating from Lemmas 2 and 3 can be visualized through the examples provided in Section S.3.3.
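A compact R implementation of the three one-dimensional operations, following the reconstructed statements of Lemmas 1-3 (a sketch; function names are ours), which never forms L explicitly:

# One-dimensional versions of (24)-(26) without forming L.
Lv <- function(v) diff(v)                        # Lemma 1: L %*% v
diag.LMLt <- function(M) {                       # Lemma 2: diagonal(L M L^T)
  j <- 1:(nrow(M) - 1)
  M[cbind(j, j)] - 2 * M[cbind(j, j + 1)] + M[cbind(j + 1, j + 1)]
}
Lt.diagw.L <- function(w) {                      # Lemma 3: L^T diag(w) L
  n <- length(w) + 1
  A <- matrix(0, n, n)
  A[cbind(1:n, 1:n)] <- c(w, 0) + c(0, w)        # diagonal: w[i-1] + w[i]
  A[cbind(1:(n - 1), 2:n)] <- -w                 # superdiagonal
  A[cbind(2:n, 1:(n - 1))] <- -w                 # subdiagonal
  A
}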

S.3.2. Two-dimensional case
Consider the study of bidimensional inverse problems through model (2) with x = vec(X) and X being an m_1 × m_2 matrix whose entries correspond to a regular grid of locations. Then a first nearest neighbor contrast matrix, here denoted by L_2D, can be conveniently defined by stacking a matrix L_H of horizontal differences and a matrix L_V of vertical differences, as specified in (S.3)-(S.5). The matrices L_H and L_V are built from matrices L_H0 and L_V0, which have the form (4) identified for the one-dimensional case and dimensions (m_2 − 1) × m_2 and (m_1 − 1) × m_1, respectively. Superscripts H and V refer to differences between pairs of locations respectively computed in a horizontal and vertical fashion, given the row-wise and column-wise orientation identified by X. If, for instance, X is of size 3 × 4 as in the example of Figure 1, then the explicit expression of L_2D is the one provided in Section S.3.4. For clarity, the coefficients that define the dimensions of the L_2D matrix components are summarized in Table S.1.
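One plausible R construction of L_2D consistent with these definitions (our reading of (S.3)-(S.5), assuming column-major vectorization; the row ordering may differ from the paper's, which employs a commutation matrix C):

# Kronecker-product construction of the two-dimensional contrast matrix.
first.diff <- function(n) diff(diag(n))      # (n - 1) x n matrix of form (4)
m1 <- 3; m2 <- 4
L.H <- first.diff(m2) %x% diag(m1)           # horizontal differences, d_H x m
L.V <- diag(m2) %x% first.diff(m1)           # vertical differences, d_V x m
L.2D <- rbind(L.H, L.V)
dim(L.2D)                                    # 17 x 12, matching Figure 1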
Lemmas 4–6 are extensions of Lemmas 1–3 to the two-dimensional case, provided that L_2D has the form identified by (S.3)–(S.5). Vector and matrix dimensions are intentionally emphasized to highlight analogies with one-dimensional problems and to set up a framework extendable to higher dimensions.
The next lemma simplifies the computation of expression (24) and avoids calculating and storing the matrix L_2D.

Lemma 4. …, where … are vectors of length d_H and d_V, respectively, and u = vec(V^T).
Expression (25) can be efficiently computed using the following lemma.
Lemma 5. …, where s_H and s_V are vectors of length d_H and d_V, respectively, such that …, and k is a vector of d_V indices obtained as follows: ….
The last lemma reduces the computational effort for implementing (26).
Lemma 6. Let w be a vector of length d_w, and L_2D a matrix of size d_w × (m_1 m_2) having the form defined in (S.3)–(S.5). Then … and … are vectors of length d_H and m_1 m_2 − 1, respectively.
The supplement provides heuristic arguments supporting Lemmas 4–6 and demonstrates the implementation of the two-dimensional case lemmas in R.

S.3.3. Visualization of Lemmas 1-3
Consider a simple one-dimensional inverse problem modeled through (2) where m = 4 and d = m − 1 = 3. For this particular example the contrast matrix has size d × m and, given form (4), the explicit expression

L_1D =
  −1   1   0   0
   0  −1   1   0
   0   0  −1   1 .

S.3.4. Visualization and R implementation of Lemmas 4-6
This subsection shows the results stated in Lemmas 4–6 through a simple two-dimensional inverse problem model where X is a matrix of parameters of size m_1 × m_2 with m_1 = 3 and m_2 = 4. Referring to model (2), the length of x = vec(X) is m = m_1 m_2 = 12. For this particular case the contrast matrix L_2D defined in (S.3) is a matrix of size d × m, hence of size 17 × 12, given that d_H = m_1(m_2 − 1) = 9 and d_V = m_2(m_1 − 1) = 8. Here L_H and L_V are defined through (S.4) by making use of the matrices L_0^H and L_0^V shown in (S.5), and C is a commutation matrix of appropriate size.
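As an illustration, the following R sketch builds a matrix with the dimensions just described via Kronecker products. This is one plausible arrangement consistent with the stated sizes, in which horizontal differences are stacked above vertical ones; the commutation matrix form in (S.3)–(S.5) may order the rows differently.

    m1 <- 3; m2 <- 4
    L.H0 <- diff(diag(m2))              # (m2 - 1) x m2, as in (S.5)
    L.V0 <- diff(diag(m1))              # (m1 - 1) x m1, as in (S.5)
    L.H <- kronecker(L.H0, diag(m1))    # horizontal differences of vec(X)
    L.V <- kronecker(diag(m2), L.V0)    # vertical differences of vec(X)
    L.2D <- rbind(L.H, L.V)
    dim(L.2D)                           # 17 x 12, matching d x m above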
The objective of the following subsubsections is to visualize the results expressed in Lemmas 4-6 for the particular case under examination.

S.3.4.1. Visualization of Lemma 4
Vector v has length d_v = m_1 m_2 = 12 and …. This involves the following two vectors: ….

The remainder of this subsection provides R code to implement Lemmas 4–6 for two-dimensional inverse problems. We make use of the function invvec from the package ks (Duong, 2024). Note that this function specifies the matrix dimension with the number of columns preceding the number of rows, which is at odds with the definition of vec⁻¹ given in Section 1.2.
Consider, for instance, a two-dimensional dataset of size 50 × 40. The result expressed in Lemma 6 can be computed in R as follows: ….
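The original listing is not reproduced above; the sketch below instead illustrates the matrix-free device underlying Lemmas 4–6 for the 50 × 40 example, computing the horizontal and vertical differences of x = vec(X) via invvec() and diff() without ever forming L_2D. The stacking order of the two difference blocks is the assumption carried over from the earlier sketch.

    library(ks)
    m1 <- 50; m2 <- 40
    x <- rnorm(m1 * m2)
    X <- invvec(x, ncol = m2, nrow = m1)   # note: ncol precedes nrow
    diffs.H <- t(diff(t(X)))               # horizontal (within-row) differences
    diffs.V <- diff(X)                     # vertical (within-column) differences
    Lx <- c(vec(diffs.H), vec(diffs.V))    # equals L.2D %*% x for the stacking above
    length(Lx)                             # d = m1*(m2 - 1) + m2*(m1 - 1)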

S.5. Illustration for archaeological data
We here show how the message passing on factor graph fragments paradigm can be used to move from one inverse problem model to another without deriving a variational inference algorithm from scratch. This is illustrated through data from archaeological magnetometry. In archaeology it is often necessary to investigate a potential site via geophysical remote sensing methods before any physical excavation commences. The model we consider includes a Skew Normal distribution for the response vector and a Horseshoe penalization in place of the Normal response and Laplace penalization of model (2).
The data were collected from a mid Iron Age farmstead known as 'The Park' through an archaeological exploration that took place in 1994 at Guiting Power in Gloucestershire, United Kingdom. After data collection, part of the area was also excavated and archaeologists drew an impression of the remains that were brought to light. The archaeological site was partitioned into a grid of 10m × 10m squares with the survey axes aligned in the directions of magnetic north and east. A fluxgate gradiometer FM18 with 0.1nT sensitivity was used to collect the data at 0.5m intervals. For each survey square, the gradiometer output was an array of 20 × 20 magnetic anomaly readings corresponding to the difference between the signals detected by the lower and upper sensors. The gradiometer lower sensor was held at 0.2m above the surface of the site and the upper sensor was fixed a further 0.5m higher. The Earth's magnetic field is detected by both sensors, whereas the magnetic field from the buried feature is mostly detected by the lower sensor. Therefore the difference between the two sensors' readings provides the magnetic anomaly due to the hidden feature, together with small random noise components. The random noise may be due to systematic causes such as machine-rounding errors, or non-systematic factors such as disturbance by small stones in the soil or interference from local magnetic sources (Scollar, 1970).
We model the hidden surface through a single layer of rectangular prisms at a fixed burial depth. The aim is to estimate the prisms' magnetic susceptibility in order to separate the constituent epochs of the site and locate relevant artifacts prior to excavation. The subsurface layer is assumed to be fixed at a burial depth of 0.3m, given that the topsoil across the excavated region of Guiting Power was found to be between 0.25 and 0.3m deep. According to the single layer subsurface model, each prism in the layer has the same constant vertical extent, which we set to 0.5m. Although the vertical extents of the excavated features vary from 0.45m to 1.6m, the chosen value of prism vertical extent increases the chances of distinguishing low-susceptibility features.

All this information is relevant to designing an appropriate matrix K for inverse problems concerning archaeological data.

S.5.1. K matrix for archaeological data
The matrix K we adopt for the illustration on archaeological data has a more complicated definition and relies upon the spread function defined in Section 2 of Aykroyd et al. (2001). The spread function models magnetic anomalies and depends on the latitude and longitude of the archaeological site on the Earth's surface, the geometry of the gradiometer used to survey the area and the physical properties of the site. Suppose all the magnetic features are located at the same depth below the surface, have the same vertical thickness, and the susceptibility is constant along any vertical line through the features but may vary between horizontal locations. We model the subsurface of the archaeological site as an ensemble of volume elements of equal size, called prisms, each having uniform susceptibility, fixed vertical depth (0.3m) and extent (0.5m), and a square cross section in the horizontal plane. Let (x_1, y_1, z_1) and (x_2, y_2, z_2) be the coordinates of opposite vertices of a prism with unit susceptibility, where the x, y and z axes point north, east and vertically downward, respectively. Then the vertical component of the anomaly due to the prism at a point with coordinates (x, y, z) is ΔZ(x, y, z) = …, where B ≈ 4.8 × 10⁴ nT (nanoteslas) is the magnitude of the magnetic flux density due to the Earth's field.
The three additive components of ΔZ(x, y, z) are …, where I is the inclination of the Earth's magnetic field and θ is the angle between the direction of magnetic north and the x axis. In our application, I = 65° and θ = 0°.
The difference between two simultaneous readings from the two sensors is recorded. One sensor is mounted vertically above the other at a distance of 0.5m. The recorded reading originated by a prism identified by the coordinates above is then

h(x, y) = ΔZ(x, y, z_L) − ΔZ(x, y, z_U), (S.6)

where z_U is the vertical coordinate of the upper sensor and z_L is that of the lower sensor, which are held at 0.7m and 0.2m above the surface throughout the survey. Suppose that a vector y of n readings is recorded over a rectangular site at coordinates s_j, j = 1, …, n. Assume that the subsurface is divided into a rectangular assemblage of m prisms having coordinates t_i, i = 1, …, m, whose susceptibilities are collected in an m × 1 vector x. The influence of prism i at surface location j is then …, (S.7), where h is the function defined in (S.6). The linear relationship between y and x is then modeled as E(y) = Kx, where each element of the n × m matrix K is given by (S.7).
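As a sketch, K can be assembled in R as follows, with a hypothetical placeholder spreadFun() standing in for the function h of (S.6); the actual anomaly formula is given in Aykroyd et al. (2001). The coordinates s and t.grid are illustrative, with one prism beneath each reading location of a 10m × 10m survey square sampled at 0.5m intervals.

    # Placeholder spread function; NOT the expression in (S.6).
    spreadFun <- function(s, t) exp(-sum((s - t)^2))
    grid.pts <- seq(0, 9.5, by = 0.5)
    s <- as.matrix(expand.grid(grid.pts, grid.pts))   # n = 400 reading locations
    t.grid <- s                                       # m = 400 prism locations
    n <- nrow(s); m <- nrow(t.grid)
    K <- matrix(0, n, m)
    for (j in 1:n)
      for (i in 1:m)
        K[j, i] <- spreadFun(s[j, ], t.grid[i, ])     # element (j, i), as in (S.7)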

S.5.2. Model and VMP Implementation
As noted in Aykroyd et al. (2001), the assumptions used to model the hidden surface provide a realistic basis for modeling the data but can be quite restrictive. The recorded magnetic anomaly may include both positive and negative values, exhibit shifts in the apparent location of features, be asymmetric in shape and vary across different sampling regions. For these reasons we model the outcome variable y, i.e. the anomaly detected by the gradiometer, through a Skew Normal distribution and impose a Horseshoe penalization on the differences between hidden susceptibility values, x_Δ. The Horseshoe prior shrinks the small signals while leaving the large ones essentially unshrunk, highlighting the contrast between buried features and plain soil.
The model we fit is the following:

…, … ∼ N(0, 1), i = 1, …, m, …. (S.8)

The first line of this model gives rise to a Skew Normal likelihood and is equivalent to …. Here the density function of a random variable z having a Skew-Normal(µ, σ², λ) distribution is p(z) = (2/σ) φ{(z − µ)/σ} Φ{λ(z − µ)/σ}, with scale parameter σ > 0, skewness parameter λ, and φ and Φ denoting the standard Normal density and cumulative distribution functions. The second line of (S.8) specifies a Horseshoe penalization on the (x_Δ)_j's, as indicated in Table 1. The priors at line three give rise to conjugate message passing updates for the stochastic nodes of σ_ε and λ. As in model (2), σ_x is assigned a Half-Cauchy(A_x) prior through the auxiliary variable a_x.
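For concreteness, the Skew-Normal density above can be coded and sanity-checked in R as follows (a sketch; dskewnorm is not part of the paper's software):

    dskewnorm <- function(z, mu = 0, sigma = 1, lambda = 0)
      (2 / sigma) * dnorm((z - mu) / sigma) * pnorm(lambda * (z - mu) / sigma)
    integrate(dskewnorm, -Inf, Inf, lambda = 3)   # integrates to 1 for any lambda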
The joint density function of model (S.8) is given by …. (S.9) Steps similar to those presented for the base model can be used to implement VMP for the model with Skew Normal responses and Horseshoe prior. The starting point is again the choice of a factorization for the approximating q-densities. We impose the following mean field approximation to the joint posterior: … q_j(b_j). (S.10) The right-hand side of (S.9) gives rise to the factor graph representation displayed in Figure S.5. In this factor graph, the fragments numbered 1–3 arising from the base model (2) are preserved. Fragment 4 is now the Skew Normal likelihood fragment studied in Section 3.4 of McLean and Wand (2019). The message passed from this fragment to σ²_ε takes the form of an Inverse Square Root Nadarajah density, whereas the message passed to λ is part of the Sea Sponge family. These two families of distributions are defined in Sections S.2.3 and S.2.5 of the supplementary material of McLean and Wand (2019). Priors on σ²_ε and λ that are conjugate to these messages must have the forms given in (S.13) and (S.14). We impose diffuse priors using A_ε = B_ε = 10⁻² and S_λ = 10⁵.
Full implementation of VMP is based on iteratively updating, for each factor graph fragment depicted in Figure S.5: (i) the parameter vectors of messages passed from the fragment's neighboring stochastic nodes to the fragment's factors; (ii) the parameter vectors of the messages passed from the fragment's factors to their neighboring stochastic nodes. The first step is very simple and entails application of (7) of Wand (2017). The factor to stochastic node updates of the second step can be performed through various VMP procedures:

• Fragment 1 is an Inverse G-Wishart prior fragment and the factor to stochastic node parameter vector updates can be performed using Algorithm 1 of Maestrini and Wand (2021) with inputs G_Θ = G_full, ξ_Θ = 1 and Λ_Θ = (A_x²)⁻¹.
• Fragment 2 is an iterated Inverse G-Wishart prior fragment and the factor to stochastic node parameter vector updates can be performed using Algorithm 2 of Maestrini and Wand (2021) with inputs G = G_full, ξ = 1 and G_{A→p(Σ|A)} = G_diag.
• The factor to stochastic node parameter vector updates of fragment 3 can be performed through Algorithm 2.
• Fragment 4 is the Skew Normal likelihood fragment and the factor to stochastic node parameter vector updates can be performed using Algorithm 4 of McLean and Wand (2019) with y and K as data inputs.
• Fragment 5 corresponds to the imposition of an Inverse Square Root Nadarajah prior distribution on the variance parameter σ²_ε. The output of VMP applied to this fragment is the natural parameter vector of the prior density function, that is, η_{p(σ²_ε)→σ²_ε} from (S.13).
• Fragment 6 corresponds to the imposition of a Sea Sponge prior distribution on the skewness parameter λ. The output of VMP applied to this fragment is the natural parameter vector of the prior density function, that is, η_{p(λ)→λ} from (S.14).
We restrict our attention to the portion of the archaeological dataset corresponding to the excavated area, which enables a qualitative performance assessment of our fitting method. Figure S.6 displays the data under examination (Y) together with the results of the application of VMP to model (S.8) and the impression drawn by archaeologists. The data were handled in a disjoint way by separating the two rectangular areas corresponding to indices J and K of the archaeologists' impression. It is standard practice to examine grids as soon as they are collected, and this division of the surveyed area allows the data to be partitioned into two full matrices, Y_J for area J and Y_K for area K, of size 30 × 20 and 40 × 20, respectively. Model (S.8) can be used setting y = vec(Y_J) or y = vec(Y_K), and the reconstruction of X is simply obtained as the inverse vectorization of the VMP estimate of x. We employ the mean of the optimal approximating density q*(x), which is a N(µ_q(x), Σ_q(x)) density function, to estimate x. Expressions for q*(x) and the other q-densities of interest are provided in the supplement.
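A minimal sketch of the reconstruction step for area J, using the ncol-before-nrow convention of ks::invvec noted in Section S.3.4; here mu.q.x is a hypothetical name for the fitted posterior mean of x:

    library(ks)
    mu.q.x <- rnorm(30 * 20)                    # stand-in for the VMP estimate of x
    X.hat <- invvec(mu.q.x, ncol = 20, nrow = 30)
    image(t(X.hat))                             # quick look at the 30 x 20 reconstruction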
Compared to the original dataset, the VMP reconstruction of Figure S.6 shows greater contrast between background and features, and some weak features are more evident in the posterior mean reconstruction. Although the data were treated in a disjoint way, discontinuity in the estimate of X between areas J and K is not very apparent. Careful inspection of the reconstruction shows that the locations of some reconstructed features are shifted relative to their apparent positions in the original survey data image. This is important information, considering that each pixel corresponds to a square area of 0.5m side and that archaeological excavations require intensive manual work. The approximate posterior densities of λ suggest that the data from area J are symmetric, whereas those from area K are positively skewed.
To obtain the common parameters of q(σ²_ε) and q(λ) from their natural parameters, we make use of results from Sections S.2.3 and S.2.5 of the supplementary material of McLean and Wand (2019) concerning the Inverse Square Root Nadarajah and Sea Sponge distributions.

Figure 3: Directed acyclic graph representation of the Normal response model with Laplace penalization in (2). The shaded circle corresponds to the observed data. The unshaded circles correspond to model parameters and auxiliary variables. The small solid circles correspond to hyperparameters.

Figure 4: Factor graph representation of the Normal response model with Laplace penalization in (2), where the square nodes correspond to the density functions, or factors, on the right-hand side of (7). The circular nodes correspond to stochastic nodes of the q-density factorization in (8). The fragments are numbered 1 to 6, whereas colors identify different fragment types.

Figure 5: Representation of a sparse K matrix of size 70 × 70 for an inverse problem over a pixel grid of size 7 × 10 (left panel) and the corresponding K^T K matrix (right panel). These matrices are obtained by using a truncation number ℓ = 2. Non-zero values are depicted in blue.
Figure S.3: Marginal and bivariate marginal posterior densities obtained via variational inference (blue) and MCMC (orange) for Location 1 of Figure 6 and the adjacent pixels on its top, left, bottom and right sides.

Figure S.4: Examples of data generated in the biomedical simulation study for different values of δ.

Figure S.5: Factor graph representation of the Skew Normal response model with Horseshoe penalization in (S.8), where the square nodes correspond to the density functions, or factors, on the right-hand side of (S.9). The circular nodes correspond to stochastic nodes of the q-density factorization in (S.10). Numbers are used to distinguish fragments, whereas colors identify different fragment types.

Figure S.6: Results of the archaeological data study conducted via model (S.8) and variational inference. The top-left image displays the archaeological dataset, denoted by Y. Its variational inference reconstruction, denoted by X̂, is shown in the top-right image, whereas the bottom-left image corresponds to the archaeologists' impression of the 1994 excavation of 'The Park'. The plots on the bottom-right side show the variational approximate marginal posterior densities of the Skew Normal variance and skewness parameters for areas J and K of the archaeological site.

Figure 6: Analysis of the data provided by BioEmission Technology Solutions, Athens, Greece, performed by fitting model (2) via variational inference and MCMC. The panel on the upper-left side displays the observed data Y. The two panels on the left-hand side show, from top to bottom, the reconstructions of X obtained via variational inference and MCMC. The plots on the lower-left side show the approximate marginal posterior densities produced through variational inference (blue lines) and MCMC (orange lines) for a selection of pixels.