A Bayesian Conjugate Gradient Method

A fundamental task in numerical computation is the solution of large linear systems. The conjugate gradient method is an iterative method which offers rapid convergence to the solution, particularly when an effective preconditioner is employed. However, for more challenging systems a substantial error can be present even after many iterations have been performed. The estimates obtained in this case are of little value unless further information can be provided about the numerical error. In this paper we propose a novel statistical model for this numerical error set in a Bayesian framework. Our approach is a strict generalisation of the conjugate gradient method, which is recovered as the posterior mean for a particular choice of prior. The estimates obtained are analysed with Krylov subspace methods and a contraction result for the posterior is presented. The method is then analysed in a simulation study as well as being applied to a challenging problem in medical imaging.


Introduction
This paper presents an iterative method for solution of systems of linear equations of the form where A ∈ R d×d is an invertible matrix and b ∈ R d is a vector, each given, while x * ∈ R d is to be determined. The principal novelty of our method, in contrast to existing approaches, is that its output is a probability distribution over vectors x ∈ R d which reflects knowledge about x * after expending a limited amount of computational effort. This allows the output of the method to be used, in a principled anytime manner, tailored to reflect a constrained computational budget. In a special case, the mode of this distribution coincides with the estimate provided by the standard conjugate gradient method, whilst the probability mass is proven to contract onto x * as more iterations are performed. Challenging linear systems arise in a wide variety of applications; of these, partial differential equations (PDEs) should be emphasised, as these arise frequently throughout the applied sciences and in engineering [Evans, 2010]. Finite element and finite difference discretisations of PDEs each yield large, sparse linear systems which can sometimes be highly ill-conditioned, such as in the classically ill-posed backwards heat equation [Evans, 2010]. Even for linear PDEs, a detailed discretisation may be required. This can result in a linear system with billions of degrees of freedom and require specialised algorithms to be even approximately solved practically [e.g. Reinarz et al., 2018]. Another example arises in computation with Gaussian measures [Bogachev, 1998, Rasmussen, 2004, in which analytic covariance functions, such as the exponentiated quadratic, give rise to challenging linear systems. This has an impact in a number of related fields, such as symmetric collocation solution of PDEs [Fasshauer, 1999, Cockayne et al., 2016, numerical integration [Larkin, 1972, Briol et al., 2018 and generation of spatial random fields [Besag and Green, 1993, Parker and Fox, 2012, Schäfer et al., 2017. In the latter case, large linear systems must often be solved to sample from these fields, such as in models of tropical ocean surface winds [Wikle et al., 2001] where systems may again be billion-dimensional. Thus, it is clear that there exist many important situations in which error in the solution of a linear system cannot practically be avoided.

Linear Solvers
The solution of linear systems is one of the most ubiquitous problems in numerical analysis and Krylov subspace methods [Hestenes andStiefel, 1952, Liesen andStrakos, 2012] are among the most successful at obtaining an approximate solution at low cost. Krylov subspace methods belong to the class of iterative methods [Saad, 2003], which construct a sequence (x m ) that approaches x * and can be computed in an efficient manner. Iterative methods provide an alternative to direct methods [Davis, 2006, Allaire andKaber, 2008] such as the LU or Cholesky decomposition, which generally incur higher cost as termination of the algorithm after m < d iterations is not meaningful. In certain cases an iterative method can produce an accurate approximation to x * with reduced computational effort and memory usage compared to a direct method.
The conjugate gradient (CG) method [Hestenes and Stiefel, 1952] is a popular iterative method, and perhaps the first instance of a Krylov subspace method. The error arising from CG can be shown to decay exponentially in the number of iterations, but convergence is slowed when the system is poorly conditioned. As a result, there is interest in solving equivalent preconditioned systems [Allaire and Kaber, 2008], either by solving P −1 Ax * = P −1 b (left-preconditioning) or AP −1 P x * = b (right-preconditioning), where P is chosen both so that P −1 A (or AP −1 ) has a lower condition number than A itself, and so that computing the solution of systems P y = c is computationally inexpensive for arbitrary y and c. Effective preconditioning can dramatically improve convergence of CG, and of Krylov subspace methods in general, and is recommended even for well-conditioned systems, owing to how rapidly conjugacy is lost in CG when implemented numerically. One reasonably generic method for sparse systems involves approximate factorisation of the matrix, through an incomplete LU or incomplete Cholesky decomposition [e.g. Ajiz andJennings, 1984, Saad, 1994]. Other common approaches exploit the structure of the problem. For example, in numerical solution of PDEs a coarse discretisation of the system can be used to construct a preconditioner for a finer discretisation [e.g. Bramble et al., 1990]. A more detailed survey of preconditioning methods can be found in many standard texts, such as Benzi [2002] and Saad [2003]. However, no approach is universal, and in general careful analysis of the structure of the problem is required to determine an effective preconditioner [Saad, 2003, p. 283]. At worst, constructing a good preconditioner can be as difficult as solving the linear system itself.
In situations where numerical error cannot practically be made negligible, an estimate for the error x m − x * must accompany the output x m of any linear solver. The standard approach is to analytically bound x m − x * by some function of the residual Ax m − b , for appropriate choices of norms, then to monitor the decay of the relative residual. In implementations, the algorithm is usually terminated when this reaches machine precision, which can require a very large number of iterations, and substantial computational effort. This often constitutes the principal bottleneck in contemporary applications. The contribution of this paper is to demonstrate how Bayesian analysis can be used to develop a richer, probabilistic description for the error in estimating the solution x * with an iterative method. From a user's perspective, this means that solutions from the presented method can still be used in a principled way, even when only a small number of iterations can be afforded.

Probabilistic Numerical Methods
The concept of a probabilistic numerical method dates back to Larkin [1972]. The principal idea is that problems in numerical analysis can be cast as inference problems and are therefore amenable to statistical treatment. Bayesian probabilistic numerical methods  posit a prior distribution for the unknown, in our case x * , and condition on a finite amount of information about x * to obtain a posterior that reflects the level of uncertainty in x * , given the finite information obtained. In contemporary applications, it is common for several numerical methods to be composed in a pipeline to perform a complex task. For example, climate models [such as Roeckner et al., 2003] involve large systems of coupled differential equations. To simulate from these models, many approximations must be combined. Bayesian probabilistic numerical methods are of particular interest in this setting, as a probabilistic description of error can be coherently propagated through the pipeline to describe the structure of the overall error and study the contribution of each component of the pipeline to that error . As many numerical methods rely on linear solvers, such as the CG method, understanding the error incurred by these numerical methods is critical. Other works to recently highlight the value of statistical thinking in this application area includes Calvetti et al. [2018].
In recent work, Hennig [2015] treated the problem of solving Eq. (1) as an infer-ence problem for the matrix A −1 , and established correspondence with existing iterative methods by selection of different matrix-valued Gaussian priors within a Bayesian framework. This approach was explored further in Bartels and Hennig [2016]. There, it was observed that the posterior distribution over the matrix in Hennig [2015] produces the same factors as in the LU or Cholesky decompositions 1 . Our contribution takes a fundamentally different approach, in that a prior is placed on the solution x * rather than on the matrix A −1 . There are advantages to the approach of Hennig [2015], in that solution of multiple systems involving the same matrix is trivial. However we argue that it is more intuitive to place a prior on x * than on A −1 , as one might more easily reason about the solution to a system than the elements of the inverse matrix. Furthermore, the approach of placing a prior on x * is unchanged by any left-preconditioning of the system, while the prior of Hennig [2015] is not preconditioner-invariant.
Contribution The main contributions of this paper are as follows: • The Bayesian conjugate gradient (BayesCG) method is proposed for solution of linear systems. This is a novel probabilistic numerical method in which both prior and posterior are defined on the solution space for the linear system, R d . We argue that placing a prior on the solution space is more intuitive than existing probabilistic numerical methods and corresponds more directly with classical iterative methods. This makes substitution of BayesCG for existing iterative solvers simpler for practitioners.
• The specification of the prior distribution is discussed in detail. Several natural prior covariance structures are introduced, motivated by preconditioners or Krylov subspace methods. In addition, a hierarchical prior is proposed in which all parameters can be marginalised, allowing automatic adjustment of the posterior to the scale of the problem. This discussion provides some generic prior choices to make application of BayesCG more straightforward for users unfamiliar with probabilistic numerical methods.
• It is shown that, for a particular choice of prior, the posterior mode of BayesCG coincides with the output of the standard CG method. An explicit algorithm is provided whose complexity is shown to be a small constant factor larger than that of the standard CG method. Thus, BayesCG can be efficiently implemented and could be used in place of classical iterative methods with marginal increase in computational cost.
• A thorough convergence analysis for the new method is presented, with computational performance in mind. It is shown that the posterior mean lies in a particular Krylov subspace, and rates of convergence for the mean and contraction for the posterior are presented. The distributional quantification of uncertainty provided by this method is shown to be conservative in general.
The structure of the paper is as follows: In Section 2 BayesCG is presented and its inputs discussed. Its correspondence with CG is also established for a particular choice of prior. Section 3 demonstrates that the mean from BayesCG lies in a particular Krylov subspace and presents a convergence analysis of the method. In Section 4 the critical issue of prior choice is addressed. Several choices of prior covariance are discussed and a hierarchical prior is introduced to allow BayesCG to adapt to the scale of the problem. Section 5 contains implementation details, while in Section 6 the method is applied to a challenging problem in medical imaging which requires repeated solution of a linear system arising from the discretisation of a PDE. The paper concludes with a discussion in Section 7. Proofs of all theoretical results are provided in the electronic supplement.

Methods
We begin in Section 2.1 by defining a Bayesian probabilistic numerical method for the linear system in Eq. (1). In Section 2.2 a correspondence to the CG method is established. In Section 2.3 we discuss a particular choice of search directions that define BayesCG. Throughout this paper, note that A is not required to be symmetric positive-definite, except for in Section 2.2.

Probabilistic Linear Solver
In this section we present a general probabilistic numerical method for solving Eq. (1). The approach taken is Bayesian, so that the method is defined by the choice of prior and the information on which the prior is to be conditioned. For this work, the information about x * is linear and is provided by search directions s i , i = 1, . . . , m d, through the matrix-vector products The matrix-vector products on the right-hand-side are assumed to be computed without error 2 , which implies a likelihood model in the form of a Dirac distribution: This section assumes the search directions are given a-priori. The specific search directions which define BayesCG will be introduced in Section 2.3.
In general the recovery of x * from m < d pieces of information is ill-posed. The prior distribution serves to regularise the problem, in the spirit of Tikhonov [1963], Stuart [2010]. Linear information is well-adapted to inference with stable distributions 3 such as the Gaussian or Cauchy distributions, in that the posterior distribution is available in closed-form. Optimal estimation with linear information is also well-understood [cf. Traub et al., 1988]. To proceed, let x be a random variable, which will be used to model epistemic uncertainty regarding the true solution x * , and endow x with the prior distribution where x 0 and Σ 0 are each assumed to be known a-priori, an assumption that will be relaxed in Section 4. It will be assumed throughout that Σ 0 is a symmetric and positivedefinite matrix.
Having specified the prior and the information, there exists a unique Bayesian probabilistic numerical method which outputs the conditional distribution p(x|y m )  where y m = [y 1 , . . . , y m ] satisfies y m = S m Ax * = S m b, and S m denotes the matrix whose columns are s 1 , . . . , s m . This is made clear in the following result: Proposition 1 (Probabilistic Linear Solver). Let Λ m = S m AΣ 0 A S m and r 0 = b−Ax 0 . Then the posterior distribution is given by This provides a distribution on R d that reflects the state of knowledge given the information contained in y m . The mean, x m , could be viewed as an approximation to x * that might be provided by a numerical method. From a computational perspective, the presence of the m × m matrix Λ −1 m could be problematic, as this implies a second linear system must be solved, albeit at a lower cost O(m 3 ). This could be addressed to some extent by updating Λ −1 m iteratively using the Woodbury matrix inversion lemma, though this would not reduce the overall cost. However, as the search directions can be chosen arbitrarily, this motivates a choice which diagonalises Λ m , to make the inverse trivial. This will be discussed further in Section 2.3.
Note that the posterior distribution is singular, in that det(Σ m ) = 0. This is natural since what uncertainty remains in directions not yet explored is simply the restriction, in the measure-theoretic sense, of the prior to the subspace orthogonal to the columns of S m A. As a result, the posterior distribution is concentrated on a linear subspace of R d . Singularity of the posterior makes computing certain quantities difficult, such as posterior probabilities. Nevertheless, Σ m can be decomposed using techniques such as the singular-value decomposition, so sampling from the posterior is straightforward.
Define a generic inner-product of two vectors in R d by x, x M = x M x , with associated norm · M . Note that for this to define a norm, it is required that M be a positive-definite matrix. The following basic result establishes that the posterior covariance provides a meaningful connection to the error of x m , when viewed as a point estimator: Thus the right hand side provides an upper bound on the relative error of the estimator x m in the Σ −1 0 -norm. This is a weak result and tighter results for specific search directions are provided later. In addition to bounding the error x m − x * in terms of the posterior covariance Σ m , we can also compute the rate of contraction of the posterior covariance itself: The combination of Propositions 2 and 3 implies that the posterior mean x m is consistent and, since the posterior covariance characterises the width of the posterior, Proposition 3 can be viewed as a posterior contraction result. This result is intuitive; after exploring m linearly independent search directions, x * has been perfectly identified in an m-dimensional linear subspace of R d . Thus, after adjusting for the weighting of R d provided by the prior covariance Σ 0 , it is natural that an appropriate measure of the size of the posterior should also converge at a rate that is linear.

Correspondence with the Conjugate Gradient Method
In this section we examine the correspondence of the posterior mean x m described in Proposition 1 with the CG method. It is frequently the case that Bayesian probabilistic numerical methods have some classical numerical method as their mean, due to the characterisation of the conditional mean of a probability distribution as the L 2 -best element of the underlying space consistent with the information provided [Diaconis, 1988.
The Conjugate Gradient Method A large class of iterative methods for solving linear systems defined by positive-definite matrices A can be motivated by sequentially solving the following minimisation problem: where K m is a sequence of m-dimensional linear subspaces of R d . It is straightforward to show that this is equivalent to: is a convex quadratic functional. Let S m ∈ R d×m denote a matrix whose columns are arbitrary linearly independent search directions s 1 , . . . , s m , with range(S m ) = K m . Let x 0 denote an arbitrary starting point for the algorithm. Then x m = x 0 + S m c for some c ∈ R m which can be computed by solving ∇f (x 0 + S m c) = 0. This yields: In CG [Hestenes and Stiefel, 1952] the search directions are constructed to simplify the inversion in Eq. (7) by imposing that the search directions are A-conjugate, that is, s CG i , s CG j A = 0 whenever i = j. A set {s i } of A-conjugate vectors is also said to be A-orthogonal, while if the vectors additionally have s i A = 1 for each i they are said to be A-orthonormal. For simplicity of notation, we will usually work with A-orthonormal search directions, but in most implementations of CG the normalisation step can introduce stability issues and is therefore avoided.
Supposing that such a set of A-orthonormal search directions can be found, Eq. (7) simplifies to which lends itself to an iterative numerical method: Search directions are also constructed iteratively, motivated by gradient descent on the function f (x), whose negative gradient is given by This construction leads to search directions s CG 1 , . . . , s CG m which form an A-orthonormal set. Eq. 8 makes clear the following proposition, which shows that for a particular choice of prior the CG method is recovered as the posterior mean from Proposition 1: Proposition 4. Assume A is symmetric and positive-definite. Let x 0 = 0 and Σ 0 = A −1 . Then, taking S m = S CG m , Eq. (5) reduces to x m = x CG m . This result provides an intriguing perspective on the CG method, in that it represents the estimate produced by a rational Bayesian agent whose prior belief about x * is modelled by x ∼ N (0, A −1 ). Dependence of the prior on the inaccessible matrix inverse is in accordance with the findings in Hennig [2015] (Theorem 2.4 and Lemma 3.4), in which an analogous result was presented. As observed in that paper, the appearance of A −1 in the prior covariance is not practically useful, as while the matrix inverse cancels in the expression for x m , it remains in the expression for Σ m .

Search Directions
In this section the choice of search directions for the method in Proposition 1 will be discussed, initially by following an information-based complexity [Traub et al., 1988] argument. For efficiency purposes, a further consideration is that Λ m should be easy to invert. This naturally suggests that search directions should be chosen to be conjugate with respect to the matrix AΣ 0 A , rather than A. Note that this approach does not require A to be positive-definite, as AΣ 0 A is positive-definite for any non-singular A. Two choices of search direction will be discussed: Optimal Information One choice is to formulate selection of S m in a decision-theoretic framework, to obtain optimal information in the nomenclature of . Abstractly, denote the probabilistic numerical method discussed above by P [·; µ, S m ] : R d → P(R d ), where P(R d ) is the set of all distributions on R d . The function P [b; µ, S m ] takes a right-hand-side b ∈ R d , together with a prior µ ∈ P(R d ) and a set of search directions S m and outputs the posterior distribution from Proposition 1. Thus P [b; µ, S m ] is a measure and P [b; µ, S m ](dx) denotes its infinitesimal element.
For general µ ∈ P(R d ), define the average risk associated with the search directions S m to be where L(x, x * ) represents a loss incurred when x is used to estimate x * . This can be thought of as the performance of the probabilistic numerical method, averaged both over the class of problems described by µ and over the output of the method. Optimal information in this paper concerns selection of S m to minimise R(S m , µ). The following proposition characterises optimal information for the posterior in Proposition 1 in the case of a squared-error loss function and when x 0 = 0. Proposition 5. Suppose µ = N (0, Σ 0 ) and consider the squared-error loss L(x, x * ) = x − x * 2 M where M is an arbitary symmetric positive-definite matrix. Optimal information for this loss is given by where Φ m is the matrix whose columns are the m leading eigenvectors of M The dependence of the optimal information on A − is problematic except for when M = A A, which corresponds to measuring the performance of the algorithm through the residual Ax m − b 2 2 . While this removes dependence on the inverse matrix, finding the search directions in this case requires computing the eigenvectors of AΣ 0 A , the complexity of which would dominate the cost of computing the posterior in Proposition 1.
Conjugacy A second, more practical method for obtaining search directions that diagonalise Λ m is similar to that taken in CG. Search directions are constructed which are conjugate to the matrix AΣ 0 A by following a similar procedure to that described in Section 2.2.
Proposition 6 (Conjugate Search Directions =⇒ Iterative Method). Assume that the search directions are AΣ 0 A -orthonormal. Denote r m = b − Ax m . Then, x m in Eq. (5) simplifies to while to compute Σ m in Eq. (6) it suffices to store only the vectors Σ 0 A s j , for j = 1, . . . , m.
On the surface, the form of this posterior differs slightly from that in Proposition 1, in that the data are given by s m r m−1 rather than s m r 0 . However, when search directions are conjugate, the two expressions are equivalent: Use of s m r m−1 reduces the amount of storage required compared to direct application of Eq. (5). It also helps with stability as, while search directions can be shown to be conjugate mathematically, the accumulation of numerical error from floating point precision is such that numerical conjugacy may not hold, a point discussed further in Section 5. An approach to constructing conjugate search directions for our probabilistic linear solver is now presented, again motivated by gradient descent. This is termed a Bayesian conjugate gradient method for the same reason as in CG, as search directions are chosen to be the direction of gradient descent subject to a conjugacy requirement, albeit a different one than in standard CG. In the context of Proposition 4, note that the search directions obtained coincide with those obtained from CG when A is symmetric positive-definite and Σ 0 = A −1 . Thus, BayesCG is a strict generalisation of CG. Note, however, that these search directions are constructed in a data-driven manner, in that they depend on the right-hand-side b. This introduces a dependency on x * through the relationship in Eq. 1 which is not taken into account in the conditioning procedure and leads to conservative uncertainty assessment, as will be demonstrated in Section 6.1.

BayesCG as a Krylov Subspace Method
In this section a thorough theoretical analysis of the posterior will be presented. Fundamental to the analysis in this section is the concept of a Krylov subspace.
For a vector w ∈ R d , the shifted Krylov subspace is defined as It is well-known that CG is a Krylov subspace method for symmetric positive-definite matrices A [Liesen and Strakos, 2012], meaning that It will now be shown that the posterior mean for BayesCG, presented in Proposition 6, is a Krylov subspace method. For convenience, let K * m : Proposition 9. The BayesCG mean x m satisfies This proposition gives an alternate perspective on the observation that, when A is symmetric positive-definite and Σ 0 = A −1 , the posterior mean from BayesCG coincides with x CG m : Indeed, for this choice of Σ 0 , K * m coincides with x 0 + K m (A, r 0 ) and furthermore, since under this choice of Σ 0 the norm minimised in Proposition 9 is · A , it is natural that the estimates x m and x CG m should be identical. Proposition 9 allows us to establish a convergence rate for the BayesCG mean which is similar to that which can be demonstrated for CG. Let κ(M ) = M 2 M −1 2 denote the condition number of a matrix M in the matrix 2-norm. Now, noting that κ(Σ 0 A A) his well-defined, as Σ 0 and A are each nonsingular, we have: This rate is similar to the well-known convergence rate which for CG, in which κ(Σ 0 A A) is replaced by κ(A). However, since it holds that κ(A A) ≥ κ(A), the convergence rate for BayesCG will often be worse than that for CG, unless Σ 0 is chosen judiciously to reduce the condition number of κ(Σ 0 A A). Thus it appears that there is a price to be paid when uncertainty quantification is needed. This is unsurprising, as it is generally the case that uncertainty quantification is associated with additional cost over methods for which uncertainty quantification is not provided.
Nevertheless, the rate of convergence in Proposition 10 is significantly faster than the rate obtained in Proposition 2. The reason for this is that knowledge about how the search directions S m were chosen has been exploited. The directions used in BayesCG are motivated by gradient descent on f (s). Thus, if gradient descent is an effective heuristic for the problem at hand, then the magnitude of the error x m − x * will decrease at a rate which is sub-linear. The same cannot be said for tr(Σ m Σ −1 0 ) which continues to converge linearly as proven in Proposition 3. Thus, the posterior covariance will in general be conservative when the BayesCG search directions are used. This is verified empirically in Section 6.1.

Prior Choice
The critical issue of prior choice is now examined. In Section 4.1 selection of the prior covariance structure will be discussed. Then in Section 4.2 a hierarchical prior will be introduced to address the scale of the prior.

Covariance Structure
When A is symmetric positive-definite, one choice which has already been discussed is to set Σ 0 = A −1 , which results in a posterior mean equal to the output of CG. However correspondance of the posterior mean with CG does not in itself justify this modelling choice from a probabilistic perspective and moreover this choice is not practical, as access to A −1 would give immediate access to the solution of Eq. Eq. (1). We therefore discuss some alternatives for the choice of Σ 0 .
Natural Prior Taking inspiration from probabilistic numerical methods for PDEs [Cockayne et al., 2016, Owhadi, 2015, another natural choice presents itself: The object through which information about x * is extracted is b, so it is natural, and mathematically equivalent, to place a relatively uninformative prior on the elements of b rather than on This prior is as impractical as that which aligns the posterior mean with CG, but has the attractive property that convergence is instantaneous when the search directions from Proposition 7 are used. To see this, observe that Thus this prior is natural, in that when using the search directions from Proposition 7, convergence occurs in one iteration.
Preconditioner Prior For systems in which a preconditioner is available, the preconditioner can be thought of as providing an approximation to the linear operator A. Inspired by the impractical natural covariance (A A) −1 , one approach proposed in this paper is to set Σ 0 = (P P ) −1 , when a preconditioner P can be found. Since by design the action of P −1 can be computed efficiently, so too can the action of Σ 0 . As mentioned in Section 1.1, the availability of a good preconditioner is problem-dependent.
Krylov Subspace Prior The analysis presented in Section 3 suggests another potential prior, in which probability mass is distributed according to an appropriate Krylov subspace K n (M, b). Consider a distribution constructed as the linear combination where w := (w 0 , . . . , w n ) ∼ N (0, Φ) for some positive-definite matrix Φ. The distribution on x K induced by Eq. (12) is clearly Gaussian with mean 0. To determine its covariance, note that the above expression can be rewritten as is the matrix whose columns form a basis of the Krylov subspace One issue with this approach is that the computation of the matrix K n is of the same computational complexity as n iterations of BayesCG, requiring n matrix-vector products. To ensure that this cost does not dominate the procedure, it is necessary to take n < m d. However, in this situation x * / ∈ K n (b, M ), so it is necessary to add additional probability mass on the space orthogonal to K n (M, b), to ensure that x * lies in the prior support. To this , and let K ⊥ n denote a matrix whose columns span for a scaling parameter ϕ ∈ R. Then, the proposed Krylov subspace prior is given by Practical issues associated with this approach are now discussed: • Choice of M : It seems natural to place mass on the Krylov subspace which x m occupies, that is K m (Σ 0 A A, Σ 0 A r 0 ), but dependence of this subspace on the prior covariance that is being computed makes this choice circular. An alternative choice is to choose M so that the projection of x * into K m (M, b) converges rapidly in m to the truth. The choice M = A seems natural, due to the known rapid convergence of CG, and corresponds to a prior encoding of the intuition that "CG search directions tend to work well". Another option would be to take M = P −1 A for some preconditioner P , but this final choice was not explored.
• Selection of Φ and ϕ: When M = A, the theoretical bound for the relative error of CG can be used to choose the elements of Φ, given in Proposition 10 when Σ 0 = A −1 . Let ξ < 1. Then we propose to choose Φ to be a diagonal matrix, with diagonal entries Φ ii = 2σξ i 2 where σ ∈ R is a scale parameter. Based upon Proposition 10, the ideal choices for ξ and σ are ξ = κ(A)−1 κ(A)+1 and σ = x * A . However, since these two quantities are not typically known a priori, estimates must in practice be used. The remaining parameter, ϕ, determines how much weight is given to the orthogonal component. Given the choice of Φ, it is natural to choose ϕ < [2σξ i+2 ] 2 ; this encodes the user's prior belief about how many iterations of standard CG are "typically needed".
• Computation of K ⊥ n : Computation of the complement of K n is equivalent to finding a basis of the set of all vectors v for which K n v = 0; that is, computing the null-space of K n . This can be accomplished by QR decomposition. Recall that a QR-decomposition produces matrices Q ∈ R d×d and R ∈ R d×n such that K n = QR. The matrix Q can be used to determine the null space of K n . After partitioning the matrix Q as Q = [Q 1 , Q 2 ], where Q 1 ∈ R d×(n+1) and Q 2 ∈ R d×(d−n−1) are respectively the first n + 1 and last d − n − 1 columns of Q, it holds that Q 2 is an orthonormal matrix whose columns form a basis of the required null-space.

Covariance Scale
For the distributional output of BayesCG to be useful it must be well-calibrated. Loosely speaking, this means that the true solution x * should typically lie in a region where most of the posterior probability mass is situated. As such, the scale of the posterior variance should have the ability to adapt and reflect the difficulty of the linear system at hand. This can be challenging, partially because the magnitude of the solution vector is apriori unknown and partially because of the aforementioned fact that the dependence of S m on x * is not accounted for in BayesCG.
In this section we propose to treat the prior scale as an additional parameter to be learned; that is we consider the prior model where x 0 , Σ 0 are as before, while ν ∈ R + . This can be viewed as a generalised version of the prior in Eq. (4), which is recovered when ν = 1. In this section we consider learning ν in a hierarchical Bayesian framework, but we note that ν could also be heuristically calibrated. An example of such a heuristic procedure is outlined in Section S3.
The approach pursued below follows a standard approach in Bayesian linear regression [Gelman et al., 2014]. More generally, one could treat the entire covariance as unknown and perform similar conjugate analysis with an inverse-Wishart prior, though this extension was not explored. Consider then endowing ν with Jeffreys' (improper) reference prior: The conjugacy of this prior with the Gaussian distribution is such that the posterior marginal distributions p(ν|y m ) and p(x|y m ) can be found analytically. For the following proposition, IG denotes an inverse-gamma distribution, while MVT m denotes a multivariate t distribution with m degrees of freedom.
Proposition 11 (Hierarchical BayesCG). When p(x|ν) and p(ν) are as above, the posterior marginal for ν is given by while the posterior marginal for x is given by When the search directions are AΣ 0 A -orthonormal, this simplifies to where ν m := S m r 0 2 2 /m.
Since r 0 reflects the initial error x 0 − x * , the quantity ν m can be thought of as describing the difficulty of the problem. Thus in this approach the scale of the posterior is data-dependent.

Implementation
In this section some important details of the implementation of BayesCG are discussed.
Numerical Breakdown of Conjugacy In standard CG, as well as for the sequentiallycomputed search directions from Proposition 7, while the search directions are conjugate in exact arithmetic, the propagation of floating point error is such that numerical conjugacy can fail to hold. This is due to the iterative way in which the method is implemented. The implication is that, while mathematical convergence is guaranteed in d iterations, m > d iterations may be required in practice for both the CG estimate to converge to x * and the BayesCG posterior to contract around it. When used in practise, CG is only run for m d iterations to mitigate the impact of conjugacy breakdown. This further highlights the importance of preconditioners, to ensure rapid convergence.
This phenomenon can be mitigated to some extent by the modification described in Eq. (11) which ensures that "local" conjugacy is exploited [Meurant, 2006]. While the new direction may not be numerically conjugate to r 0 , it is likely to be numerically conjugate to r m−1 by the construction in Proposition 7. Thus, computing the quantity s m r m−1 rather than s m r 0 promotes continued stability and convergence of the posterior mean x m .
The impact on the posterior covariance also deserves comment. In the regime when d > m, σ m takes complex values so the posterior covariance will no longer be meaningful. The underlying issue is that the fact that Λ m = I when the search directions are not perfectly conjugate, and so the simplification exploited in Proposition 6 induces overconfidence in the resulting posterior. Since the same issue arises in search directions obtained from CG, we note that the work of Hennig [2015], which also exploits conjugacy, is likely to suffer from the same deficiency. However a full treatment of floating point error in the context of BayesCG is rather technical and is deferred for future work. In this direction inspiration may be taken from Parker and Fox [2012], where a similar analysis was performed.
To circumvent these issues for the purposes of our experiments we introduce the batchcomputed search directions obtained by a Gram-Schmidt orthogonalisation procedure 4 : The batch-computed search directions are mathematically identical to the BayesCG search directions {s i } m i=1 . However, by explicitly orthogonalising with respect to all m − 1 previous directions, this batch procedure ensures that numerical conjugacy is maintained.
Computational Cost The cost of BayesCG is a constant factor higher than the cost of CG as three, rather than one, matrix-vector multiplications are required. Thus, the overall cost is O(md 2 ) when the search directions from Proposition 7 are used. When the batch-computed search directions are used an additional loop of complexity O(m) must be performed. Thus, the cost of the BayesCG algorithm with batch-computed search directions is O(m 2 d 2 ). Note that each of these costs assumes that A and Σ 0 are dense marices; in the case of sparse matrices the cost of the matrix-vector multiplications is driven by the number of nonzero entries of each matrix rather than the dimension d.
Termination Criteria An appealing use of the posterior distribution might be to derive a probabilistic termination criterion for BayesCG. Recall from Proposition 2 that x m approaches x * at a rate bounded by σ m := tr(Σ m Σ −1 0 ), and from Proposition 3 that To decide in practice how many iterations of BayesCG should be performed we propose a termination criterion based upon the posterior distribution from Proposition 11: Thus, termination when σ m < , for some tolerance > 0 that is user-specified, might be a useful criterion. However, Proposition 2 is extremely conservative, and since Proposition 10 establishes a much faster rate of convergence for in the case of BayesCG search directions, this is likely to be an overcautious stopping criterion in the case of BayesCG. Furthermore, since this involves a data-driven estimate of scale, Algorithm 1 Computation of the posterior distribution described in Proposition 6. The implementation is optimised compared to that given in Proposition 6; see Supplement S2 for detail. Further note that, for clarity, all required matrix-vector multiplications have been left explicit, but for efficiency these should be calculated once-per-loop and stored. Σ m can be computed from this output as Σ if r m 2 < then return x m , Σ F , ν m 21: end procedure the term ν m is not uniformly decreasing with m. As a result, in practise we advocate using a more traditional termination criterion based upon monitoring the residual; see Golub and Van Loan [2013, Section 11.3.8] for more detail. Further research is needed to establish whether the posterior distribution can provide a useful termination criterion.
Full pseudocode for the BayesCG method, including the termination criterion, is presented in Algorithm 1. Two algebraic simplifications have been exploited here relative to the presentation in the main text; these are described in detail in Section S2 of the supplement. A Python implementation can be found at github.com/jcockayne/bcg.

Numerical Results
In this section two numerical studies are presented. First we present a simulation study in which theoretical results are verified. Second we present an application to electrical impedance tomography, a challenging medical imaging technique in which linear systems must be repeatedly solved.

Simulation Study
The first experiment in this section is a simulation study, the goals of which are to empirically examine the convergence properties of BayesCG and to compare the output of the algorithm against the probabilistic approach of Hennig [2015].
For our simulation study, a matrix A was generated by randomly drawing its eigenvalues λ 1 , . . . , λ d from an exponential distribution with parameter γ. A sparse, symmetricpositive definite matrix with these eigenvalues was then drawn using the MATLAB function sprandsym. The proportion of non-zero entries was taken to be 20%. Subsequently, a vector x * was drawn from a reference distribution µ ref on R d , and b was computed as b = Ax * . Throughout, the reference distribution for x * was taken to be µ ref = N (0, I). For this experiment d = 100 and γ = 10. In all cases the prior mean was taken to be x 0 = 0. The prior covariance was alternately taken to be Σ 0 = I, Σ 0 = A −1 and Σ 0 = (P P ) −1 where P was a preconditioner found by computing an incomplete Cholesky decomposition with zero fill-in. This decomposition is simply a Cholesky decomposition in which the (approximate) factorL has the same sparsity structure as A.
The preconditioner is then given by P =LL . The matrixL can be computed at a computational cost of O(nnz(A) 3 ) where nnz(A) is the number of nonzero entries of A. Furthermore, P −1 is cheap to apply because its Cholesky factor is explicit. In addition, the Krylov subspace prior introduced in Section 4.1 has been examined. While it has been noted that the choice Σ 0 = A −1 is generally impractical, for this illustrative example A −1 has been computed directly. Additional experimental results which apply the methodology discussed in this section to higher-dimensional problems is presented in Section S4. Figure 1 the convergence of the posterior mean x m from BayesCG is contrasted with that of the output of CG, for many test problems x * with a fixed sparse matrix A. As expected from the result of Proposition 9, the convergence of the BayesCG mean vector when Σ 0 = I is slower than in CG. In this case, the speed of convergence for BayesCG is gated by κ(A A) which is larger than the corresponding κ(A) for CG. The a priori optimal search directions also appear to yield a slower rate than the BayesCG search directions, owing to the fact that they do not exploit knowledge of b. Similarly as expected, the posterior mean when Σ 0 = A −1 is identical to the estimate for x m obtained from CG. The fastest rate of convergence was achieved when Σ 0 = (P P ) −1 , which provides a strong motivation for using a preconditioner prior if such a preconditioner can be computed, though note that a preconditioned CG method would converge at a yet faster rate gated by κ(P −1 A).

Point Estimation In
In the lower row of Figure 1 the convergence is shown when using batch-computed directions. Here convergence appears to be faster than when using the sequentiallycomputed directions, at correspondingly higher computational cost. The batch-computed directions provide an exact solution after m = d iterations, in contrast to the sequentiallycomputed directions, for which numerical conjugacy may not hold.
Convergence for the Krylov subspace prior introduced in Section 4.1 is plotted in the right-hand column. The size of the computed subspace was set to n = 20, with σ = x * A and ξ = κ(A)−1 κ(A)+1 , as these quantites are easily computable in this simplified setting. The remaining parameter was set to γ = 0.01, to ensure that low prior weight was given to the remaining subspaces. With the sequentially computed directions significant numerical instability is observed starting at m = 20. This does not occur with the batch computed directions, where a jump in the convergence rate is seen at this iteration.
Posterior Covariance In this section the full posterior output from BayesCG is evaluated. In Figure 2, the convergence rate of tr(Σ m ) is plotted for the same set of problems just described to numerically verify the result presented in Proposition 3. Note that while Figure 1 has its y-axis on a log-scale, Figure 2 uses a linear scale. It is clear that when the more informative CG or BayesCG search directions are used, the rate of contraction in the posterior mean does not transfer to the posterior covariance. In the remaining columns of the figure, tr(Σ m ) appears to contract at a roughly linear rate, in contrast to the exponential rate observed for x m . This indicates that tightening the bound provided in Proposition 3 is unlikely to be possible. Furthermore, in the last two columns of Figure 2, the impact of numerical non-conjugacy is apparent as the posterior covariance takes on negative values at around m = 20.

Uncertainty Quantification
We now turn to an assessment of the quality of the uncertainty quantification (UQ) being provided. The same experimental setup was used as in the previous sections, however rather than running each variant of BayesCG to m = d, instead we ran these until m = 10 to ensure that uncertainty quantification (UQ) is needed. To avoid the issue of negative covariances seen in Figure 2, the batch-computed search directions were used throughout.
First, the Gaussian version of BayesCG from Proposition 6 was evaluated. To proceed we used the following argument: When the UQ is well-calibrated, we could consider x * as plausibly being drawn from the posterior distribution N (x m , Σ m ). Note that Σ m is of rank d−m, but assessing uncertainty in its null space is not of interest as in this space x * has been determined exactly. Since Σ m is positive semidefinite, it has the singular-value decomposition where 0 m,n denotes an m × n matrix of zeroes, D ∈ R (d−m)×(d−m) is diagonal and U ∈ R d×d is an orthogonal matrix. The first d − m columns of U , denoted U d−m , form a basis of range(Σ m ), the subspace of R d in which x * is still uncertain. Under this hypothesis we can therefore derive a test statistic where here I n ∈ R n×n is the identity matrix. Note that the pre-factor U d−m is not necessary in the final expression as this norm is unitarily invariant. Thus to evaluate the UQ we can draw many test problems x * ∼ µ ref , evaluate the test statistic Z(x * ) and compare the empirical distribution of this statistic to χ 2 d−m . If the posterior distribution is well-calibrated we expect that the empirical distribution of the test statistic will resemble χ 2 d−m . An overly-conservative posterior will exhibit a "left-shift" in its density, as x m is closer to x * than was expected. Likewise, an overly confident posterior will exhibit a "right-shift".
In Figure 3 the empirical distribution of the statistic Z was compared to its theoretical distribution for different prior covariances. The empirical distributions were plotted as kernel density estimates based upon the computed statistic for 500 sampled test problems. Clearly the a priori optimal directions provide well-calibrated UQ, while for BayesCG the UQ provided by the posterior was overly-conservative for the prior covariances Σ 0 = I, A −1 and (P P ) −1 . This reflects the fact that the search directions encode knowledge of b, but this knowledge is not reflected in the likelihood model used for conditioning, as discussed following Proposition 7. Furthermore, note that the quality of the UQ seems to worsen as the convergence rate for x m improves, with Σ 0 = (P P ) −1 providing the most conservative UQ.
For the Krylov subspace prior, which encodes intuition for how search directions are selected, better UQ was provided. Though the empirical distribution of Z is not identical to the theoretical distribution, the supports of the two distributions overlap. Thus, while the Krylov subspace prior does not fully remedy the issue caused by the use of b in the search directions, some improvement is seen through the incorporation of knowledge of b into the prior.
Next we assessed the UQ provided by the multivariate t posterior presented in Proposition 11. A similar procedure was followed to the Gaussian case, with a different test In the present setting, µ = x m and Σ = Σ m . Furthermore S 2 2 ∼ χ 2 d−m . Lastly, multiplying both sides by m/(d − m) we have The ratio on the right-hand-side is known to follow an F (d − m, m) distribution. In Figure 4 the empirical distribution of the test statistic Z(x * ) was compared to the F (d − m, m) distribution for each of the posterior distributions considered. Again, the posterior distribution based on the a priori optimal search directions was well-calibrated, while the posteriors from BayesCG trade fast convergence in mean with well-calibrated UQ. As before, BayesCG with the Krylov subspace prior appears to provide the bestcalibrated UQ of the (practically useful) priors considered. Note that in both Fig. 3 and Fig. 4, for the choice Σ 0 = (P P ) −1 , which has the most rapidly converging mean in Fig. 1, poor UQ properties are observed, making this otherwise appealing choice impractical. To address this we have explored a heuristic procedure for setting ν m , which aims to match the posterior spread to an appropriate estimate of the error x m − x * 2 . This procedure is reported in Section S3, along with experimental results based upon it.
Comparison to Earlier Work In this section BayesCG is compared to the method proposed in Hennig [2015], which is also briefly recalled here; for full details see the cited work. Rather than performing inference on x * , Hennig [2015] proposed to treat the matrix inverse A −1 as the unknown object. Let A ∈ R d 2 denote the vectorisation of A, defined to be the column vector formed by stacking the transposed rows of A vertically; that is: Assume that A is symmetric positive definite and let H = A −1 . Then a Gaussian prior was placed on H with mean H 0 and covariance W W , the symmetric Kronecker product of a symmetric postive-definite matrix W ∈ R d×d with itself. The observations for the inference were defined by search directions S m and data Y m , each a matrix in R d×m and such that AS m = Y m . The posterior distribution H|Y m is then given by where W m is another symmetric positive definite matrix. To connect the approach of Hennig [2015] to our work, observe that a probability model for A in Ax = b induces a probability model for x as the quantities are deterministically coupled: Thus, projection of the matrix-valued posterior given in Hennig [2015] onto the solution space R d is straightforward and does not involve costly computation of the symmetric Kronecker product.
For the experiments that follow, the prior distribution was taken to be H 0 = W = I. The system A was a sparse symmetric positive-definite matrix generated randomly as previously described. The true solution x * was drawn from a multivariate Gaussian distribution N (0, I), again as described. To ensure a fair comparison, the prior distribution for BayesCG was taken to be the projection of the matrix-valued prior of Hennig [2015] onto the solution space R d , i.e. x 0 = b and Σ 0 = b b · I + bb . Note that Hennig [2015] recommend to use an implicit prior to enforce concordance with the CG method. In this experiment the goal was simply to compare the posterior distributions produced by the two approaches, so an arbitrary but equivalent prior was sufficient. The search directions used to form the posterior in  were taken to be the same as the search directions output by BayesCG. Batch-computed search directions were again used. Thus, the posteriors for the two methods are based upon the same prior assumptions at the level of x * . The information provided to each method is given by the same search directions, but is not equivalent as the projection applied to the search directions is different for each method. In Figure 5 the convergence of x m from BayesCG was compared to that of the posterior mean in the solution space implied by the matrix-valued method. As might be expected, the convergence rate was approximately the same between the two methods, though the posterior means are not identical. This is due to the aforementioned fact that the information in each example is not equivalent. In Hennig [2015] the information is S m = A −1 Y m , while in BayesCG it is given by S m Ax * = S m b. Thus, the information in BayesCG is obtained by left-multiplying the information in Hennig [2015] by b . Since this projection is not invertible, more information is available to the method of Hennig [2015] than BayesCG, so we should not expect the posteriors to be identical. The instability in the left-hand plot is owing to the non-conjugacy of the directions from BayesCG for the matrix-valued method, which introduces a requirement to solve a linear system. Hennig [2015] propose alternative search directions which eliminate this linear system solve, but this would prevent a fair comparison with BayesCG.
In Figure 6 the UQ provided by the Gaussian version of BayesCG and the method of Hennig [2015] were compared using the approach previously described. Again both approaches provide very conservative UQ compared to the theoretical distribution, with each highly peaked close to zero. The empirical density provided by Hennig [2015] appears to be slightly closer to the theoretical distribution, but both are far from where the theoretical distribution is concentrated so that the difference is inconsequential.
However, while the properties of the two methods for equivalent priors is empirically similar, the range of prior covariance structures introduced and examined in this paper is more difficult to include in the method of Hennig [2015]. In particular, it is less clear how the information exploited in the Krylov subspace prior should be used to inform prior choice in the approach of Hennig [2015], as this knowledge about the solution space rather than A −1 . Furthermore, while knowledge of P −1 provides information about the location of A −1 , it is unclear that this choice of prior mean in Hennig [2015] would result in improved performance as it does in BayesCG. Thus, we believe that the increased flexibility and intuition in prior choice makes BayesCG a more attractive method to a user than the method of Hennig [2015].

Electrical Impedance Tomography
Electrical impedance tomography (EIT) is an imaging technique used to estimate the internal conductivity of an object of interest [Somersalo et al., 1992]. This conductivity is inferred from measurements of voltage induced by applying stimulating currents through electrodes attached to its boundary. EIT was originally proposed for medical applications as a non-invasive diagnostic technique [Holder, 2004], but it has also been applied in other fields, such as engineering .
The physical relationship between the inducing currents and resulting voltages can be described by a PDE, most commonly the complete electrode model (CEM) [Cheng et al., 1989]. Consider a domain D ⊂ R n representing the object of interest, where typically n = 2 or n = 3. Denote by ∂D the boundary of D, and let σ(z) denote the conductivity field of interest, where z ∈ D. Denote by {e l } L l=1 the L electrodes, where each e l ⊂ ∂D and e l ∩ e m = ∅ whenever l = m. Let v(z) denote the voltage field, and let {I i,l } L l=1 denote the set of stimulating currents applied to the electrodes. Let {V σ i,l } L l=1 denote the corresponding voltages, and let n denote the outward-pointing normal vector on ∂D. The subscript i here is to distinguish between multiple stimulation patterns which are generally applied in sequence and are of relevance to the inversion problem for determining σ(z) later. Denote by {ζ l } L l=1 the contact impedance of each electrode. The contact impedances are used to model the fact that the contact between the electrode and the boundary of the domain is imperfect. Then the CEM is given by z ∈ e l , l = 1, . . . , L.
A solution of this PDE is the tuple (v(z), V σ i,1 , . . . , V σ i,L ), consisting of the interior voltage field and the voltage measurements on the electrodes. The numerical solution of this PDE can be reduced to the solution of a linear system of the form in Eq. (1), as will shortly be explained.
Having specified the PDE linking stimulating currents to resulting voltages, it remains to describe the approach for determining σ(z) from noisy voltage measurements. These physical voltage measurements are denoted by the matrix V ∈ R L×(L−1) , where V i,l is the voltage obtained from stimulation pattern i at electrode l. The recovery problem can be cast in a Bayesian framework, as formalised in Dunlop and Stuart [2016]. To this end, a prior distribution for the conductivity field is first posited and denoted µ σ . Then, the posterior distribution µ V σ is defined through its Radon-Nikodym derivative with respect to the prior as where Φ(σ; V ) is known as a potential function and exp(−Φ(σ; V )) is the likelihood. This posterior distribution is for an infinite-dimensional quantity-of-interest and is generically nonparametric, thus sampling techniques such as the preconditioned Crank-Nicolson (pCN) algorithm Cotter et al. [2013] are often employed to access it. Such algorithms require repeated evaluation of Φ(σ; V ) and thus the repeated solution of a PDE. Thus, there is interest in ensuring that Φ(σ; V ) can be computed at low cost. Figure 8a and is due to Isaacson et al. [2004]. This is described in detail in the supplement. In the absence of specific data on the accuracy of the electrodes, and for convenience, the observational noise was assumed to be Gaussian with standard deviation δ = 1. This implies a potential of the form:

Experimental Setup The experimental set-up is shown in
where V σ is the matrix with (i, l)-entry V σ i,l . The notation V ∈ R L(L−1) denotes the vectorisation of V , as introduced in Section 6.1.
Apart from in pathological cases, there is no analytical solution to the CEM and thus evaluating Φ(σ; V ) requires an approximate solution of Eq. (13). Here a finite-element discretisation was used to solve the weak form of Eq. (13), as presented in Dunlop and Stuart [2016] and described in more detail in the supplement. This discretisation results in a sparse system of equations Ax * = b, where A is in this context referred to as a stiffness matrix. To compute A and b, standard piecewise linear basis functions were used, and the computations were perfomed using the FEniCS finite-element package. A fine discretisation of the PDE will necessarily yield a high-dimensional linear system to be solved. The idea proposed here is to use BayesCG to approximately solve the linear system, and propagate the solver uncertainty from BayesCG into the the inverse problem associated with recovery of the conductivity field. In essence, this provides justification for small values of m to be used in the linear solver and yet ensure that the inferences for σ remain valid.
The Gaussian version of BayesCG was used throughout, as described in Proposition 6. Thus, assume that the output from BayesCG is x ∼ N (x m , Σ m ). The finite element approximation to the voltages V σ i,l is linearly related to the solution x * of the linear system, so that BayesCG implies a probability model for the voltages of the form V σ ∼ N ( V σ m , Σ σ m ) for some V σ m and Σ σ m ; for brevity we leave these expressions implicit. The approach proposed is to derive a new potentialΦ, obtained by marginalising the posterior distribution output from BayesCG in the likelihood. It is straightforward to show that, for the Gaussian likelihood, this marginalisation results in the new potential Thus, the new likelihood exp(−Φ(σ; V )) is still Gaussian, but with a covariance inflated by Σ σ m to account for the level of inaccuracy in the BayesCG solver. It will be shown that replacing Φ withΦ leads in turn to a posterior distributionμ V σ for the conductivity field which is appropriately widened to account for the additional uncertainty modelled in BayesCG.
Throughout this section the prior distribution over the conductivity field was taken to be a centered log-Gaussian distribution, log(σ) ∼ GP(0, k), with a Matérn 5/2 covariance as given by: The length-scale parameter was set to = 1.0, while the amplitude a was set to a = 9.0 to ensure that where the posterior distribution is concentrated has significant probability mass under the prior.
Forward Problem Initially the solution to the forward problem, i.e. solving the PDE, was considered for a particular stimulation pattern. In Figure 7 the convergence of the point estimates provided by BayesCG and CG were compared, both with the isotropic prior covariance Σ 0 = I and with the preconditioner covariance Σ 0 = (P P ) −1 . For discretising the PDE, we used mesh resolutions N d = 64, 128 and 256, where a larger N d provides a more accurate discretisation. The precise method of generating the meshes and the meaning of N d is described in the supplement. For the conductivity field we examined both samples from µ σ and the mean of µ V σ , denotedσ and obtained using MCMC with a fine discretisation of the PDE and an accurate linear solver. As in the previous section, the preconditioner P was given by an incomplete Cholesky factorisation of A.
The results in Figure 7 show that when Σ 0 = I convergence is slow, but that this is improved when the preconditioner prior is used. This is the same as observed in the previous section, but since this problem is now obtained from an applied example rather than being given by an arbitrary sampled system, it is useful to know that the same observations transfer. The convergence of the estimator from the perspective the conductivity field σ is displayed in the supplement. Inverse Problem In this section, the solution to the inverse problem when using the BayesCG potentialΦ is compared to the posterior obtained from the exact potential Φ. In the latter case CG was used to solve the system to convergence to provide a bruteforce benchmark. For BayesCG, the prior was centered, x 0 = 0, and the preconditioner prior covariance, Σ 0 = (P P ) −1 , was used. BayesCG was run to m = 80 iterations, (a) Set-up for the experiment described in Section 6.2.  Figure 8: Comparison of the posterior distribution over the conductivity field, when using BayesCG to solve the linear system arising from the forward problem compared to using standard CG.
for the mesh with N d = 64. This mesh results in a linear system with d = 311, so 80 iterations represents a relatively small amount of computational effort. In Figure 8 the posterior distribution over the conductivity field is displayed. In Figures 8b and 8c, respectively, the exact posterior mean and the posterior mean from BayesCG are plotted. Note that, as indicated in the previous section, many of the features of the conductivity field have been recovered even though a relatively small number of iterations have been performed. In Figure 8d the ratio of the pointwise posterior standard deviation from BayesCG to that in CG is plotted. Clearly, throughout the entire spatial domain, the posterior distribution has a larger standard deviation, showing that the posterior uncertainty from BayesCG has successfully been transferred to the posterior over the conductivity field. This results in a posterior distribution which is wider to account for the fact that an imperfect solver was used to solve the forward problem. Overall, the integrated standard deviation over the domain is 0.0365 for BayesCG, while for the exact posterior it is 0.0046. This example illustrates how BayesCG could be used to relax the computational effort required in EIT in such a way that the posterior is widened to account for the imperfect solution to the forward problem. This setting, as well as other applications of this method, should be explored in more detail in future work.

Conclusion and Discussion
In this paper we have introduced and theoretically analysed the Bayesian conjugate gradient method, a Bayesian probabilistic numerical method for the solution of linear systems of equations. Given the ubiquity of linear systems in numerical computation, the question of how to approximate their solution is fundamental. Contrary to CG and other classical iterative methods, BayesCG outputs a probability distribution, providing a principled quantification of uncertainty about the solution after exploring an m-dimensional subspace of R d . Through the numerical example in Section 6.2 we have shown how this output could be used to make meaningful inferences in applied problems, with reduced computational cost in terms of iterations performed. This could be applied to a broad range of problems in which solution of large linear systems is a bottleneck, examples of which have been given Section 1.1 Prior Choice Prior choice was discussed in detail. An important question that arises here is to what extent the form of the prior can be relaxed. Indeed, in many applied settings information is known about x * which cannot be encoded into a Gaussian prior. A common example of this arises in the solution of PDEs, when it is often the case that the true solution to the PDE is sign-constrained. However, in encoding additional prior information it is likely that the conjugacy properties exploited to construct a closedform posterior will be be lost. Then, interrogating such posteriors would require sampling techniques such as the numerical disintegration procedure of , which would incur a dramatically higher cost. Research to determine what prior knowledge can be encoded (either exactly or approximately) without sacrificing numerical performance will be an important future research direction.
It was shown how a numerical analyst's intuition that the conjugate gradient method "tends to work well" can be encoded into a Krylov-based prior. This went some way towards compensating for the fact that the search directions in BayesCG are constructed in a data-driven manner which is not explicitly acknowledged in the likelihood. Alternative heuristic procedures for calibrating the UQ were explored in the supplement, Section S3. An important problem for future research will be to provide practical and theoretically justified methods for ensuring the posterior UQ is well-calibrated.

Computational Cost and Convergence
The computational cost of BayesCG is only a constant factor higher than that of CG. However, the convergence rates reported in Section 3 can be slower than those of CG. To achieve comparable convergence rates, the prior covariance Σ 0 must be chosen to counteract the fact that the rate is based on κ(Σ 0 A A) rather than κ(A), and this can itself incur a substantial computational cost. Future work will focus on reducing the cost associated with BayesCG.

S1 Proof of Theoretical Results
Proof of Proposition 1. Note that the joint distribution of x and y m is given by , from which the stated conditional distribution is deduced.
Proof of Proposition 2. Consider an arbitrary vector ℓ ∈ R d . Now and so: (S1) where the last line follow from Cauchy-Schwarz. Now, by expanding the term ( * ) and simplifying, we see that which follows from Eq. 6 Finally let e i denote the vector whose j th entry is δ ij and note that where the last line uses the fact that the trace is invariant under cyclic permutation of the argument.
Proof of Proposition 3. Note that where the third line uses the fact that the trace is invariant under cyclic permutation of the argument.
Proof of Proposition 4. First, note that since the columns of S CG m are A-orthonormal. Then, from Proposition 1 we have Proof of Proposition 5. We first introduce the concept of an average-case optimal algorithm and average-case optimal information. The information space B and the solution space X are, informally, the spaces in which the right-hand-side and the solution of the system live, respectively. We wish to computationally approximate an intractable solution operator A(b), based upon a finite amount of information provided by the information operator S m : B → R m . This is accomplished by an algorithm ψ(S m (b)), which we hope approximates A(b) well in a way which will now be made formal.
For a reference measure ν on B, denote the average-case error of an algorithm ψ with information S m as An algorithm ψ * which minimises e avg (·, ψ) for arbitrary S m is said to be averagecase-optimal. An S * m which minimises e avg (S m , ψ * ) is said to be average-case optimal information.
By Theorem 3.3 of , in the present setting optimal information for the average risk in Eq. (10) is identical to average-case optimal information. This is by virtue of the fact that, for any symmetric positive-definite M , (R d , ⟨·, ·⟩ M ) forms an inner-product space. Now recall two relevant theorems from Novak and Woźniakowski [2008]. For measurable spaces (B, F B ) and (X, F B ), an operator A : B → X and a measure µ on B, let A # µ denote the pushforward of µ through A, a measure on X defined as Theorem S1 (Theorem 4.28 of Novak and Woźniakowski [2008]). Let B be a separable real Banach space equipped with a zero-mean Gaussian measure ν with covariance operator C ν . Let the solution operator A : B → X be a bounded linear operator into a separable real Hilbert space X with inner product ⟨·, ·⟩ X . Let η = A # ν be a Gaussian measure on solution elements. Consider linear information S m = [s 1 , . . . , s m ] where s i : B → R and s i (C ν s j ) = δ ij , and consider information y i = s i (b). Then the algorithm is average-case optimal.
Theorem S2 (Theorem 4.30 of Novak and Woźniakowski [2008]). Under the assumptions of Theorem S1, for b ∈ B the optimal information S * m is given by We will first establish that the posterior mean from Proposition 1 represents an average-case optimal algorithm, by applying Theorem S1. In the notation of that theorem, B = X = R d , which satisfies the required assumptions as R d is separable. The measure ν is given by ν = A # µ ∼ N (0, AΣ 0 A ⊤ ), so that C ν = AΣ 0 A ⊤ . Furthermore the information operator S m is simply a matrix in R d×m , which is subject to the restriction from Theorem S1 that Λ m = S ⊤ m AΣ 0 A ⊤ S m = I. Note that this is markedly similar to the conjugacy requirement in Section 2.2. Now we seek the optimal algorithm ψ(b) which minimises ∫ is of the form required by Theorem S1, with the solution operator A = M 1 2 A −1 , which is a bounded linear operator as required. For any S m conjugate to AΣ 0 A ⊤ , the optimal algorithm is therefore given bȳ In this conjugate setting with x 0 = 0, this is identical to the expression for x m in Proposition 1. Theorem S2 can now be applied to determine the optimal information S * m . Note that since A is a bijection, , with the eigenvectors normalised so that (ϕ * i ) ⊤ ϕ * i = 1. It then holds from Theorem S2 that the optimal search directions are given by Lastly, noting that the scaling by (γ * i ) − 1 2 does not affect the output yields the result that the optimal information is given by Proof of Proposition 6. First, note that Λ m = I as the search directions {s i }, i = 1, . . . , m are Q-orthonormal, where Q = AΣ 0 A ⊤ . Then, from Eq. 5: It therefore remains to show that s ⊤ m r 0 = s ⊤ m r m−1 . To this end, from Eq. 5 we have which completes the proof.
Proof of Proposition 7. Lett 1 := r 0 , and for each m > 1, definet m as where Q = AΣ 0 A ⊤ . Let t m =t m /∥t m ∥ Q . We will show, inductively, that for each m the set of search directions {t i } m i=1 is Q-orthonormal, and further that each t i = s i , as defined in the proposition statement.
For m = 1 the set {t 1 } is trivially Q-orthonormal and t 1 = s 1 . For m > 1 suppose {t i } m−1 i=1 is Q-orthonormal and such that t i = s i , for i = 1, . . . , m − 1. Then, for j < m which shows that the set {t i } m i=1 is Q-orthonormal. As a result we can apply Proposition 6 to show that Since the set {t i } m i=1 is Q-orthonormal, we have from Lemma S3 that for each j ≤ m, r ⊤ m t j = 0. Thus, from Eq. (S5) for each j ≤ m: from which we conclude that r ⊤ m r j = 0 whenever j < m. It follows that Eq. (S6) is zero for all j < m − 1. Thus, all terms in the summation in Eq. (S4) vanish apart from the last, and we are left witht m = r m−1 − (r ⊤ m−1 Qt m−1 )t m−1 which is equal tos m for each m > 1, completing the proof.
Proof of Proposition 11. First the posterior marginal for ν is computed. Note that . Now to determine the posterior marginal for x Eq. S8 is recognised as the integral of an unnormalised inverse-Gamma density, so that and therefore . Proof is by induction, with the additional inductive claims that Note that Eq. (S9) implies the required result by Proposition 1. Let Q = AΣ 0 A ⊤ . For m = 1, the first search direction is given by Now for the inductive step. Assume that Equations (S9) and (S10) hold true up to m − 1. From Proposition 7 we have that where inclusion in the Krylov subspaces is by the inductive assumption. It follows that Σ 0 A ⊤ s m ∈K m−1 . Lastly, observe that which by the inductive assumption is inK m , as required.
Proof of Proposition 9. Let Q = AΣ 0 A ⊤ . Begin with m = 1. Any x ∈ K * 0 can be represented as x 0 + α 1 Σ 0 A ⊤ r 0 for some α 1 . Thus, when x ∈ K * 1 : Setting this to zero, we obtain: From Proposition 6, this corresponds to x = x 1 . It is further clear that As a result, for m > 1 it suffices to determine α m in Again, α m is determined directly, much as above: which is also a minimum. Thus, we have that arg min from Proposition 6, which completes the proof.
Proof of Proposition 10. We begin by introducing the operator norm induced by the energy norm ∥ · ∥ A , which is a norm on matrices M ∈ R d×d From Proposition S4 it holds that there exists a polynomialP m−1 of degree m − 1 such that where P m is some polynomial of degree m. Thus where ∥V ∥ op I ∥V ⊤ ∥ op I = 1 follows since V is unitary. Let P m denote the set of all polynomials of order m with the property that P (0) = 1 for each P ∈ P m . This requirement ensures that if A is singular, ∥e m ∥ Σ −1 0 = ∥e 0 ∥ Σ −1 0 for all m. Now, from Proposition 9 we have that P m ∈ P m is constructed to minimise the error e m . LetΓ denote the set of eigenvalues of Σ |P (γ)| (S12) Lemma S5, proven below, establishes that the polynomial minimising this expression is where T m (·) is the m th Chebyshev polynomial of the first kind. Let κ = γ max /γ min . Now, T m (z) ∈ [−1, 1] for all m and all z ∈ [−1, 1]; thus the numerator takes maximum value 1. Therefore Lastly, note that by definition Inserting this into Eq. (S11) and recalling that since Σ 1 2 0 A ⊤ AΣ 1 2 0 has the same eigenvalues as Σ 0 A ⊤ A, it also has the same condition number, completes the proof.
Lemma S5 (Appendix S3 of Shewchuk [1994]). Eq. (S12) is minimised by where T m is the m th Chebyshev polynomial of the first kind.

S2 Further Simplication of BayesCG
In this section, the simplifications mentioned in Section 5 and exploited in Algorithm 1 are described in detail. First, two coefficients must be calculated, one to update x m and one to updates m . Note that for stability reasons we work with un-normalized rather than normalized search directions where possible. As usual let Q = AΣ 0 A ⊤ , and express these quantities as Now, using the expression fors m , note that since, from Lemma S3,s ⊤ m r m = 0. Furthermore, from the proof of Proposition 7, we have These two simplifications allow rearranging the expressions in Proposition 6 into Algorithm 1.

S3 Additional Numerical Results for Simulation Study
In this section we discuss an empirical procedure for calibrating the scale of the posterior covariance, in an attempt to compensate for the fact that search directions depend upon x * , and show that this results in better calibrated UQ. The proposed approach is to construct an error indicator over the course of the algorithm, and then use this to adjust an appropriate measure of spread of the posterior to match that error prediction.

Constructing the Error Indicator
The aim here is to construct a proxy for the true error by measuring the convergence of the BayesCG mean. Let The idea is to perform a simple regression on the values {z i } m i=1 and use the fitted model ν(i) to extrapolate the error forward. Justified by the exponential convergence rate of BayesCG, as well as its simplicity, a log-linear function ν(i) = exp(a + bi) has been used.
To derive our error indicator we use the following triangle inequality bound: Thus α m provides an approximate upper-bound for ∥x m − x * ∥ 2 .
2. Fitting the Posterior Next we adjust the spread of the posterior, in a somewhat ad-hoc manner, based on the approximate upper-bound α m on the true error. This requires the posterior spread to be quantified, and for the ease of computability we used trace(ν m Σ m ). Thus, to be concrete, we would like to select ν m so that the spread .
Note that, since α m appears in the numerator and provides an approximate upper bound for the true error, the UQ provided will still be conservative in general.

Results
The UQ computed in the main text, Fig. 3, can be compared with the UQ under this proposal shown in Fig. S1. The UQ under this proposal is substantially better calibrated than when the Jeffrey's prior is used in all cases apart from the case when Σ 0 = A −1 . However since this performs well for all the practical choices of prior covariance suggested, the results indicate that approaches based on heuristic calibration of posterior spread could be used to compensate for the fact that the search directions have been constructed in a data-driven manner.

S4 Experimental Results for Higher-Dimensional Systems
In this section additional experimental results are reported for higher-dimensional systems than those considered in Section 6.1. The experimental protocol adopted in that section is challenging to adapt to higher-dimensional systems, as the method for generating sparse positive-definite matrices has empirically been found to produce numerically singular matrices when d is increased. As a result, in this section we will adopt a more structured approach to generating random systems, based on discretisation of a simple elliptic PDE. Specifically, the PDE considered is the following PDE with random boundary conditions: where log f (z) ∼ GP(0, k(z, z ′ ; ρ)) and k(z, z ′ ; ρ) is a Matérn covariance function: .
For the purposes of this experiment the length-scale was fixed to ρ = 0.1. Eq. (S14) was discretised with the finite-element method using standard piecewise linear basis functions as implemented in FEniCS, as in Section 6.2. The domain was discretised using a simple regular triangular mesh over the domain resulting in d elements.
To investigate the performance of BayesCG as a function of the dimension of the system, three different discretisation levels were used: d = 121, d = 1089 and d = 10201.
A subset of the priors from Section 6.1 were considered. The prior Σ 0 = A −1 was not used, as for d = 10201 computing this is impractical. Similarly, the procedure we have used for calculating the Krylov subspace prior in that section requires knowledge of κ(A), which is also impractical. As a result we have focussed on the prior Σ 0 = (P ⊤ P ) −1 where P is, as in Section 6.1, a preconditioner based on an incomplete Cholesky factorisation of A. Results for Σ 0 = I are also included.
Results for convergence of the posterior mean are reported in Fig. S2. Note that the number of iterations required for both CG and BCG with the preconditioner prior seems to increase sub-linearly with dimension, while with Σ 0 = I there is qualitatively little difference as a function of dimension. Furthermore note that the rate of convergence of CG appears to overtake that of the preconditioner prior as the dimension increases, suggesting that the quality of the incomplete Cholesky preconditioner decays with dimension. However, we note that this is only one choice of preconditioner, and others preconditioners may behave better.
The quality of the UQ as a function of dimension is displayed in Fig. S3, with the statistic Z as described in Section 6.1. The UQ after ⌊ d 10 ⌋ iterations was considered. The quality of UQ provided is seen to decrease as a function of dimension d.

S5 Experimental Set-Up for EIT
The results presented in this paper used experimental data provided by EIDORS 1 and due to Isaacson et al. [2004]. In the experiment, depicted in Figure 8a, three targets were placed into a tank filled with saline, two of which are lung-shaped and one of which is heart-shaped. The lung-shaped targets have lower conductivity than the surrounding saline, while the heart-shaped target has higher conductivity. A total of 32 electrodes were placed around the boundary of the domain, and stimulated with 31 distinct stimulation patterns as described in Isaacson et al. [2004]. For each stimulation, the voltage induced at every electrode was recorded, and there are thus 32 × 31 distinct measurements on which the prior must be conditioned. The inducing currents and measured voltages were each supplied in the referenced dataset.
In the simulations the circular tank was modelled as a unit circular domain, and the electrodes were assumed to occupy precisely 1/64 th of the boundary. Thus, each electrode had length π/32 and there was a distance of π/32 between each neighbouring pair of electrodes on the boundary. Since no information is known on the quality of the electrode contact, we set the contact impedances to an arbitrary value, ζ l = 1 for each l.
The trangulations required to discretise the PDE were generated using the Python package meshpy, configured to ensure that there were N d equally sized elements on the boundary. N d was chosen to be a multiple of the number of boundary electrodes, so that each electrode corresponds to the same number of boundary elements, and other boundary elements are disjoint from all electrodes. Figure S4 shows an illustration of a triangulation of the domain used to discretise the PDE, with N d = 64. Figure S5 shows the posterior distribution obtained from BayesCG for different values of ϵ. The linear system solved was generated for N d = 128 and with the conductivity fieldσ(z). Plotted is the posterior mean from BayesCG, along with samples from the posterior distribution, over the spatial domain of the PDE. That is, the voltage field v(z) has been plotted rather than the conductivity field from the inverse problem. The top row has the largest value of ϵ, and here clearly the posterior mean deviates far from the true solution, depicted in the bottom row. However by ϵ = 5 the mean from BayesCG appears close to the truth. The second, third and fourth column show samples from the posterior distribution, and while there is significantly more noise in these columns the main characteristics of the true solution are visible even at ϵ = 20, suggesting that the use of BayesCG within a Bayesian approach to EIT can be qualitatively justified.  Figure S3: Assessment of the uncertainty quantification provided by the Gaussian version of BayesCG, for the PDE example from Section S4, as dimension is varied. The statistic Z is reported based on 100 independently sampled test problems, for different prior covariances Σ 0 . These are compared to the theoretical distribution of Z when the posterior distribution is well-calibrated. The right panel zooms in on the portion of the x-axis occupied by the statistic for Σ 0 = (P ⊤ P ) −1 .  Figure S4: Finite-element discretisation used for the EIT experiment described in Section 6.2. The mesh was generated using the Python package meshpy, configured to ensure that there were 64 equally spaced boundary elements. Red lines indicate the elements which correspond to electrodes. Green dots show the locations at which the posterior conductivity field was sampled. Fig. S6 shows the behavior of the posterior distribution when a standard CG forward solver is used but with m held fixed. At m = 40 the computed mean bears no resemblance to the actual posterior mean. Moreover, the computed distribution is over-confident, as reflected by the uniformly lower computed standard deviation, compared to the actual posterior. At m = 60 the qualitative features of the posterior mean have been recovered, though the recovery still differs noticeably from that of the actual posterior, and the computed standard deviation remains lower. This provides further motivation for the use of BayesCG, as a means to constrain the solver to fewer iterations while still obtaining estimates that are statistically meaningful. Fig. S7 repeats this experiment for the BayesCG forward solver when the preconditioner prior is used. Again, as m is increased the posterior mean exhibits clear structure in the conductivity field. While the posterior variance does not visibly appear to decrease in the bottom row, the integrated standard deviation is nevertheless decreasing, starting at 0.0586 at m = 40, decreasing to 0.0459 at m = 60 and 0.0365 at m = 80. Transactions on Medical Imaging, 23 (7) Figure S7: Posterior means (top row) and standard deviations relative to the actual ("baseline") posterior (bottom row) for the EIT problem using a BayesCG forward solver, as m is varied.