Iterated preconditioned LSQR method for inverse problems on unstructured grids

This article presents a method for solving large-scale linear inverse imaging problems regularized with a nonlinear, edge-preserving penalty term such as total variation or the Perona–Malik technique. Our method is aimed at problems defined on unstructured meshes, where such regularizers naturally arise in unfactorized form as a stiffness matrix of an anisotropic diffusion operator and factorization is prohibitively expensive. In the proposed scheme, the nonlinearity is handled with lagged diffusivity fixed point iteration, which involves solving a large-scale linear least squares problem in each iteration. Because the convergence of Krylov methods for problems with discontinuities is notoriously slow, we propose to accelerate it by means of priorconditioning (Bayesian preconditioning). Priorconditioning is a technique that, through transformation to the standard form, embeds the information contained in the prior (the Bayesian interpretation of a regularizer) directly into the forward operator and thence into the solution space. We derive a factorization-free preconditioned LSQR algorithm (MLSQR), allowing implicit application of the preconditioner through efficient schemes such as multigrid. The resulting method is also matrix-free, i.e. the forward map can be defined through its action on a vector. We illustrate the performance of the method on two numerical examples. A simple 1D deblurring problem serves to visualize the discussion throughout the paper. The effectiveness of the proposed numerical scheme is demonstrated on a three-dimensional problem in fluorescence diffuse optical tomography with total variation regularization and an algebraic multigrid preconditioner derived from the regularizer, which is the type of large-scale, unstructured-mesh problem requiring matrix-free and factorization-free approaches that motivated this work.


Introduction
Inverse problems arise in almost all fields of science when details of a model have to be determined from a set of observed data. Formally, we consider a mapping $A: X \to Y$ as the forward problem and the inversion of this mapping as the inverse problem. A defining characteristic of such problems is that they are ill-posed whenever the forward mapping is a compact operator, which, for infinite-dimensional linear operators, is equivalent to the singular values of the mapping $A$ decaying gradually to zero. As a consequence of this ill-posedness, the inversion of $A$ is unstable and requires regularization, see e.g. [1–3].
In this work, we are interested in linear problems after appropriate discretization, where $X = \mathbb{R}^{n_X}$ is the space of images (large-scale for two or three spatial dimensions) and $Y = \mathbb{R}^{n_Y}$ is the data space. A broad class of such problems includes image deblurring, denoising and inpainting, tomography based on the Radon transform (x-ray CT) or attenuated Radon transform (PET, SPECT), and fluorescence diffuse optical tomography (fDOT).
For many of those problems, the matrix representation of A becomes too large to form explicitly, even for moderately sized images. Therefore in this paper we focus on methods using an implicit representation of A, which is sometimes referred to as the matrix-free approach. Examples are projection operators in x-ray CT, PET and SPECT, and the solution of direct and adjoint partial differential equations in fDOT.

Linear problem with nonlinear edge-preserving regularizer
The problem we aim to solve is of the form
$$\min_f\ \Phi(f) := \tfrac{1}{2}\,\|Af - g\|^2 + \tau R(f), \qquad (1)$$
where $A$ is a discrete forward operator (possibly defined only through its action on a vector), $g$ denotes the data, $f$ the unknown solution, $R(f)$ is an appropriate discretization of a nonlinear regularizer of the type
$$R(f) = \int_\Omega r\bigl(|\nabla f(x)|\bigr)\,\mathrm{d}x \qquad (2)$$
with $r \in C^\infty([0,\infty))$, and $\tau > 0$ controls the relative weight of the data fidelity and the regularization term in (1). The function $r$ acting on $|\nabla f(x)|$ makes such regularizers of immediate interest to imaging problems, where preservation of the edges is desirable. Two prominent examples are
• total variation [4] (sparsity prior for edges) and its smoothed approximation
$$r(z) = z, \qquad r_T(z) = \sqrt{z^2 + T^2}, \qquad (3)$$
• the Perona–Malik penalties
$$r_T(z) = \frac{T^2}{2}\Bigl(1 - \mathrm{e}^{-z^2/T^2}\Bigr), \qquad r_T(z) = \frac{T^2}{2}\,\ln\Bigl(1 + \frac{z^2}{T^2}\Bigr). \qquad (4)$$
In (4), $T$ is a threshold parameter indicating a level of image structure below which edges are considered as noise; we apply the same interpretation to this parameter in (3) to illustrate the generic approach, even though in the total variation literature this factor is usually stated as a purely numerical correction.
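For concreteness, the penalties above and the diffusivities $c(z) = r'(z)/z$ they induce (used in the lagged diffusivity iteration below) can be tabulated in a few lines of Python. This is an illustrative sketch; the function names are ours, not the paper's:

```python
import numpy as np

def r_tv(z, T):
    """Smoothed total variation penalty (3): r_T(z) = sqrt(z**2 + T**2)."""
    return np.sqrt(z**2 + T**2)

def c_tv(z, T):
    """Induced diffusivity c(z) = r'(z)/z = 1/sqrt(z**2 + T**2)."""
    return 1.0 / np.sqrt(z**2 + T**2)

def c_pm_exp(z, T):
    """Perona-Malik diffusivity for the exponential penalty in (4):
    c(z) = exp(-z**2/T**2)."""
    return np.exp(-(z / T) ** 2)

def c_pm_log(z, T):
    """Perona-Malik diffusivity for the logarithmic penalty in (4):
    c(z) = 1/(1 + z**2/T**2)."""
    return 1.0 / (1.0 + (z / T) ** 2)
```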

Lagged diffusivity fixed point iteration
In this paper we solve (1) by searching for a critical point, $\nabla\Phi(f) = 0$, which amounts to solution of the nonlinear normal equation
$$A^T(Af - g) + \tau R'(f) = 0. \qquad (5)$$
In the continuous setting, the Fréchet derivative of $R$ at $f$ in the direction $h$ is defined as
$$R'(f)\,h = \int_\Omega \frac{r'(|\nabla f(x)|)}{|\nabla f(x)|}\,\nabla f(x)\cdot\nabla h(x)\,\mathrm{d}x, \qquad (6)$$
which is the weak form of the inhomogeneous diffusion operator
$$\mathcal{M}_f := -\nabla\cdot\bigl(c_f(x)\,\nabla\bigr), \qquad c_f(x) := \frac{r'(|\nabla f(x)|)}{|\nabla f(x)|}. \qquad (7)$$
With $M_f$ representing a discretization of $\mathcal{M}_f$, we define the discrete derivative as $R'(f) := M_f f$. As the nonlinearity is restricted to the regularizer, we solve (5) using the lagged diffusivity fixed point iteration
$$\bigl(A^TA + \tau M_{f^k}\bigr)\,f^{k+1} = A^Tg, \qquad (8)$$
i.e. the normal equation of the linearized regularized least squares problem
$$f^{k+1} = \arg\min_f\ \tfrac12\,\|Af - g\|^2 + \tfrac{\tau}{2}\,f^TM_{f^k}f. \qquad (9)$$
The problem admits a Bayesian interpretation. Under the assumption of zero-mean white Gaussian noise (i.e. the noise covariance has already been incorporated into $A$ and $g$), the unnormalized likelihood becomes $\pi(g \mid f) \propto \exp\bigl(-\tfrac12\|Af - g\|^2\bigr)$. For image reconstruction problems, choosing a prior $\pi_X$ modelling the distribution of images with weak or vanishing correlation across edges, we arrive at (1). The linearized problem (9) can be interpreted as choosing a zero-mean Gaussian prior with a (possibly improper) covariance $C_X$ defined via $C_X^{-1} = \tau M_{f^k}$, leading to the posterior distribution
$$\pi(f \mid g) \propto \exp\Bigl(-\tfrac12\,\|Af - g\|^2 - \tfrac{\tau}{2}\,f^TM_{f^k}f\Bigr). \qquad (10)$$
The linear problem with nonlinear edge-preserving regularizer also admits a Bayesian interpretation in terms of hyperpriors [9]. A hyperprior models the amplitude of the edges as a random variable, whose estimation becomes a part of the inverse problem. A method that alternately updates the parameters of primary interest and the hyperparameters results in a lagged diffusivity-type method. In particular, choosing a Gamma or inverse Gamma distribution as the hyperprior results in a lagged diffusivity iteration with a TV or Perona–Malik regularizer, respectively.
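In outline, each sweep of the iteration (8) freezes the diffusivity at the current iterate, reassembles $M_{f^k}$ and solves a linear system. The following minimal dense sketch (our names; a direct solve stands in for the Krylov methods developed below) illustrates the structure:

```python
import numpy as np

def lagged_diffusivity(A, G, g, tau, c, n_outer=10):
    """Sketch of the lagged diffusivity fixed point iteration (8).
    A : forward matrix; G : discrete gradient matrix (G.T plays -div);
    c : diffusivity function, e.g. lambda z: c_tv(z, T)."""
    f = np.zeros(A.shape[1])
    for _ in range(n_outer):
        # assemble M_{f^k} = G^T diag(c(|G f|)) G, cf. the operator (7)
        w = c(np.abs(G @ f))
        M = G.T @ (w[:, None] * G)
        # solve the linearized normal equation (A^T A + tau M) f = A^T g
        f = np.linalg.solve(A.T @ A + tau * M, A.T @ g)
    return f
```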

Overview of the contribution
We present a matrix- and factorization-free algorithm for large-scale linear inverse problems with nonlinear regularizers. The nonlinearity is limited to the regularization term and is handled with the lagged diffusivity fixed point iteration [6], which amounts to repeated solution of the problem linearized at the previous iterate.
We derive a new accelerated LSQR algorithm, MLSQR, for the solution of the linearized least squares problems. MLSQR is analytically equivalent to LSQR applied to the least squares problem preconditioned with $L^{-1}$, where $M_{f^{k-1}} = L^TL$, but in MLSQR the preconditioner is applied implicitly through $M_{f^{k-1}}^{-1}$, avoiding the costly factorization of $M_{f^{k-1}}$ and at the same time allowing the use of efficient algorithms such as multigrid methods for applying the preconditioner. Preconditioning with $L^{-1}$ is equivalent to transformation to the standard form and has recently also been termed priorconditioning [8–12] due to its interpretation as whitening of the prior in the Bayesian framework. Using the generalized singular value decomposition (GSVD), we show that the Krylov subspace corresponding to the priorconditioned problem (the problem in standard form) indeed amplifies the directions spanning the covariance of the prior, in contrast to the Krylov subspace for the original regularized problem. This explains the observed accelerated convergence for problems with discontinuities when priorconditioning with regularizers of the form (2).
The remainder of this paper is organized as follows. The following two sections deal with the solution of the linear least squares problems arising from the linearization of the normal equations in each step of the lagged diffusivity iteration. In section 2, we compare Krylov space solvers applied to the linearized problem in the original form (unpriorconditioned problem) and in standard form (priorconditioned problem). We discuss various benefits of priorconditioning and we demonstrate the superiority of the Krylov space spanned by the solver for the priorconditioned problem. Section 3 derives the MLSQR algorithm for matrix- and factorization-free solution of the preconditioned least squares problem (14). Section 4 discusses the nonlinear solver based on successive linearizations and subspace priorconditioning for the solution of the nonlinear regularized problem (1). The performance of the method is demonstrated on a three-dimensional fDOT problem in section 5. We conclude with a summary of the results and a discussion of prospective research directions.

Priorconditioning of Krylov spaces
In this section we discuss the solution of the linearized regularized least squares problem (9) using Krylov solvers. In particular, we compare the effect of Krylov solvers applied to the original (unpriorconditioned) problem (9) and to the problem in standard form (priorconditioned problem) (14).

Krylov methods for linear inverse problems
The regularized least squares problem (9) can be rewritten as
$$\min_f\ \left\|\begin{pmatrix} A \\ \sqrt{\tau}\,L \end{pmatrix} f - \begin{pmatrix} g \\ 0 \end{pmatrix}\right\|^2, \qquad M = L^TL, \qquad (11)$$
which can be solved with the LSQR algorithm [13, 14]. LSQR is analytically equivalent to the CG method applied to the corresponding regularized (but unpriorconditioned) normal equation
$$\bigl(A^TA + \tau M\bigr)\,f = A^Tg \qquad (12)$$
and solves (11) (and equivalently (9)) over the Krylov space
$$\mathcal{K}_i\bigl(A^TA + \tau M,\ A^Tg\bigr) = \mathrm{span}\bigl\{\bigl(A^TA + \tau M\bigr)^jA^Tg\bigr\}_{j=0}^{i-1}. \qquad (13)$$
For readability, in this and the next section, which deal with the linearized problem, we drop the subscript indicating the connection to the nonlinear problem, i.e. $M := M_{f^{k-1}}$. We note the following difficulties with methods based on the space (13):
• If the regularization functional is designed to promote edges, solutions exhibit slow convergence due to the slow build-up of high frequencies. As an example, the operator considered in section 2.4 is a convolution whose eigenvectors are Fourier-type modes with the oscillation frequency growing as the corresponding eigenvalue decreases. Such a modal basis provides only a poor approximation for discontinuities, which is the reason why Krylov methods converge slowly for the deconvolution of functions with jumps.
• Regularization can be controlled either through the parameter $\tau$ or through the limit on the dimension $i_{\max}$ of the solution subspace in (13). While the truncation index $i_{\max}$ is usually determined implicitly within an iterative algorithm by the choice of a stopping rule, a change of $\tau$ requires a recomputation of the subspace (13) from scratch.
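For reference, the stacked form (11) maps directly onto an off-the-shelf LSQR implementation. A minimal dense sketch (our names; it assumes an explicit factor L, exactly the object the factorization-free approach of section 3 dispenses with):

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def solve_unpriorconditioned(A, L, g, tau, iters):
    """Solve the stacked problem (11): min || [A; sqrt(tau) L] f - [g; 0] ||
    with plain LSQR; regularization enters via tau and/or early stopping."""
    K = np.vstack([A, np.sqrt(tau) * L])
    rhs = np.concatenate([g, np.zeros(L.shape[0])])
    return lsqr(K, rhs, iter_lim=iters)[0]
```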

Solution space priorconditioning
Assuming $L$ is invertible, the change of basis $\hat f := Lf$, $\hat A := AL^{-1}$ transforms (11) into the standard form
$$\min_{\hat f}\ \left\|\begin{pmatrix} \hat A \\ \sqrt{\tau}\,I \end{pmatrix}\hat f - \begin{pmatrix} g \\ 0 \end{pmatrix}\right\|^2. \qquad (14)$$
Since the transformed variable $\hat f$ now has a standard multivariate normal prior distribution, it is sometimes referred to as whitened in the signal processing literature. In the Bayesian context, the transformation to standard form is also referred to as priorconditioning [8–12].
The normal equation corresponding to the priorconditioned problem (14) is exactly the normal equation (12) with symmetric (split) preconditioning,
$$L^{-T}\bigl(A^TA + \tau M\bigr)L^{-1}\,\hat f = L^{-T}A^Tg, \qquad \text{i.e.} \qquad \bigl(\hat A^T\hat A + \tau I\bigr)\,\hat f = \hat A^Tg. \qquad (15)$$
Hence, the solution space spanned by LSQR applied to (14) is, using the shift invariance of Krylov spaces,
$$f_i \in L^{-1}\,\mathcal{K}_i\bigl(\hat A^T\hat A,\ \hat A^Tg\bigr). \qquad (16)$$

Note that for the priorconditioned formulation, the solution is contained in $L^{-1}\mathcal{K}_{i_{\max}}\bigl(\hat A^T\hat A,\ \hat A^Tg\bigr)$ for any choice of $\tau$. Hence, in the preconditioned normal equation (15), $\tau$ acts solely as a damping parameter.
Application of LSQR to (14) instead of (9) has a number of benefits:
• Local structure of the prior is embedded directly in the transformed operator $\hat A$, allowing quick convergence, see discussion in section 2.3.
• The corresponding Krylov space is the same for all values of $\tau$ due to translation invariance.
• As the prior is embedded into the transformed operator, it is possible to exploit the prior even without an explicit regularization term, i.e. setting $\tau = 0$.
• The transformed variables are dimensionless, which removes issues with physical units if further transformations, e.g. exponentiation, are used.
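When the problem is small enough to factorize $M = L^TL$ explicitly, the transformation and its benefits can be tried out directly. Below is a minimal SciPy sketch (our names) of the route that MLSQR in section 3 reproduces without the Cholesky factor:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.sparse.linalg import LinearOperator, lsqr

def solve_priorconditioned(A, M, g, tau, iters):
    """Damped LSQR on the standard-form problem (14): factor M = L^T L,
    run LSQR on A L^{-1} with damping sqrt(tau), map back f = L^{-1} fhat."""
    L = cholesky(M)                          # upper triangular, M = L^T L
    Ahat = LinearOperator(
        A.shape,
        matvec=lambda x: A @ solve_triangular(L, x),                # A L^{-1} x
        rmatvec=lambda y: solve_triangular(L, A.T @ y, trans='T'),  # L^{-T} A^T y
    )
    fhat = lsqr(Ahat, g, damp=np.sqrt(tau), iter_lim=iters)[0]
    return solve_triangular(L, fhat)         # back to original variables
```

For $\tau = 0$ the damping vanishes and regularization is by early stopping alone, as discussed above.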

Accelerating convergence of LSQR
For well-conditioned problems, preconditioning is an established technique to accelerate the convergence of Krylov subspace methods through clustering the eigenvalues of the transformed problem. For ill-posed inverse problems, the mapping $A$ is a discretization of a compact operator whose singular values accumulate at zero, rendering such clustering impossible. Therefore, a different form of preconditioning is necessary, cf [18–23] for some earlier work.
In this article we advocate acceleration by priorconditioning [8, 10–12], which embeds prior information, such as discontinuities, directly into the forward operator; this information would otherwise require a large number of iterations to be built up in the Krylov subspace. The information is embedded in all the eigenvectors, even those corresponding to the large eigenvalues, and hence enters the Krylov subspace from early on. As a result, the priorconditioned solver requires only a few iterations to converge. Technically, priorconditioning can be achieved through preconditioning with $M = C_X^{-1}$, a preconditioner based on the properties of the expected solution rather than those of the forward mapping. In contrast to conventional preconditioning, no clustering of eigenvalues takes place. On the contrary, the condition number of the priorconditioned matrix is usually larger than that of the original matrix.
In an attempt to shed light on the phenomenon, we express the corresponding Krylov spaces in terms of the generalized singular value decomposition (GSVD) of the pair $(A, L)$. The GSVD of $(A, L)$, for $L$ square and of the same column dimension as $A$, can be written in the following simplified form
$$A = U\begin{pmatrix} \Gamma \\ 0 \end{pmatrix} X^{-1}, \qquad L = VX^{-1},$$
with $U$ and $V$ orthogonal. Here we use the following, nonstandard scaling of the GSVD vectors $X$: $\Gamma$ is a matrix with the generalized singular values on the diagonal, which up to the ordering are the singular values of the matrix $AL^{-1}$. Note that by construction $X$ simultaneously diagonalizes $A^TA = X^{-T}\Gamma^2X^{-1}$ and $L^TL = X^{-T}X^{-1}$. With the above defined GSVD matrices we have $\hat A = AL^{-1} = U\bigl(\begin{smallmatrix}\Gamma \\ 0\end{smallmatrix}\bigr)V^T$, and the corresponding Krylov spaces can now be written as
$$\mathcal{K}_i\bigl(A^TA + \tau M,\ A^Tg\bigr) = \mathrm{span}\Bigl\{\bigl[X^{-T}\bigl(\Gamma^2 + \tau I\bigr)X^{-1}\bigr]^jX^{-T}\Gamma U^Tg\Bigr\}_{j=0}^{i-1}, \qquad (17)$$
$$L^{-1}\,\mathcal{K}_i\bigl(\hat A^T\hat A,\ \hat A^Tg\bigr) = \mathrm{span}\bigl\{X\,\Gamma^{2j+1}\,U^Tg\bigr\}_{j=0}^{i-1}. \qquad (18)$$
Here we made use of $L^{-1} = XV^T$. Recall that the covariance of the prior is $C_X = (L^TL)^{-1} = XX^T$. Hence the generalized singular vectors $X$ span the covariance matrix $C_X$ of the prior (even though $X$ are not orthogonal). Inspecting the Krylov vectors in (18), we see that each iteration amplifies the directions in the span of $X$, i.e. of the prior covariance, corresponding to the large generalized singular values, in direct relation to the initial vector $U^Tg$. On the other hand, we observe that $X^{-T}$ spans the range of the anisotropic diffusion matrix $M$. The GSVD form of (17) reveals that each Krylov iteration boosts the directions in the range of the regularizer $M$ spanned by $X^{-T}$ corresponding to the large singular values while damping those corresponding to the small ones with respect to the last Krylov vector; however, there is no direct relation to the initial vector as in the case of (18).
The GSVD has been previously used to describe the Krylov spaces above in [24], but no connection was made to the covariance of the prior. In [24], a projection method is derived for solving the linear problem (9) in the case where $L$ is not invertible, with the projection subspace coming from joint bidiagonalization of $A$ and $L$. Interestingly, this method yields a solution which also lies in the subspace (18).

The preconditioning interpretation allows for an important departure from the priorconditioning and standard form framework, namely the approximate inversion of $M$. In our fDOT example we use such a symmetric approximate inverse obtained with algebraic multigrid without observing deterioration of the quality of the solution. A formal analysis of the effect of the approximate inversion is out of the scope of the present paper. While each priorconditioned iteration is more expensive than its unpriorconditioned counterpart due to the application of $M^{-1}$, for many large-scale problems the additional cost is insignificant and outweighed by the benefit of faster convergence; see the discussion in section 6.

Example problem: deconvolution in 1D
Throughout the paper we use the following 1D deconvolution problem to illustrate properties of the method. Assume that the matrix $A$ in the observation model $g = Af + e$ is a discretization of a stationary convolution operator, and $e$ represents isotropic Gaussian white noise with zero mean. Figure 1 shows the setup together with draws from the prior, which with high probability contain the edges of $f_{\mathrm{true}}$ used for the construction of $M$.

Implicitly preconditioned LSQR
On structured grids, it is natural to define the prior by constructing the matrix $L$ directly, using for instance a finite difference discretization. As such $L$ are typically nonsquare and hence noninvertible, numerical methods based on the transformation to standard form revert to the $A$-weighted pseudoinverse, cf [15, 24]. On unstructured meshes, in contrast, the prior is usually given through a scaled inverse of the prior covariance, $M = L^TL$, rather than its factor $L$. $M$ is typically a second order differential operator, cf (7), which can be cheaply stored due to its sparse representation. Taking a naïve approach, one could construct the factorization $M = L^TL$ with an invertible $L$, e.g. by Cholesky decomposition. For large-scale problems such a factorization may be too expensive to compute, store and apply: although $M$ is sparse, $L$ will have considerable fill-in. Furthermore, for nonlinear problems the factorization would have to be recomputed at each linearization step of a nonlinear iteration. In this section we propose a factorization-free preconditioned LSQR algorithm (MLSQR), which solves the priorconditioned formulation (14) without actually factorizing $M$.
The split preconditioned symmetric system, (15) with $\tau = 0$, can be solved by the preconditioned conjugate gradient method without the need to provide a factorization $L^TL$ of $M$ (MCG) [25]. The key is to use suitable scalar products. For the left preconditioned normal equation
$$M^{-1}\bigl(A^TA + \tau M\bigr)\,f = M^{-1}A^Tg, \qquad (20)$$
symmetry is preserved with respect to the $M$-weighted scalar product $\langle x, y\rangle_M := x^TMy$ (cf the normal equation of (14)).
Alternatively, the right preconditioned normal equation
$$\bigl(A^TA + \tau M\bigr)M^{-1}\,\hat f = A^Tg, \qquad f = M^{-1}\hat f, \qquad (21)$$
can be solved while preserving symmetry with the $M^{-1}$-weighted scalar product $\langle x, y\rangle_{M^{-1}} := x^TM^{-1}y$, in which the system matrix is self-adjoint. It is easy to see that both the left and right preconditioned variants produce the same sequence of computations, and hence they are equivalent up to round-off errors; in the following, 'equivalent' is understood in this sense.
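SciPy's CG accepts a preconditioner through its M argument, which is exactly the apply-only (approximate) $M^{-1}$ needed here. A minimal sketch (our names) of the MCG route, under the assumption that an application of $M^{-1}$ is available:

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def solve_mcg(A, M, g, tau, apply_Minv, maxiter=50):
    """Preconditioned CG on the normal equation (12) with the regularizer
    inverse M^{-1} as the (prior)conditioner, cf (20)/(21)."""
    n = A.shape[1]
    H = LinearOperator((n, n), matvec=lambda x: A.T @ (A @ x) + tau * (M @ x))
    P = LinearOperator((n, n), matvec=apply_Minv)   # approximate M^{-1}
    f, info = cg(H, A.T @ g, M=P, maxiter=maxiter)
    return f
```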

Preconditioned LSQR
The LSQR algorithm [13, 14] is equivalent to the conjugate gradient method applied to the normal equation, but it avoids the explicit formation of the normal equation at any stage. Instead, LSQR makes use of the Golub–Kahan bidiagonalization [26]. Applied to the least squares problem
$$\min_f\ \|g - Af\| \qquad (22)$$
with the starting vector $g$, the bidiagonalization procedure can be written in matrix form as
$$\beta_1u_1 = g, \qquad AV_i = U_{i+1}B_i, \qquad A^TU_{i+1} = V_iB_i^T + \alpha_{i+1}v_{i+1}e_{i+1}^T, \qquad (23)$$
where the columns of $U_{i+1}$ and $V_i$ are orthonormal and $B_i$ is lower bidiagonal. As all the quantities $B_i$, $U_i$ and $V_i$ are independent of $\tau$, the projected least squares problem
$$\min_y\ \left\|\begin{pmatrix}\beta_1e_1 \\ 0\end{pmatrix} - \begin{pmatrix}B_i \\ \sqrt{\tau}\,I\end{pmatrix}y\right\|, \qquad f_i = V_iy, \qquad (24)$$
can be formed and solved for any value of $\tau$ without recomputing the bidiagonalization.

Factorization-free preconditioning
In the following, we derive the MLSQR algorithm, corresponding to the variant of the MCG algorithm with the $M$-weighted inner products applied to the left preconditioned normal equation (20). The same algorithm (the same sequence of computations) can be derived based on MCG with $M^{-1}$-weighted inner products applied to the right preconditioned normal equation (21). We chose to present the $M$-weighted variant (left preconditioning) because it works with the original solution, unlike the $M^{-1}$-weighted one (right preconditioning), which involves a change of basis.
To this end we introduce new variables $v_i := L^{-1}\hat v_i$, i.e. the Lanczos vectors of the bidiagonalization of $\hat A = AL^{-1}$ transformed back to the original solution space. The bidiagonalization recurrences then read $\beta_{i+1}u_{i+1} = Av_i - \alpha_iu_i$ and $\alpha_{i+1}v_{i+1} = M^{-1}A^Tu_{i+1} - \beta_{i+1}v_i$, with the normalization $\|v_{i+1}\|_M = 1$, so that only applications of $M^{-1}$ (and never of $L$) are required. To conclude, the update in terms of the new variables and the original solution reads $f_i = f_{i-1} + (\phi_i/\rho_i)\,w_i$, $w_{i+1} = v_{i+1} - (\theta_{i+1}/\rho_i)\,w_i$, exactly as in standard LSQR. The resulting MLSQR method is summarized in algorithm 2.
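The construction can be condensed into a short sketch (our names; undamped, no stopping tests): the bidiagonalization of $\hat A$ is driven by $A$, $A^T$ and an apply-only $M^{-1}$, and the usual LSQR Givens recurrences update the solution directly in the original variables. This is an illustration of the idea under the stated assumptions, not a verbatim transcription of algorithm 2:

```python
import numpy as np

def mlsqr(A_mv, At_mv, Minv, g, iters):
    """Factorization-free preconditioned LSQR (MLSQR) sketch: Golub-Kahan
    bidiagonalization of A L^{-1} (M = L^T L) carried out with applications
    of A, A^T and M^{-1} only, updating the solution in original variables."""
    beta = np.linalg.norm(g)
    u = g / beta
    p = At_mv(u)               # p tracks M v (unnormalized); avoids forming L
    z = Minv(p)                # z = v = M^{-1} p
    alpha = np.sqrt(p @ z)     # M-norm of v: ||v||_M = sqrt(p^T M^{-1} p)
    p, z = p / alpha, z / alpha
    w, f = z.copy(), np.zeros_like(z)
    phibar, rhobar = beta, alpha
    for _ in range(iters):
        r = A_mv(z) - alpha * u            # A L^{-1} vhat = A v
        beta = np.linalg.norm(r)
        u = r / beta
        p_new = At_mv(u) - beta * p
        z_new = Minv(p_new)
        alpha = np.sqrt(p_new @ z_new)
        p, z = p_new / alpha, z_new / alpha
        # standard LSQR Givens recurrences (Paige & Saunders)
        rho = np.hypot(rhobar, beta)
        c, s = rhobar / rho, beta / rho
        theta, rhobar = s * alpha, -c * alpha
        phi, phibar = c * phibar, s * phibar
        f = f + (phi / rho) * w
        w = z - (theta / rho) * w
    return f
```

With Minv the identity, the sketch reduces to plain LSQR applied to (22); with an exact $M^{-1}$ it is LSQR on the standard form problem (14) with $\tau = 0$.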
We remark that a factorization-free preconditioned LSMR method (MLSMR) can be derived in the same way, as the implicit preconditioning only affects the bidiagonalization procedure, which is the same for both methods. In [27] it is suggested that LSMR could be preferable to LSQR if early stopping is used. A comparison of the two algorithms on the priorconditioned problem, i.e. a comparison of MLSQR and MLSMR, is of independent interest but out of the scope of this paper. Here we focus on MLSQR because it minimizes the residual of the original least squares problem (possibly with damping) over the priorconditioned subspace. In contrast, MLSMR minimizes over the same subspace the residual of the split preconditioned normal equation (15), which does not immediately relate to the original problem.

Two types of regularization are relevant in our framework: Tikhonov regularization, where the parameter $\tau$ controls the amount of regularization, and early truncation of Krylov methods, in which regularization arises from the problem being projected onto a small dimensional subspace. When priorconditioning is used, Tikhonov regularization results in a simple damped least squares problem (14). Damping for a fixed value of $\tau$ can be easily incorporated in LSQR as described in [14] at the cost of doubling the number of Givens rotations. Due to the shift invariance of Krylov spaces, different choices of the damping parameter $\tau$ result in the same Krylov subspace and Lanczos vectors $V_i$. If $V_i$ are stored, the projected least squares problem (24) can be efficiently solved for multiple values of $\tau$, which is of benefit when the value of $\tau$ is not known in advance. However, the additional storage requirements may limit the feasibility of such an approach. Some techniques for solving (24) with a variable $\tau$ are discussed in [28], using the singular values and the first and last rows of the matrix of the right singular vectors of the bidiagonal matrix $B_i$. Those quantities can be obtained at a cost of $\mathcal{O}(i^2)$ at the $i$th iteration. These strategies are, however, only viable for a limited number of iterations $i$, as the singular value decomposition of the bidiagonal matrix $B_i$ cannot be efficiently updated, even though $B_i$ simply expands by a row and a column in each iteration. For larger $i$, the algorithm described in [29] for the least squares solution of (24) at a cost of $\mathcal{O}(i)$ for each value of $\tau$ is the preferable option.

Stopping criteria
In this paper we make use of two Krylov methods: the MLSQR method for solution of the priorconditioned problem (14) and the reference LSQR method for solution of the unpriorconditioned problem (11).
As the system in (11) is a well-posed but inconsistent system, we choose the criterion S2 from the original paper [13] for stopping LSQR:
$$\mathrm{(S2)} \qquad \frac{\|A^Tr_i\|}{\|A\|\,\|r_i\|} \leqslant \mathrm{ATOL},$$
where $r_i$ is the residual at the $i$th iteration and ATOL is a user-defined threshold, see [13].
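Both the S2 test above and the discrepancy test S4 introduced below reduce to scalar checks on quantities the solver already tracks; a minimal sketch (our names):

```python
def s2_satisfied(Atr_norm, r_norm, A_norm_est, atol=1e-2):
    """Criterion S2 of [13]: ||A^T r|| / (||A|| ||r||) <= ATOL, intended
    for well-posed but inconsistent least squares systems such as (11)."""
    return Atr_norm / (A_norm_est * r_norm) <= atol

def s4_satisfied(r_norm, delta, eta=1.1):
    """Morozov discrepancy principle (S4): stop once ||r_i|| <= eta * delta,
    with delta the (estimated) noise level and eta > 1 a safety factor."""
    return r_norm <= eta * delta
```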

On the other hand, it is well known that (preconditioned) LSQR, and hence MLSQR applied to the least squares problem (14), monotonically decreases the residual norm $\|\bar r_i\|$. For $\tau = 0$, (14) is an undamped problem and hence it is ill-posed. Moreover, it holds that $\bar r_i = r_i$, with $r_i := g - Af_i$ being the residual of the original least squares problem. Consequently, the sequence $\|r_i\|$ is also monotonically decreasing and the Morozov discrepancy principle is a natural choice for terminating MLSQR, at least when $\tau = 0$:
$$\mathrm{(S4)} \qquad \|r_i\| \leqslant \eta\,\delta,$$
where $\delta$ is the (estimated) noise level [30], and the factor $\eta > 1$ has been included to prevent underregularization [8].

Figure 2 compares MLSQR applied to (14) and LSQR applied to (11). The regularization parameter value $\tau = 1$ was hand-tuned for the unpriorconditioned problem. In addition, the value $\tau = 0$ was tested. For the priorconditioned problem, selecting $\tau = 0$ results in a priorconditioned solution without damping, while for the unpriorconditioned problem setting $\tau = 0$ eliminates (Tikhonov) regularization. Figures 2(a), (b) show the solution of the priorconditioned and the unpriorconditioned problem with ($\tau = 1$) and without ($\tau = 0$) regularization. The solutions in figure 2(a) correspond to the LSQR and MLSQR algorithms terminated with criterion S4 with $\delta = 10^{-2}\|g\|$ and $\eta = 1.1$, which resulted in stopping after 9 iterations for the priorconditioned problem and after 16 iterations for the unpriorconditioned problem, for both tested values of $\tau$. Figure 2(c) shows the corresponding residual norms, i.e. the discrepancy principle S4. Evaluating the solutions in figure 2(a), we conclude that S4 provides a good stopping criterion for the priorconditioned problem, but it terminates the LSQR algorithm for the regularized unpriorconditioned problem too early (considerable oscillations even for $\tau = 1$). Figure 2(b) shows the solutions obtained with the stopping criterion S2 with $\mathrm{ATOL} = 10^{-2}$. The criterion S2 over the iterations is plotted in figure 2(d). This criterion is essentially useful before the onset of oscillations, which corresponds to the last good stopping point. While, with the chosen ATOL, S2 stops the unpriorconditioned regularized problem timely (for which S2 behaves rather smoothly even beyond the value of ATOL, in particular for $\tau = 1$), it terminates the priorconditioned problem prematurely (wrong levels of the piecewise constant approximation for both values of $\tau$). Comparing the best respective solutions for both formulations, the priorconditioned formulation after 9 iterations yields a better solution than the unpriorconditioned regularized formulation after 32 iterations.

Nonlinear solver
We recall that the lagged diffusivity fixed point iteration applied to the necessary condition (5) for the minimizer of (1) requires a solution of the linearized normal equation (8) at each iteration. Here, we propose to speed up the solution of the corresponding linear least squares problem (9) using priorconditioning, i.e. solving the priorconditioned linearized problem (14) with MLSQR instead of solving the original linearized problem (11) with LSQR.
Under certain assumptions, and for a parameter value $\tau$ chosen such that the Morozov discrepancy principle S4 is satisfied, the minimization problem (1) can be reformulated as the residual method [31],
$$\min_f\ R(f) \quad \text{subject to} \quad \|Af - g\| \leqslant \eta\,\delta.$$
We emphasize that it is important to stop the Krylov method in step 4 of algorithm 3 consistently: we want to avoid the situation where the trend in the values of $R(f^k)$ becomes distorted by the Krylov solver (e.g. taking few Krylov iterations in one step while taking many iterations in the next will likely lead to an increase in $R(f^k)$). Both the discrepancy principle and a global upper limit on the number of Krylov iterations provide a consistent stopping rule. In practice, it may take numerous Krylov iterations before the first few lagged diffusivity iterates reach the noise level. To circumvent this problem we combine the Morozov discrepancy principle with a global limit on the number of Krylov iterations, which does not seem to adversely affect the convergence of the iterated MLSQR algorithm, see section 5.3.
A few more comments are in order regarding algorithm 3:
• We deliberately chose to solve for $f^{k+1}$ rather than the update $\Delta f^{k+1} = f^{k+1} - f^k$. This is because the employed regularizers encode information about the solutions $f^k$ and not about the updates $\Delta f^{k+1}$.
• Our simulations suggest that solutions of high quality can be obtained by setting $\tau = 0$ in (14) and regularizing solely by early stopping of the MLSQR algorithm. A clear advantage of this approach is that the data discrepancy $\|r_k\|$ is minimized instead of $\|\bar r_k\|$. However, setting $\tau = 0$ invalidates the interpretation of the solution computed with algorithm 3 as the MAP estimate of the posterior distribution (10).
• When $M_{f^k}^{-1}$ is computed exactly, $\tau \neq 0$ is used, and MLSQR is iterated until convergence, our scheme is equivalent to the lagged diffusivity iteration. In practice, one or more of these conditions may not hold.
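Putting the pieces together, the outer loop of algorithm 3 can be sketched as follows. The callables build_M, make_Minv and R are problem-supplied and illustrative; the inner-iteration cap and the penalty-based stopping rule follow the discussion above (the 15% threshold is the one used in the 1D example below, and the Morozov test on the inner residual is omitted for brevity):

```python
import numpy as np

def iterated_mlsqr(A_mv, At_mv, build_M, make_Minv, R, g,
                   max_outer=25, max_inner=20, rel_drop=0.15):
    """Sketch of algorithm 3 (iterated MLSQR): rebuild the lagged diffusivity
    matrix M_{f^k} at the current iterate, refresh the approximate M^{-1},
    re-solve for f from scratch with MLSQR, and stop the outer loop once the
    penalty R(f^k) ceases to decrease sufficiently."""
    f = np.zeros_like(At_mv(g))
    R_prev = np.inf
    for _ in range(max_outer):
        M = build_M(f)              # M_{f^k}, cf (7)-(8)
        Minv = make_Minv(M)         # e.g. a fixed AMG V-cycle (see below)
        f = mlsqr(A_mv, At_mv, Minv, g, iters=max_inner)   # restart from 0
        R_k = R(f)
        if R_k > (1.0 - rel_drop) * R_prev:
            break                   # relative drop in R below threshold
        R_prev = R_k
    return f
```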

Inverting the preconditioner
While our preconditioner is nonstandard in the sense that it does not approximate the inverse of the forward operator, it nonetheless amounts to the solution of an elliptic partial differential equation, a topic extensively covered in the literature; see e.g. [32–34] for preconditioning of partial differential equations and [25, 35] for linear problems in general.
Here we use the algebraic multigrid method [32]. In particular, the use of Jacobi, SSOR or incomplete LU decomposition (ILU) smoothers results in a symmetric preconditioner if the same number of iterations is taken in the pre- and post-smoothing stages. By using a fixed number of smoother steps and grid-to-grid operations, one obtains a fixed approximation of a matrix inverse that can be used to priorcondition Krylov methods such as the MLSQR derived in section 3. The resulting preconditioner is an approximate inverse of the regularizer matrix $M$. Remarkably, in our experiments in section 5, even a low-cost single V-cycle multigrid approximation to $M^{-1}$ was enough to achieve effective priorconditioning, and consequently fixing the number of cycles seems unproblematic. Further analysis is necessary to establish how to construct an approximation resulting in the most effective priorconditioner and how the accuracy of such an approximation affects the convergence of the nonlinear iteration.
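As an illustration of such a fixed, apply-only approximation, a multigrid hierarchy built with the pyamg package can be wrapped as the make_Minv callable of the sketch above; pyamg's smoothed aggregation construction and default smoothers stand in here for the classical AMG solver with ILU smoothing used in the paper:

```python
import pyamg
from scipy.sparse import csr_matrix

def make_Minv(M):
    """Fixed approximate M^{-1}: one AMG V-cycle per application, wrapped
    as a SciPy LinearOperator by pyamg. The hierarchy is rebuilt whenever
    the lagged diffusivity matrix M is reassembled."""
    ml = pyamg.smoothed_aggregation_solver(csr_matrix(M))
    P = ml.aspreconditioner(cycle='V')
    return lambda x: P.matvec(x)
```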

Iterated MLSQR: example 1D deconvolution problem
We are now in a position to consider the 1D deconvolution problem with the nonlinear Perona–Malik regularizer (4) (left equation). We solve the associated minimization problem (1) with algorithm 3.
In this particular example, we chose to use the undamped solution, i.e. $\tau = 0$. MLSQR was terminated with the discrepancy principle (S4) with the noise level $\delta = 10^{-2}\|g\|$ and $\eta = 1.1$. We stopped the lagged diffusivity fixed point iteration when the relative change in the functional $R$ dropped below 15%. As the considered problem is small, we used a Cholesky factorization for inverting the preconditioner. The evolution of the solution at every third lagged diffusivity step is illustrated in figure 4(a), and the solution error over the outer iterations is plotted in figure 4(b). The residual norm over the inner and outer iterations is shown in figure 5(a). Notice that every lagged diffusivity iteration resets the residual norm to the value $\|g\|$. This is due to the initialization of MLSQR with $f^k = 0$ at the beginning of every outer iteration, to allow the solution to develop features compatible with the new priorconditioner. The gap developing in the residual norm plot in figure 5(a) between the 5th and 6th MLSQR iterations is characteristic of the method. It means that the first five basis vectors are becoming increasingly better adapted to represent the solution, while the remaining basis vectors contribute less and less. Consequently, the updated priorconditioner allows reduction of the residual norm in fewer MLSQR iterations, as shown in figure 5(c). Figure 5(b) displays the behaviour of the functional $R$ over the inner and outer iterations. We observe that in each lagged diffusivity iteration, while the linearized problem is being solved, $R$ in general increases with each MLSQR iteration until a saturation level is reached. On the other hand, the saturation level decreases as the outer iteration progresses. At a certain point, the saturation level stops decreasing (or does not decrease fast enough), see figure 5(d), and algorithm 3 is terminated. The plot of the error norm in figure 4(b) demonstrates that the last step taken before the stopping criterion was triggered did not perceivably improve the solution, which corroborates our choice of the stopping criterion.

A 3D image reconstruction problem
In this section we apply our method to a large-scale 3D image reconstruction problem on an unstructured grid. The measured photon density for a detector with a wavelength-independent detector coupling coefficient $\Theta_d$ is given by the forward model (26)–(31). In practice several sources and detectors are deployed, resulting in a vector of measurements $y(\lambda) \in \mathbb{R}^{N_sN_d}$, where $N_s$ and $N_d$ are the numbers of sources and detectors, respectively. We furthermore use a normalization demonstrated to reduce the effects of the unknown $\Theta_s$ and $\Theta_d$ [46, 47]. The detectors were placed along the same rings on the boundary as the sources, half way in between them, resulting in a total of 6400 measurements. All the involved coupling coefficient functions were modelled as Gaussian distributions on the boundary $\partial\Omega$. Measurement errors were simulated using additive Gaussian noise with standard deviation equal to 1% of the corresponding ideal measurement.
The fDOT problem was modelled and discretized using the finite element method (FEM) with the software package TOAST [48]. The discretization with first order Lagrangian elements on a tetrahedral mesh yielded 27084 degrees of freedom, which corresponds to approximately 1.5 mm resolution. The associated inverse problem was regularized with the differentiable approximation (3) to the total variation functional in (2). The resulting matrix $M_f$ is thus a finite element discretization of the operator $-\nabla\cdot\bigl((|\nabla f|^2 + T^2)^{-1/2}\,\nabla\bigr)$. We chose $T = 10^{-6}$ as it yields a well-conditioned matrix $M$ for fluorescence yield coefficients with edges of height approximately equal to or smaller than the expected maximum of 0.1. The preconditioner was applied using a computationally low-cost algebraic multigrid solver with a single V-cycle and two steps of ILU(0) smoothing, implemented in IFISS [49, 50].

Solution of the fDOT problem
We solve the fDOT problem described in section 5.1 using the iterated MLSQR method in algorithm 3. The proposed method is particularly well suited for problems of this kind: fDOT is large-scale, and the explicit construction of the matrix $A$ representing the discretized forward mapping can be impractical. Instead, the action of $A$ on a vector is obtained through the solution of (26)–(31) for each source. Similarly, multiplication by $A^T$ involves solving (26), (28) and (32)–(34). In our example, both of these mappings amount to 160 solves of elliptic partial differential equations. Moreover, because the FEM mesh is unstructured, the preconditioner $M$ naturally arises in the unfactorized form.
We demonstrate the benefits of priorconditioning by comparing the performance of the iterated MLSQR method to an unpriorconditioned reference method, iterated LSQR (algorithm 3, but where in step 4 the unpriorconditioned linearized problem (11) is solved with LSQR terminated with rule S2, as explained in section 3.4). The tolerance in S2 was chosen as $\mathrm{ATOL} = 10^{-3}$, and in S4, $\eta = 1.1$ and $\delta = 10^{-2}\|g\|$. The same regularization parameter value $\tau = 10^4$ was used for both methods, which was tuned to yield the best-case results for the unpriorconditioned reference method. Again, we also test $\tau = 0$, demonstrating that the proposed algorithm is very robust with respect to the choice of $\tau$. Note that for $\tau = 0$ the reference method amounts to the solution of the unregularized linear least squares problem with LSQR. As this least squares problem is ill-posed, LSQR is stopped with criterion S4.
We monitor the value of the penalty $R(f^k)$ of the intermediate solutions $f^k$ and terminate the lagged diffusivity (outer) iteration in algorithm 3 when $R(f^k)$ ceases to decrease. For the reference iterated LSQR method, the functional $R(f^k)$ stagnates or even starts increasing after the initial five outer iterations while the error norm keeps decreasing, which could be due to the increasing condition number of $M_{f^k}$. Instead, 25 lagged diffusivity steps were taken to ensure convergence of the reference method for $\tau = 10^4$. The resulting reconstructions are depicted in figures 7 and 8. The two unpriorconditioned solutions suffer from overshooting and spurious oscillations, while their priorconditioned counterparts have the correct shape and estimate the value of the fluorescence yield coefficient $h$ more accurately. The error norms in each lagged diffusivity iteration are plotted in figure 9(a). The iterated MLSQR attains error norms of 0.5428 and 0.5407 for $\tau = 10^4$ and $\tau = 0$, respectively, while the reference method attains 0.7158 and 1.5025. Figure 9(b) shows the number of MLSQR iterations taken within each lagged diffusivity step. We chose to limit the number of inner iterations for the priorconditioned algorithm to a maximum of 20, in accordance with the discussion in section 4.1. This limit was hit in the first 6 ($\tau = 10^4$) and 5 ($\tau = 0$) lagged diffusivity iterations, while MLSQR was stopped by the Morozov criterion in fewer than 20 iterations in the following lagged diffusivity steps. In fact, the motivation for limiting the maximum number of MLSQR iterations is that the first one or two lagged diffusivity steps would need a disproportionate number of MLSQR iterations to reach the residual level suggested by the Morozov discrepancy principle. The limit on Krylov iterations alleviates this issue without a noticeable effect on the final reconstruction. While after the initial phase the number of MLSQR iterations decreases to 9 for the priorconditioned algorithm, the unpriorconditioned variant requires an increasing number of Krylov steps, which we attribute to the growing condition number of the regularizer $M_{f^k}$ over the lagged diffusivity iterations. Figure 9(c) shows the evolution of the penalty term $R(f)$ over the lagged diffusivity iterations. In the first outer iteration, the two priorconditioned solutions show a pronounced difference. The initial preconditioner is a discretization of the Laplacian and contains no information about the edges of the solution. For this reason, the damping flattens the first outer iterate particularly strongly, and consequently the associated penalty $R(f) = \int_\Omega |\nabla f(x)|\,\mathrm{d}x$ is of smaller magnitude than it is for the undamped variant. Notice that a similar flattening takes place for the unpriorconditioned method as well. Once the preconditioner contains some information about the edges, the behaviour of both priorconditioned cases is qualitatively the same. Apart from the first lagged diffusivity iteration, the penalty decreases over the lagged diffusivity iterations, providing further evidence to support the penalty-based stopping criterion of algorithm 3.

Conclusions
In this paper, we considered the efficient solution of large-scale linear ill-posed inverse problems with nonlinear regularization. We devised a highly efficient matrix-free algorithm for the solution of such problems, combining a lagged diffusivity fixed point iteration with priorconditioning of Krylov methods. Priorconditioning affords a way of embedding the information contained in the prior directly into the forward operator, resulting in highly accelerated convergence of Krylov methods. A novel factorization-free preconditioned LSQR algorithm was presented for solving the linear priorconditioned problem, which allows an implicit application of the preconditioner through efficient schemes such as multigrid. This is of particular interest for problems formulated on unstructured grids, where the preconditioner naturally occurs in an unfactorized form and the factorization is computationally infeasible. Furthermore, the relevant regularizers arise as discretizations of elliptic partial differential equations, for which approaches like multigrid have been extensively studied and applied. The presented algorithm is matrix-free, i.e. capable of solving problems where the forward mapping cannot be computed and/or stored explicitly as a matrix. In particular, in such cases the cost of the application of the forward and adjoint mappings in each step of the Krylov method far exceeds the additional cost of preconditioning, while the use of priorconditioning reduces the number of Krylov iterations to a fraction of those needed without priorconditioning.
We demonstrated the effectiveness of our algorithm on the 3D fDOT image reconstruction problem. For truly large-scale problems, for which the diffusion equations involved in the forward and adjoint mappings have to be solved iteratively, the cost of applying the forward and adjoint mappings is typically hundreds of times higher than the cost of applying the preconditioner, making the reduction in the number of Krylov iterations paramount.
The proposed method opens an interesting question of using approximate inverses of regularizers as priorconditioners. Which approximations are the most effective priorconditioners, and how does the accuracy affect the convergence? What is the interpretation of such priorconditioners in terms of prior distributions? Finally, the present paper deals with ill-posed problems where the forward mapping is linear and the nonlinearity is limited to the regularization term. We intend to investigate the extension of these ideas to fully nonlinear inverse problems such as electrical impedance tomography and diffuse optical tomography.