Preconditioners for Krylov subspace methods: An overview

When simulating a mechanism from science or engineering, or an industrial process, one is frequently required to construct a mathematical model, and then solve this model numerically. If accurate numerical solutions are necessary or desirable, this can involve solving large-scale systems of equations. One major class of solution methods is that of preconditioned iterative methods, involving preconditioners which are computationally cheap to apply while also capturing information contained in the linear system. In this article, we give a short survey of the field of preconditioning. We introduce a range of preconditioners for partial differential equations, followed by optimization problems, before discussing preconditioners constructed with less standard objectives in mind.


INTRODUCTION
Solving large linear systems Ax = b, where A ∈ C^{n×n} and b ∈ C^n, is a mainstay of scientific computing. Such systems arise in many applications, perhaps most frequently from the discretization of partial differential equations (PDEs) and optimization problems, and their solution can be extremely time-consuming. Moreover, for certain time-dependent, nonlinear, or inverse problems, for instance, multiple linear systems must be solved, so a reduction in the linear solve time results in a huge increase in the efficiency of the overall numerical scheme.
For the solution of a wide class of linear systems, the use of direct methods is both popular and powerful (see [1] for a survey of such methods for sparse systems). Here, a sequence of operations is performed on the matrix in order to generate a suitable factorization, from which a single (finite-precision) solution of the system is obtained. However, if the matrix in question becomes sufficiently large for a particular computer architecture, the storage and operation costs of a direct method may become excessive. In such a case it becomes valuable to apply iterative methods, where no such factorization is required, in order to generate a sequence of approximate solutions. Such methods generally require matrix-vector multiplications at each iteration, but these may be performed without storing the entire matrix: one may work with individual blocks of the matrix, or indeed apply "matrix-free" approaches (see [2,3] for instance). Here, we focus specifically on Krylov subspace methods, which we detail in Section 2; for such methods A is indeed only accessed via matrix-vector multiplications. However, without the use of preconditioning (which we describe in greater detail in Section 2) the Krylov solver may not converge, or may take many iterations to return an acceptably accurate solution. In the best case, preconditioned iterative methods can substantially reduce the storage and operation costs compared with direct methods, while achieving greatly improved convergence properties that may often be proved a priori. Two fundamental goals when devising a preconditioner can therefore be summarized as follows:
• Facilitating rapid convergence of an iterative method.
• Guaranteeing a reduction in storage and computational operations compared with direct methods.
Accordingly, preconditioning allows for the solution of problems that cannot be tackled using either direct schemes or unpreconditioned iterative methods.
The design of such a preconditioner necessitates a trade-off, as one then needs to solve a linear system involving the preconditioner at least once per iteration. Hence, a preconditioner must be constructed such that it is cheap to apply its inverse operation to a given vector, while also capturing the characteristics of the original system in some sense. The second criterion will be explored further throughout this paper, as its interpretation depends on several factors including which iterative method is used.
There are already several excellent surveys of preconditioners in general, and for specific applications; we hope that this article complements these existing works. In particular, we mention the extensive surveys by Benzi [4] and Wathen [5], as well as the books by Bertaccini and Durastante [6] and Chen [7]. Further works surveying certain areas of preconditioning include [8][9][10][11]. A key structural difference in this paper is that we predominantly categorize by application area rather than by type of preconditioner. Throughout this text, we additionally point to surveys on individual topics.
We begin our discussion by describing the most widely used Krylov subspace methods in Section 2. We particularly emphasize the convergence theory available for these methods, since this can be used to determine preconditioners that fulfill the aims outlined above. Section 3 then details preconditioners appropriate for PDE problems, while Section 4 discusses preconditioners for linear systems that arise in optimization. Although preconditioning typically focuses on ensuring that the preconditioned matrix has "favorable" properties, there are also other ways to precondition, and reasons for doing so, as we illustrate through selected examples in Section 5. Finally, we summarize our key messages in Section 6.
Notation: Throughout, we use bold lowercase letters, for example x, to represent vectors, and uppercase letters, for example M, for matrices. We denote by A a matrix that is being solved for and by P a preconditioner, except when the block structure is important, in which case we use the calligraphic letters 𝒜 and 𝒫. In this case A can denote a sub-block of 𝒜. A general matrix is represented by M. We denote the Hermitian transpose of x by x^H and the transpose by x^T. If M ∈ C^{n×n} is Hermitian positive definite then it defines a norm, which we denote by ||x||_M := √(x^H M x). We will call a preconditioner optimal if it requires O(n) floating point operations to apply, and if the number of iterations required to attain a certain accuracy can be bounded independently of n. Whenever we cite several references consecutively, such as [4,5,7], this is a possibly nonexhaustive list and should be read as "see, for example, [4,5,7] and the references therein".
The first Krylov subspace method for linear systems, namely the conjugate gradient (CG) method, was proposed by Hestenes and Stiefel [21], with Lanczos also working on closely related topics in this period [22,23]. The CG method must terminate in at most n iterations in exact arithmetic. Thus, although Hestenes and Stiefel [21] and Lanczos [23] refer to CG and the related Lanczos method as iterative solvers, over the next two decades mathematicians (in contrast to engineers and physicists) considered CG a direct method. However, in this capacity it was often outperformed by other approaches. It was only after John Reid's observation [24] that good approximations to x could be obtained in far fewer than n steps that Krylov methods were again viewed as iterative methods.¹ Renewed interest followed, with new methods developed and analyzed. For Hermitian matrices the most popular are probably MINRES and SYMMLQ [26], while for non-Hermitian matrices such methods include GMRES [27], FOM [28], LSQR [29], LSMR [30], and families of methods based on QMR [31], BiCG [32] and BiCGStab [33], and IDR [34,35].
Krylov subspace methods for linear systems start from an initial guess x_0 (often the zero vector), compute the initial residual r_0 = b − A x_0, and then (typically) select iterates x_1, x_2, … such that

x_k ∈ x_0 + 𝒦_k(A, r_0), where 𝒦_k(A, r_0) := span{r_0, A r_0, …, A^{k−1} r_0},   (1)

thereby projecting the original problem to one of much smaller dimension. This approach is naturally linked to polynomial approximation, since x_k − x_0 ∈ 𝒦_k(A, r_0) ⇔ x_k = x_0 + q_{k−1}(A) r_0 for some q_{k−1} ∈ Π_{k−1}, with Π_k denoting the set of polynomials of degree at most k. Accordingly, the kth residual satisfies

r_k = b − A x_k = (I − A q_{k−1}(A)) r_0 = p_k(A) r_0, where p_k ∈ Π_k, p_k(0) = 1.
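To make the polynomial viewpoint concrete, the following pure-Python sketch (the matrix, right-hand side, and damping parameter are our own illustrative choices, not taken from this survey) verifies numerically that the residuals of the simple Richardson iteration x_{k+1} = x_k + ω r_k, whose residual polynomial is p_k(λ) = (1 − ωλ)^k, indeed satisfy r_k = p_k(A) r_0:

```python
# Verify r_k = p_k(A) r_0 for the Richardson iteration x_{k+1} = x_k + w*r_k,
# whose residual polynomial is p_k(lambda) = (1 - w*lambda)^k.
# (Illustrative matrix and parameter values; a sketch, not from the survey.)

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
w = 0.2                      # damping parameter omega
x = [0.0, 0.0, 0.0]          # x_0 = 0, so r_0 = b
r0 = b[:]

k = 5
r = r0[:]
for _ in range(k):
    x = [xi + w * ri for xi, ri in zip(x, r)]          # Richardson update
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]   # recompute residual

# Evaluate p_k(A) r_0 = (I - w A)^k r_0 directly.
p = r0[:]
for _ in range(k):
    p = [pi - w * qi for pi, qi in zip(p, matvec(A, p))]

err = max(abs(ri - pi) for ri, pi in zip(r, p))        # should be ~ rounding level
```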
We have not yet described how iterates are chosen from these Krylov subspaces, and indeed this choice distinguishes different methods. We will not discuss this in detail here, but will instead focus on the available convergence theory for these methods.

Hermitian problems
We first consider the CG method for Hermitian positive definite A, which is perhaps the best understood Krylov method.
CG chooses x_k according to (1) such that ||e_k||_A is minimized, where e_k = x − x_k. It requires only short-term recurrences, so that the computational work at each iteration is fixed. Crucially for preconditioning, we can bound ||e_k||_A as follows:

||e_k||_A / ||e_0||_A ≤ min_{p_k ∈ Π_k, p_k(0)=1} max_{λ ∈ σ(A)} |p_k(λ)|,

where σ(A) is the spectrum of A. This bound is sharp, in the sense that for every k there is a (possibly different) initial guess x_0 for which it is attained. That said, given a specific initial residual, that is, a specific right-hand side and initial guess, the convergence rate may be much faster than is predicted by the above bound. We refer to the book by Liesen and Strakoš [15, section 5.6] and the references therein for a careful CG convergence analysis.

When A is Hermitian but indefinite, we can instead apply MINRES, which also has fixed work per step and for which an analogous convergence bound can be derived. MINRES chooses x_k such that ||r_k||_2 is minimized subject to (1), and in this case

||r_k||_2 / ||r_0||_2 ≤ min_{p_k ∈ Π_k, p_k(0)=1} max_{λ ∈ σ(A)} |p_k(λ)|.   (2)

This bound is descriptive in the same sense as the CG bound mentioned above [18], but the eigenvalues of A may now also be negative.

We see from the above bounds that convergence of CG and MINRES is strongly related to the spectral properties of A. Since these are governed by the problem at hand, it is often desirable to improve the convergence rate by solving an equivalent linear system with "better" spectral properties, that is, by preconditioning. To preserve symmetry, for CG and MINRES we precondition symmetrically, that is, one may instead consider

C^{−1} A C^{−H} y = C^{−1} b,  x = C^{−H} y,

where C ∈ C^{n×n} is nonsingular. For this preconditioned system, the spectral properties of C^{−1} A C^{−H} become important, and by a careful choice of C it is hoped that convergence will be faster. We note that by rewriting the CG or MINRES algorithm it is possible to avoid computations with C and C^H and work only with the Hermitian positive definite matrix P = C C^H. However, the preconditioner for CG and MINRES must be Hermitian positive definite in general.
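As a small numerical illustration of why improving the spectral properties helps (a pure-Python sketch under our own assumptions: the badly scaled test matrix and the simple Jacobi choice P = diag(A) are illustrative, not taken from this survey), preconditioned CG applied with P needs far fewer iterations than unpreconditioned CG:

```python
# Sketch: CG versus Jacobi-preconditioned CG on a badly scaled SPD system.
# The preconditioner P = diag(A) clusters the spectrum of the preconditioned
# matrix, so far fewer iterations are needed.  (Illustrative matrix only.)
import math

n = 50
# SPD matrix: tridiagonal with -1 off-diagonals and a wildly varying diagonal.
d = [10.0 ** (4.0 * i / (n - 1)) for i in range(n)]   # entries from 1 to 1e4

def matvec(v):
    out = [(d[i] + 2.0) * v[i] for i in range(n)]
    for i in range(n - 1):
        out[i] -= v[i + 1]
        out[i + 1] -= v[i]
    return out

b = [1.0] * n

def pcg(apply_P_inv, tol=1e-10, maxit=1000):
    # Standard preconditioned CG; returns the iteration count at convergence.
    x = [0.0] * n
    r = b[:]
    z = apply_P_inv(r)
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    b_norm = math.sqrt(sum(bi * bi for bi in b))
    for k in range(1, maxit + 1):
        Ap = matvec(p)
        alpha = rz / sum(pi * qi for pi, qi in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        if math.sqrt(sum(ri * ri for ri in r)) < tol * b_norm:
            return k
        z = apply_P_inv(r)
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        beta = rz_new / rz
        rz = rz_new
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return maxit

iters_plain = pcg(lambda r: r[:])                                      # P = I
iters_jacobi = pcg(lambda r: [r[i] / (d[i] + 2.0) for i in range(n)])  # P = diag(A)
```

Note that, as discussed above, the algorithm only ever applies the inverse of P = C C^H; the factor C itself never appears.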

Non-Hermitian problems
GMRES is an extension of MINRES to non-Hermitian systems, and because it minimizes ||r_k||_2 it has the most complete convergence theory among solvers for such systems. If A is normal then (2) holds, although now the eigenvalues of A may be complex, and this makes the convergence of GMRES more challenging to understand. However, when A is far from normal, no convergence bound is universally descriptive [36], and the spectrum alone cannot characterize the convergence rate of GMRES [37-40]. A major disadvantage of GMRES is that the computational work grows with each iteration; one option is to restart GMRES [16, chapter 6]. Alternatively, we may use a method based on bi-orthogonalization, such as QMR, BiCGStab, or IDR. However, none of these methods has a tractable optimality property, and so convergence results are extremely limited. Thus, for non-Hermitian, and particularly nonnormal, problems preconditioning is heuristically motivated although, as we shall describe, this has not prevented the development of many effective preconditioners for a range of such systems. We note that for non-Hermitian problems it is possible to precondition on the left, on the right, or on both sides. A key restriction is that the preconditioner must represent some fixed linear operator; if this is not the case, a flexible Krylov subspace method should be used [16, section 9.4].
We end this section by remarking that it is always possible to symmetrize a problem by, for example, solving the normal equations A H Ax = A H b. CG or MINRES can be applied to this symmetrized system, although in practice the more numerically stable but mathematically equivalent LSQR (for CG) or LSMR (for MINRES) are used. In this case we again recover descriptive convergence theory. On the other hand, convergence rates for problems solved using the normal equations are often significantly worse than for the original linear system, and preconditioning A H A (or AA H ) is often challenging; indeed it is possible that P H P is an arbitrarily poor preconditioner for A H A even if P is a suitable preconditioner for A [41].
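The deterioration caused by the normal equations is easy to see in the simplest possible setting (an illustrative sketch with singular values of our own choosing): squaring A squares its condition number, which typically slows Krylov convergence correspondingly.

```python
# Sketch: solving the normal equations squares the condition number.
# For A = diag(sigma_1, ..., sigma_n), kappa_2(A) = sigma_max / sigma_min,
# while A^T A = diag(sigma_i^2) has kappa_2(A^T A) = (sigma_max / sigma_min)^2.
# (Illustrative singular values, not from the survey.)
sigmas = [1.0, 3.0, 10.0]
kappa_A = max(sigmas) / min(sigmas)
kappa_AtA = max(s * s for s in sigmas) / min(s * s for s in sigmas)
```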

PRECONDITIONERS FOR PDES AND RELATED PROBLEMS
Often, linear systems arise upon the discretization of an infinite-dimensional problem, for example differential equations and PDEs, or integral equations. Here we focus on second-order PDEs, but some of the underlying themes and principles can be applied to, or adapted for, discretizations of other infinite-dimensional problems. An important point is that solving the linear system is not an end in itself; instead, the overarching goal is to approximate the solution to the infinite-dimensional problem. The accuracy of this solution is affected by the discretization error as well as the error in terminating the iterative solver (the algebraic error). The interplay between these errors [42,43], and their effect on the stopping criteria [44][45][46], should be considered, although this is typically a nontrivial task.
The convergence rates of iterative methods for discretized problems are influenced both by the properties of the infinite-dimensional problem and the choice of discretization. Knowledge of the properties of the underlying PDE (particularly the weak form) has been exploited to develop so-called "operator" preconditioners [47][48][49][50][51][52] [53, chapter 2]; for an overview of these, see [54, chapter 4] [55]. The recent paper [56] also treats operator preconditioners in infinite-and finite-dimensional settings and describes an abstract framework for operator preconditioners in terms of a splitting of a Hilbert space into a finite number of subspaces. Other preconditioners that, at least in their original forms, are based on knowledge of the underlying differential equation include multigrid and domain decomposition, which we discuss in more detail below.
Discretization choices also affect the properties of the coefficient matrix, and hence the convergence rate of the Krylov subspace method. In particular, PDE problems require global transfer of information but popular discretizations, for example finite element, finite difference, and finite volume methods, lead to sparse matrices that transfer information only locally. Preconditioning PDE problems can be viewed as transforming to a more appropriate basis [54, chapter 8]. This perspective has also been used more explicitly in hierarchical basis preconditioning [57,58] (see also the survey by Xu [59]), and cardinal basis preconditioning for radial basis discretizations [60,61], which were originally introduced in the context of interpolation [62].
In this section we discuss preconditioners for some prototypical PDEs; these PDEs either arise on their own or, increasingly commonly, as key components of more complicated, possibly nonlinear PDE problems. These include the Poisson, Helmholtz, and convection-diffusion equations, which represent problems involving a single scalar-valued variable, and the Stokes and Navier-Stokes equations, which are examples of systems of PDEs. We also discuss specialized methods for Toeplitz matrices, time-dependent problems, and problems with heterogeneous coefficients.

Poisson problems
The first equation we consider is the generalized Poisson problem: find u such that

−∇ · (κ ∇u) = f in Ω,   (3)

subject to appropriate boundary conditions, where κ is a sufficiently smooth positive function that is both bounded above and uniformly bounded away from zero on Ω, and Ω ⊂ R^d, d ∈ {2, 3}, is a bounded connected domain with piecewise smooth boundary ∂Ω. (The standard Poisson problem is recovered when κ ≡ 1.) Equation (3) is elliptic and self-adjoint, and arises in many applications in, for example, chemistry, astrophysics, fluid dynamics, solid mechanics, electromagnetics, and image processing. It is also a subproblem of more complicated PDEs, such as the Navier-Stokes equations. Discretization by standard numerical methods leads to a linear system Ax = b, where A ∈ R^{n×n} is symmetric and at least positive semidefinite. When (3) has separable coefficients on rectangular domains, it may be discretized by finite difference methods and then solved by tailored direct methods known as fast Poisson solvers [63][64][65][66]. However, for more complicated problems, domains, meshes, or discretizations, the preconditioned CG method is usually the method of choice. In these cases, fast Poisson solvers may be used as preconditioners, although their effectiveness deteriorates as the complexity of the problem increases. These fast Poisson solver preconditioners can also be viewed as operator preconditioners that approximate (−∇²)^{−1}, see, for example, [55]. Recently, it was shown that when Laplace preconditioners are applied to finite element discretizations of (3) with nonconstant κ, the eigenvalues of the preconditioned matrix are approximated by nodal values of κ [67].
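The algebra behind fast Poisson solvers can be sketched in one dimension (a pure-Python illustration under our own assumptions; a genuinely fast solver would use an FFT-based sine transform rather than the naive O(n²) transforms below): the matrix tridiag(−1, 2, −1) is diagonalized by the discrete sine transform, so a solve reduces to two transforms and a diagonal division.

```python
# Sketch of the "fast Poisson solver" idea in 1D: T = tridiag(-1, 2, -1)
# is diagonalized by the discrete sine transform, so T x = b is solved by
# transforming b, dividing by the eigenvalues, and transforming back.
# (Naive O(n^2) transforms for clarity; a real fast solver uses an FFT.)
import math

n = 16
# Eigenvalues of T: lambda_j = 2 - 2 cos(j*pi/(n+1)), j = 1..n.
lam = [2.0 - 2.0 * math.cos((j + 1) * math.pi / (n + 1)) for j in range(n)]

def dst(v):
    # Unnormalized discrete sine transform; applying it twice gives (n+1)/2 * v.
    return [sum(v[i] * math.sin((i + 1) * (j + 1) * math.pi / (n + 1))
                for i in range(n)) for j in range(n)]

b = [1.0] * n
bhat = [2.0 / (n + 1) * c for c in dst(b)]      # coefficients of b in the eigenbasis
x = dst([bhat[j] / lam[j] for j in range(n)])   # divide by eigenvalues, transform back

# Check the residual b - T x directly.
Tx = [2.0 * x[i] - (x[i - 1] if i > 0 else 0.0) - (x[i + 1] if i < n - 1 else 0.0)
      for i in range(n)]
res = max(abs(bi - ti) for bi, ti in zip(b, Tx))
```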

Multigrid
More robust preconditioners for elliptic, and indeed hyperbolic and parabolic, PDEs include multigrid methods, which are iterative solvers in their own right. Standard multigrid methods are based on the observation that when stationary iterations such as Jacobi or Gauss-Seidel are applied to problems from PDEs, in many cases they quickly reduce high-frequency components of the error but are less effective at removing smooth, low-frequency, components. Multigrid exploits this, and a sequence of smaller linear systems, to accelerate the convergence of these stationary iterations.
Multigrid is more easily understood by first considering a two-grid scheme. Within this two-grid iteration, a few steps of a stationary iterative method result in the approximation x̃ to x, with smooth error e = x − x̃. We then improve x̃ by approximately solving the associated residual equation Ae = r. Specifically, since e, and hence r, are smooth, they are well represented by a smaller (coarse grid) problem A_c e_c = r_c that is more easily solved. Projecting e_c back to the fine grid then gives ê ≈ e, and we obtain the improved approximation x ≈ x̃ + ê. Applying the stationary iterative method again at this stage (to post-smooth) is optional.
In multigrid, the process of smoothing and projecting is repeated recursively until we reach a problem that is small enough to be solved by a direct method; to obtain ê we then interpolate this approximate solution on successively finer grids by traversing the grid hierarchy in the opposite direction. The motivation for this repeated grid transfer is that low-frequency error components for the original problem become high-frequency components when represented on this coarse grid, which can again be smoothed by the stationary iteration. If a single two-grid iteration is performed at each recursive level we obtain a V-cycle; if two such iterations are performed on each recursive level we instead obtain a W-cycle, see, for example, [68, section 2.4].
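The two-grid cycle described above can be sketched in a few dozen lines of pure Python for the 1D Poisson matrix (all parameters here, damped Jacobi with weight 2/3, two pre- and post-smoothing sweeps, full-weighting restriction, and linear interpolation, are standard illustrative choices of ours, not prescriptions from this survey):

```python
# Minimal two-grid sketch for T = tridiag(-1, 2, -1) in 1D: damped-Jacobi
# smoothing, full-weighting restriction of the residual, an exact coarse
# solve, linear-interpolation prolongation, and the correction x <- x + e_hat.

n = 31                      # fine grid size (odd, so the coarse grid has (n-1)/2 points)
nc = (n - 1) // 2

def apply_T(v):
    m = len(v)
    return [2.0 * v[i] - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < m - 1 else 0.0) for i in range(m)]

def jacobi(x, b, sweeps=2, w=2.0 / 3.0):
    # Damped Jacobi smoothing; diag(T) = 2.
    for _ in range(sweeps):
        r = [bi - ti for bi, ti in zip(b, apply_T(x))]
        x = [xi + w * ri / 2.0 for xi, ri in zip(x, r)]
    return x

def tridiag_solve(b):
    # Thomas algorithm for tridiag(-1, 2, -1).
    m = len(b)
    c, d = [0.0] * m, [0.0] * m
    c[0], d[0] = -0.5, b[0] / 2.0
    for i in range(1, m):
        piv = 2.0 + c[i - 1]
        c[i] = -1.0 / piv
        d[i] = (b[i] + d[i - 1]) / piv
    x = [0.0] * m
    x[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def two_grid(x, b):
    x = jacobi(x, b)                                       # pre-smooth
    r = [bi - ti for bi, ti in zip(b, apply_T(x))]
    rc = [0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2]
          for i in range(nc)]                              # full-weighting restriction
    ec = tridiag_solve([4.0 * v for v in rc])              # Galerkin coarse matrix is T_c / 4
    e = [0.0] * n                                          # linear interpolation back
    for i in range(nc):
        e[2 * i] += 0.5 * ec[i]
        e[2 * i + 1] += ec[i]
        e[2 * i + 2] += 0.5 * ec[i]
    x = [xi + ei for xi, ei in zip(x, e)]                  # coarse grid correction
    return jacobi(x, b)                                    # post-smooth

b = [1.0] * n
x_star = tridiag_solve(b)       # reference solution
x = [0.0] * n
errs = []
for _ in range(5):
    x = two_grid(x, b)
    errs.append(max(abs(xi - si) for xi, si in zip(x, x_star)))
```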
Multigrid ideas appeared in early papers [69,70], but the seminal work by Brandt [71] is widely regarded as the foundation of modern multigrid methods. These solvers were geometric multigrid (GMG) methods, using a sequence of meshes for Ω. GMG is convenient for simpler problems and structured grids for which interpolation and restriction of the error are more straightforward.
For complex domains, unstructured meshes, and/or heterogeneous coefficients, it can be challenging to develop effective GMG methods. To overcome this, algebraic multigrid (AMG) methods were developed from the 1980s onwards [72][73][74][75][76]. The key distinction is that coarsening is performed automatically using only matrix entries, in a way that mimics GMG methods. AMG is now widely used, not only for problems that are challenging for GMG, but also to avoid programming problem-specific solvers.
Multigrid methods have been the subject of much research over recent decades and further details can be found in, for example, [68,[77][78][79][80]. (Linear) multigrid is itself a (complicated) stationary iterative method; for many simpler problems it is also an optimal solver and in these cases it can be used on its own. For more complicated problems multigrid may converge slowly, if at all, because some error modes are not adequately damped. In this situation it is much more effective to use a fixed number of multigrid iterations as a preconditioner for the CG method, provided the multigrid method represents a symmetric positive definite preconditioner, for example a V-cycle or W-cycle with smoothing during the interpolation phase that is symmetric with respect to the smoothing used when coarsening. Note that using a variable number of multigrid iterations, or an otherwise nonlinear method, results in a nonlinear preconditioner that should not be used with standard Krylov subspace methods.

Domain decomposition
Domain decomposition methods also replace the original Poisson linear system by smaller problems but, in contrast to multigrid methods, they divide the domain into subregions and solve (3) on each. Doing so requires the imposition of artificial boundary conditions, called transmission conditions, on subdomain boundaries (except where the subdomain boundary is part of ∂Ω). These transmission conditions allow the transfer of information between subdomains. While subdomain solves originally corresponded to physical regions of Ω, it is also possible to specify subdomains using subsets of rows and columns of A.
The first domain decomposition method was proposed by Schwarz in 1870 [81] to prove existence and uniqueness of solutions to (3) on complicated domains by dividing them into two simpler subdomains. This alternating Schwarz method solves each subdomain problem in turn, using the most up-to-date solution information to impose Dirichlet boundary conditions. Multiplicative Schwarz methods generalize this idea, but additive Schwarz and related methods [82,83] are generally preferred; one reason for this is that they can be more easily parallelized. In particular, the popular restricted additive Schwarz method adds the solution to each overlapping subdomain problem, using a partition of unity in overlap regions.
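Alternating Schwarz is simple enough to demonstrate on the 1D Poisson system (a pure-Python sketch; the subdomain sizes, overlap, and iteration count are our own illustrative choices): two overlapping index blocks are solved in turn, each using the latest values from the other block as artificial Dirichlet (transmission) data.

```python
# Sketch of the alternating Schwarz method for T x = b, T = tridiag(-1, 2, -1):
# subdomain 1 = indices 0..hi1-1, subdomain 2 = lo2..n-1, overlapping in between.
# Each subdomain solve uses the other block's current values as Dirichlet data.

n = 40
b = [1.0] * n
lo2, hi1 = 15, 25            # overlap is indices 15..24

def tridiag_solve(rhs):
    # Thomas algorithm for tridiag(-1, 2, -1).
    m = len(rhs)
    c, d = [0.0] * m, [0.0] * m
    c[0], d[0] = -0.5, rhs[0] / 2.0
    for i in range(1, m):
        piv = 2.0 + c[i - 1]
        c[i] = -1.0 / piv
        d[i] = (rhs[i] + d[i - 1]) / piv
    y = [0.0] * m
    y[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        y[i] = d[i] - c[i] * y[i + 1]
    return y

x = [0.0] * n
for _ in range(30):
    # Subdomain 1: unknowns 0..hi1-1; transmission datum x[hi1] moves to the RHS.
    rhs = b[:hi1]
    rhs[-1] += x[hi1]
    x[:hi1] = tridiag_solve(rhs)
    # Subdomain 2: unknowns lo2..n-1; transmission datum x[lo2-1] moves to the RHS.
    rhs = b[lo2:]
    rhs[0] += x[lo2 - 1]
    x[lo2:] = tridiag_solve(rhs)

x_star = tridiag_solve(b)    # whole-domain solve for reference
err = max(abs(xi - si) for xi, si in zip(x, x_star))
```

With this much overlap the iteration contracts the error geometrically, consistent with the local information transfer discussed below: less overlap would give markedly slower convergence.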
Additive Schwarz, multiplicative Schwarz, and related methods are also stationary iterative schemes that can be used as standalone solvers provided they converge [84,85]. However, local transfer of information means that their convergence rates can be slow, and so they are usually supplemented by some sort of global coarse grid solve. This global solve may additionally deal with modes that impede convergence [86][87][88].
Like multigrid, for complex problems domain decomposition methods are often much more effective as preconditioners. There are many such preconditioners available that differ in their choice of subdomains, overlap, transmission conditions, and coarse grid solves. Domain decomposition methods remain an active research area, and we recommend several books and surveys [89][90][91][92][93][94][95].

Helmholtz problems
A seemingly small modification of the Poisson equation (3) gives rise to the Helmholtz problem

−∇²u − k²u = f in Ω,   (4)

again augmented with appropriate boundary conditions, with k real-valued and Ω ⊂ R^d, d ∈ {2, 3}, a bounded connected domain with piecewise smooth boundary ∂Ω. This important PDE features in, for example, wave propagation problems and quantum mechanics, with oscillatory problems, characterized by moderate to large values of k, typically of interest. Once discretized, (4) leads to a linear system Ax = b, where A = K − k²M ∈ R^{n×n}, with K the discretized negative Laplacian and M, for example, the identity matrix or a finite element mass matrix. A number of issues make the solution of this system challenging. First, complicated boundary conditions are often required for well-posedness and/or to mimic an infinite domain, and these can make A non-Hermitian, and possibly complex. Second, fine meshes are usually required to resolve the oscillatory solution and to avoid the "pollution effect", that is, numerical dispersion; this results in extremely large linear systems. Finally, even if A is Hermitian, as we consider here, it is almost always indefinite. In fact, A is singular if k² is an eigenvalue of M^{−1}K, that is, in the presence of resonances, although MINRES, for example, can be applied to singular but consistent systems.
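The onset of indefiniteness can be checked in closed form in one dimension (a sketch under our own assumptions: finite differences on the unit interval with M = I, so the eigenvalues of K are known analytically): as soon as k² exceeds the smallest eigenvalue of K, the matrix A = K − k²M acquires negative eigenvalues.

```python
# Sketch: the 1D discrete Helmholtz matrix A = K - k^2 I becomes indefinite
# once k^2 exceeds the smallest eigenvalue of K.  Here K = (1/h^2)*tridiag(-1,2,-1)
# on n interior points of (0, 1), whose eigenvalues are (2 - 2 cos(j*pi*h))/h^2.
import math

n = 100
h = 1.0 / (n + 1)

def eigs_A(k):
    return [(2.0 - 2.0 * math.cos(j * math.pi * h)) / h ** 2 - k ** 2
            for j in range(1, n + 1)]

# Smallest eigenvalue of K is close to pi^2, so k = 1 gives a definite matrix,
# while k = 50 sits well above several resonances.
n_neg_small_k = sum(1 for lam in eigs_A(1.0) if lam < 0.0)
n_neg_large_k = sum(1 for lam in eigs_A(50.0) if lam < 0.0)
```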
Standard preconditioners for elliptic problems generally perform poorly on discretizations of the indefinite Helmholtz problem. Multigrid preconditioners may struggle to smooth errors, and the oscillatory nature of the problem limits the sizes of coarse grids [96]; however, tailored multigrid [97][98][99] and multilevel [100] methods attempt to remedy these issues. Domain decomposition methods also require careful treatment, for example a sufficiently fine coarse grid [101], special coarse grid solves [102,103], and/or modified transmission conditions [104][105][106].
Instead of approximating A directly, it is possible to approximate a "nicer" elliptic operator, such as the negative Laplace operator [107] or the shifted operator −∇² + k²I [108], by standard multilevel approximations. Unfortunately, the performance of these preconditioners deteriorates as k increases, as might be expected. More general "shifted Laplace" preconditioners, based on −∇² + αI with α complex, typically perform better [109][110][111], but an appropriate choice of the shift α is important for efficiency [112,113]; small shifts reduce the number of Krylov subspace iterations but make the preconditioner itself difficult to approximate well. An additional issue is that a complex shift makes the problem complex and non-Hermitian. It is possible to instead solve a real block system, but this doubles the problem dimension and two elliptic subproblems must be solved per iteration [114].
Sweeping preconditioners have also received much attention of late. These incomplete (block) factorization preconditioners sweep through the variables in layers, using information about the underlying PDE to approximate Schur complements [115,116] (see the survey [117]). Recently, a different approach was proposed that utilizes contour integrals to project the spectrum of A, transforming the original problem to one that is easier to solve [118]. Surveys of Helmholtz preconditioners can be found in [119,120].
The key point is that the preconditioners of choice for the Helmholtz problem are not necessarily the same as for the Poisson problem. This is essentially because features of the PDE (oscillations, and loss of ellipticity) strongly influence properties of A, which must then be accounted for when designing preconditioners.

Convection-diffusion problems
Non-self-adjoint PDEs lead to linear systems with non-Hermitian coefficient matrices that must be solved by one of the nonsymmetric Krylov methods introduced in Section 2. Among the best-studied PDEs of this type is the convection-diffusion equation

−ϵ∇²u + w · ∇u = f in Ω ⊂ R^d, d ∈ {2, 3},

with appropriate boundary conditions, where ϵ > 0 is the diffusion coefficient and w the convective velocity (the "wind"). One reason for this interest is that the Krylov subspace residual norms ||r_k||_2 for unpreconditioned linear systems arising from convection-diffusion problems are often characterized by an initial period of stagnation, during which the residual norm reduction is extremely slow, followed by a second, faster convergence phase. This two-phase behavior is challenging to characterize using convergence bounds, and may be poorly described by the eigenvalues and eigenvectors alone [15, section 5.7] [121, chapter 26]. (We stress, however, that there are many linear systems with nonnormal coefficient matrices for which eigenvalues and eigenvectors do describe the convergence behavior.)

Preconditioning is usually vital to reduce or remove this stagnation, and accelerate convergence. However, the nonnormality of the coefficient matrix makes it difficult to know a priori what should be achieved by a preconditioner. Despite this, many successful preconditioners have been proposed. For mildly convective problems, these may be based on the symmetric part of the coefficient matrix, that is, the part associated with the negative Laplace operator [122,123]. For convection-dominated problems, however, it is necessary to incorporate the nonsymmetric part. For example, the negative Laplace preconditioner can be replaced by stationary iterative methods that split the coefficient matrix into its Hermitian and skew-Hermitian parts [124,125].
Even the (block) Gauss-Seidel iteration may make for an effective preconditioner if variables are ordered in the direction of the flow [126,127] [128, chapter 7], and reordering is at the heart of incomplete factorization [129] and splitting preconditioners [130,131] for the convection-diffusion problem. For more complicated flows, particularly recirculation problems, it is challenging to order the variables well, so many preconditioners are based on simplifying the convection-diffusion operator [132,133]. A completely different option is to employ a matrix-equation solver for separable-coefficient problems as a preconditioner [134,135].
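The effect of flow-aligned ordering is easy to demonstrate in one dimension (a pure-Python sketch; the upwind-type stencil and parameter values are our own illustrative choices, not from this survey): for a convection-dominated matrix, a Gauss-Seidel sweep in the flow direction converges almost immediately, because as the diffusion vanishes the matrix becomes lower triangular in that ordering, whereas sweeping against the flow takes far longer.

```python
# Sketch: Gauss-Seidel on a 1D upwind-type convection-diffusion matrix
# A = tridiag(-(eps + 1), 2*eps + 1, -eps), flow from left to right.
# Sweeping WITH the flow (forward) converges far faster than sweeping
# against it, since for eps -> 0 the matrix is lower triangular in flow order.

n = 60
eps = 0.01
lower, diag, upper = -(eps + 1.0), 2.0 * eps + 1.0, -eps
b = [1.0] * n

def gs_iterations(forward, tol=1e-10, maxit=10000):
    x = [0.0] * n
    order = range(n) if forward else range(n - 1, -1, -1)
    for k in range(1, maxit + 1):
        for i in order:                      # one Gauss-Seidel sweep
            s = b[i]
            if i > 0:
                s -= lower * x[i - 1]
            if i < n - 1:
                s -= upper * x[i + 1]
            x[i] = s / diag
        res = 0.0                            # residual check
        for i in range(n):
            r = b[i] - diag * x[i]
            if i > 0:
                r -= lower * x[i - 1]
            if i < n - 1:
                r -= upper * x[i + 1]
            res = max(res, abs(r))
        if res < tol:
            return k
    return maxit

iters_with_flow = gs_iterations(forward=True)
iters_against_flow = gs_iterations(forward=False)
```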
Multigrid preconditioners can also be effective, although advection causes difficulties that are not present for the Poisson problem. Gauss-Seidel smoothers can work well, but again for best effect the variables should be ordered according to the flow direction, or multiple sweeps in different directions should be used [80,128]. More generally, careful ordering of unknowns can improve multigrid preconditioners [136, section 10.4.3] [137,138]. In geometric multigrid methods, it is also important to avoid oscillations resulting from unstable coarse grid discretizations, which require more smoothing steps to remove. Options for dealing with oscillations include semi-coarsening [139,140], matrix-dependent transfer operators [68, section 7.7.5] [80, section 5.4], upwinding [137,141], or rediscretizing the coarse grid operator at each level using an appropriate stable discretization [142]. Robust multigrid methods for the Navier-Stokes equations also tackle the difficult problem of building a scalable preconditioner for a convection-diffusion operator [143,144]. AMG methods can make good preconditioners for convection-diffusion problems [145][146][147] by automatically identifying layers and/or semi-coarsening. On the other hand, it may take many levels to reach an acceptably small "coarse grid" problem [147].

Systems of PDEs
Systems of PDEs can arise for a number of reasons, the most obvious cause being interactions between different components, for example in fluid-structure interactions or the magnetohydrodynamics equations. Such systems also naturally occur when there are interactions between different quantities, for example the pressure and velocity of a fluid in the Navier-Stokes equations. Alternatively, it may be more convenient to write a single high-order equation as a system of lower-order equations; a classical example is the biharmonic equation, which can be reformulated in terms of two second-order equations.² Systems may also arise when it is helpful to separate different components of the discretized problem, for example in finite element discretizations where it can be advantageous to group variables corresponding to different types of basis functions [149]. Alternatively, it may be advantageous to distinguish equations corresponding to boundary nodes, which may exhibit different behavior, or may be discretized differently [150].

Navier-Stokes problems
We start with coefficient matrices arising from general PDE systems. As an example we consider the primitive variables formulation of the steady incompressible Navier-Stokes equations, which couple the velocity u and pressure p of a fluid via the following equations:

−ν∇²u + (u · ∇)u + ∇p = f,   ∇ · u = 0,   (5)

subject to appropriate boundary conditions, on a bounded connected domain Ω ⊂ R^d, d ∈ {2, 3} [128, chapter 8]. The parameter ν > 0 here represents the kinematic viscosity.
Since the Navier-Stokes equations are nonlinear, linearization is required, with popular choices including Newton and Picard iteration. We consider Picard iteration here, which gives rise to a sequence of Oseen problems. Care is also needed when discretizing (5) to ensure that the resulting method is stable; this is related to the inf-sup, or Ladyzhenskaya-Babuška-Brezzi (LBB), conditions [128, chapter 3]. If a stable discretization is used, such as an LBB-stable finite element pair or the marker-and-cell finite difference scheme, then each Oseen problem is a saddle point system of the form

$$\mathcal{A} \begin{bmatrix} u \\ p \end{bmatrix} = \begin{bmatrix} f \\ g \end{bmatrix}, \qquad \mathcal{A} = \begin{bmatrix} A & B^T \\ B & 0 \end{bmatrix}. \qquad (6)$$

Here, A ∈ R^{n×n} contains the diffusive and convective terms while B ∈ R^{m×n}, m ≤ n, represents the discrete negative divergence. We assume for simplicity that A is invertible and that B has full rank. (For conditions on the invertibility of 𝒜 see [151, section 3.2].) Note that the (2, 2) block in (6) may be replaced by a negative semi-definite matrix to stabilize a discretization, and many of the preconditioners we discuss can be adapted to deal with this structure. One widely used option is to precondition 𝒜 monolithically, for example via a multilevel method such as multigrid or domain decomposition [151, section 11]. Alternatively, given the saddle point form of the Oseen equations, a number of preconditioners described in Section 4 can also be applied to (6). Provided ν is large enough, preconditioners designed for the Stokes equations (cf Section 3.2.2) might also be suitable.
Yet another option is to exploit the block structure of 𝒜 when designing preconditioners. For problems in which different blocks are associated with different physical quantities, or different PDEs, block preconditioners are often known as physics-based preconditioners. These block preconditioners may take advantage of the specific problem structure by, for example, separating different spatial dimensions [152,153]. Alternatively, many popular preconditioners are based on the following "ideal" choices that exploit only the saddle point structure [154]:

$$\mathcal{P}_u = \begin{bmatrix} A & B^T \\ 0 & -S \end{bmatrix}, \qquad \mathcal{P}_l = \begin{bmatrix} A & 0 \\ B & -S \end{bmatrix},$$

where S = BA^{-1}B^T denotes the (negative) Schur complement. Related block diagonal preconditioners can also be applied, but we defer discussion of these to Section 3.2.2. Both 𝒫_u and 𝒫_l are ideal in the sense that a method like preconditioned GMRES will converge in at most two iterations with either choice. On the other hand, the size of A makes forming S and applying 𝒫_u or 𝒫_l too costly in general, and so many practical preconditioners are based on approximating A and S [128,155-157]. We stress that this block triangular approach can be applied to any saddle point problem, for example porous media flow [158,159].
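To illustrate numerically why the block triangular choice is called ideal, the following sketch builds a small dense saddle point matrix with randomly generated A and B standing in for a discretized problem (an illustrative assumption, not a real discretization), and verifies that the right-preconditioned operator T = 𝒜𝒫_u^{-1} satisfies (T − I)^2 = 0, which is why GMRES terminates in at most two iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3
A = rng.standard_normal((n, n)) + n * np.eye(n)   # invertible stand-in for the (1,1)-block
B = rng.standard_normal((m, n))                    # full row rank (generically)
S = B @ np.linalg.solve(A, B.T)                    # (negative) Schur complement

Acal = np.block([[A, B.T], [B, np.zeros((m, m))]])
Pu = np.block([[A, B.T], [np.zeros((m, n)), -S]])  # "ideal" block upper triangular choice

T = Acal @ np.linalg.inv(Pu)                       # right-preconditioned operator
E = T - np.eye(n + m)
min_poly_residual = np.linalg.norm(E @ E)          # (T - I)^2 vanishes
```

In practice neither S nor these inverses would ever be formed explicitly; the computation is purely a check of the minimal polynomial property.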

Stokes problems
When advective inertial forces are small compared with viscous forces in (5) we can approximate the Navier-Stokes equations by the linear Stokes equations:

$$-\nu \nabla^2 u + \nabla p = f, \qquad \nabla \cdot u = 0,$$

on a bounded connected domain Ω ⊂ R^d, d ∈ {2, 3}, with appropriate boundary conditions. Discretization of these equations by, for example, an LBB-stable finite element pair or the marker-and-cell finite difference method will lead to a saddle point problem of the form (6) in which 𝒜 is symmetric. Specifically, A ∈ R^{n×n} is a discrete representation of the negative vector Laplacian and B ∈ R^{m×n}, m ≤ n, represents the discrete negative divergence (cf Section 3.2.1). We assume for simplicity that A is symmetric positive definite and that B has full rank although, as for the Navier-Stokes equations, these conditions may be relaxed [151].
The matrix  is symmetric but indefinite, making it somewhat challenging to solve (6). It is possible to instead solve the symmetric positive (semi-)definite Schur complement system where S = BA −1 B T is again the (negative) Schur complement (cf Section 4), but since our focus here is on dealing with systems of PDEs we instead focus on solving the full system (6). Given the similarity of the discretized Stokes equations with the Oseen problem, it is not surprising that many of the preconditioners discussed there can be adapted for the Stokes equations. However, since  is symmetric, it may be desirable to use a symmetric Krylov method, such as preconditioned MINRES, which requires a symmetric positive definite preconditioner. Doing so precludes certain monolithic and block triangular preconditioners.
When a positive definite preconditioner is required, a natural choice for (6) is a block diagonal preconditioner, such as [154,160]

$$\mathcal{P}_d = \begin{bmatrix} A & 0 \\ 0 & S \end{bmatrix}, \qquad (8)$$

for which preconditioned MINRES converges (in exact arithmetic) in at most three steps. On the other hand, as for the block triangular preconditioners for the Oseen problem, in practice A and S must be approximated. Since A is a vector Laplacian, methods for the Poisson equation (see Section 3.1.1) can be used in its approximation. Thus, as for many block problems, the challenge lies in approximating S. Fortunately, analysis (see, eg, [128, chapter 4]) shows that when an LBB-stable finite element pair is used, a good approximation to S is the mass matrix on the pressure space, Q, or an approximation of it, for example its diagonal, a lumped version, or a fixed number of steps of Chebyshev semi-iteration [161-163]. Additionally, this choice can be interpreted as an operator preconditioner [55]. For the marker-and-cell finite difference scheme, life is even easier, since we can replace S by the identity matrix. We note that the preconditioner (8) has been generalized to certain problems containing more blocks [164-167], and that in general block diagonal preconditioning is a popular choice for symmetric saddle point problems [168-171]. The operator preconditioning approach can also be more widely employed; for more examples, see [55].

One aspect that is not often taken into account by block preconditioners is the appropriate scaling of the different blocks. Scaling can affect the number of iterations needed for convergence [172,173]. More fundamentally, different equations are expressed in terms of different physical units, and these units may not be faithfully represented in the preconditioned system without scaling the blocks by relevant physical parameters, for example the viscosity in the Stokes equations [174]. This need becomes more obvious when dealing with varying parameters, as in two-phase flow [155,175].
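The three-step convergence bound rests on a classical spectral fact: with the exact Schur complement, the preconditioned matrix has only the three eigenvalues 1 and (1 ± √5)/2. A small dense numpy check, with a random symmetric positive definite A standing in for the vector Laplacian (an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # SPD stand-in for the discrete vector Laplacian
B = rng.standard_normal((m, n))    # full row rank (generically)
S = B @ np.linalg.solve(A, B.T)    # (negative) Schur complement, SPD here

Acal = np.block([[A, B.T], [B, np.zeros((m, m))]])
Pd = np.block([[A, np.zeros((n, m))], [np.zeros((m, n)), S]])

# With the exact S, the preconditioned matrix has only three distinct eigenvalues
evals = np.linalg.eigvals(np.linalg.solve(Pd, Acal))
targets = np.array([1.0, (1 + np.sqrt(5)) / 2, (1 - np.sqrt(5)) / 2])
max_dist = max(np.min(np.abs(ev - targets)) for ev in evals)
```

A Krylov method therefore needs at most three iterations to annihilate the residual, which is exactly the MINRES bound quoted above.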

Toeplitz matrices
When constant-coefficient PDEs are discretized on uniform grids on simple domains, Toeplitz and multilevel Toeplitz matrices may arise. In this case, the highly structured nature of these matrices can be exploited to construct fast solvers. Although these structures have been utilized in the context of PDEs for some time [176,177], recent interest in the discretization of fractional differential equations (FDEs) has provided a new source of (multilevel) Toeplitz problems [178,179]. This is because nonlocal fractional operators make the solution of the required linear systems extremely challenging unless tailored approaches are employed. When one-dimensional linear constant-coefficient ordinary differential equations with Dirichlet boundary conditions are discretized on a uniform grid by, for example, standard finite difference methods, the resulting coefficient matrix is Toeplitz, that is, of the form

$$(A)_{ij} = a_{i-j}, \qquad i, j = 1, \ldots, n.$$

If the boundary conditions are instead periodic then A is not only Toeplitz, but also circulant, that is, (A)_{ij} = a_{(i − j) mod n}. In two or more spatial dimensions the discretization of PDEs no longer leads to Toeplitz matrices in general. However, when linear constant-coefficient problems are discretized on uniform tensor-product grids we may obtain multilevel Toeplitz matrices. The best known of these are block Toeplitz with Toeplitz blocks (BTTB) matrices for two-dimensional problems.
Although for ordinary or partial differential equations the resulting (multilevel) Toeplitz matrices are typically sparse, the nonlocality of fractional operators means that matrices arising from FDE applications are dense. It is unusual for Krylov subspace methods to be competitive with direct methods for dense matrices, but here we can compute matrix-vector products in O(n log(n)) operations by embedding A in a 2n × 2n circulant matrix and utilizing the fast Fourier transform (FFT) [180, section 4.8.2].
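The circulant-embedding matvec just described can be sketched in a few lines of numpy; the dense Toeplitz matrix is built only to verify the result, and the function name is our own:

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, x):
    """Multiply a Toeplitz matrix by x in O(n log n): embed it in a 2n x 2n
    circulant, whose action is a circular convolution computed by the FFT."""
    n = len(x)
    # First column of the embedding circulant: [a_0..a_{n-1}, 0, a_{-(n-1)}..a_{-1}]
    c = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n]   # complex in general; imaginary part is O(eps) for real data

rng = np.random.default_rng(2)
n = 64
col = rng.standard_normal(n)                                  # a_0, a_1, ..., a_{n-1}
row = np.concatenate([[col[0]], rng.standard_normal(n - 1)])  # a_0, a_{-1}, ..., a_{-(n-1)}
x = rng.standard_normal(n)

# Dense Toeplitz matrix (A)_{ij} = a_{i-j}, for verification only
idx = np.arange(n)
D = idx[:, None] - idx[None, :]
T = np.where(D >= 0, col[D.clip(min=0)], row[(-D).clip(min=0)])

err = np.linalg.norm(toeplitz_matvec(col, row, x) - T @ x)
```

This is the O(n log n) product that makes Krylov methods viable for the dense FDE matrices mentioned above.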
Toeplitz matrices from differential equations are usually generated by a symbol, f, in the sense that the matrix entries are its Fourier coefficients, that is,

$$a_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(\theta)\, e^{-\mathrm{i} k \theta} \, d\theta, \qquad k \in \mathbb{Z}.$$

(Multilevel Toeplitz matrices lead to multivariate symbols.) This connection with the symbol is extremely powerful, and many properties of A can be deduced from f. For example, if f ∈ L^1((−π, π)) is real-valued then A must be Hermitian and

$$\operatorname{ess\,inf} f \;\le\; \lambda_j(A) \;\le\; \operatorname{ess\,sup} f, \qquad j = 1, \ldots, n,$$

where the inequalities are strict if f is not constant almost everywhere. Asymptotic (in n) eigenvalue or singular value distributions of A can also be determined from f [181-184]; since the symbols of differential and integral operators typically have zeros, these results clearly illustrate why A is often extremely ill-conditioned for large n (as is well known).
Noticing that the inverse of a circulant matrix can also be applied to a vector in O(n log(n)) operations using the FFT, Strang [185] and Olkin [186] proposed certain circulant preconditioners. This led to the development of a number of other preconditioners based on circulants [187,188], fast transforms [189,190], banded Toeplitz matrices [191,192], and multigrid [193] (see the survey by Chan and Ng [194] and the books [195,196]). For multilevel Toeplitz matrices, multilevel circulant preconditioners are not optimal since iteration numbers increase with n [197,198] but multigrid methods, for example, can give mesh-independent convergence rates [199], and semi-circulant preconditioners can be effective for certain PDE problems [176,177].
The vast majority of preconditioners proposed for Toeplitz problems were originally designed for Hermitian matrices generated by a smooth symbol f , for which estimates of the eigenvalues of the preconditioned matrix can be obtained and related to the convergence rate of the Krylov method [191,200]. More recently these sorts of results have been extended to a wider class of problems, known as (block) generalized locally Toeplitz matrices [201,202]. For non-Hermitian matrices it is often possible to prove that the singular values and/or eigenvalues of the preconditioned matrix are clustered. Unfortunately, with the exception of LSQR or LSMR (or similar symmetrizations), it is not possible to link these results to convergence rates of Krylov subspace methods. An exception can be made in specific cases, for example [177], in which the condition number of the eigenvector matrix is also bounded. Another possibility is to use the specific Toeplitz structure to symmetrize, as described in Section 5.

Time-dependent problems
Time-dependent problems present new, often significant, challenges for efficient numerical methods. The most common way to solve time-dependent PDEs is by a method-of-lines approach, which gives rise to a sequence of problems (one for each time step) that must be solved sequentially. In this case, preconditioners are typically based on those for the analogous steady problem. These time-dependent systems can sometimes be easier to solve than their steady counterparts, in the sense that matrix blocks can have more convenient spectral properties. This is because time-stepping typically introduces a positive definite matrix, for example a mass matrix or identity matrix scaled by the inverse of the time step. For large time steps, however, the effect is usually small, and a good preconditioner should work for the range of time steps appropriate for the given problem. An alternative approach is to solve for all time steps simultaneously; this is sometimes called an "all-at-once" or "one-shot" method. Any parallel time integration scheme that can be described by a linear operator may be used as a preconditioner, with an overview of possible methods, for example domain decomposition and multigrid approaches, given in the survey by Gander [203]. Other techniques include matrix equation-based preconditioners [204] (with a flexible Krylov method if the matrix-equation solver is nonlinear) or, in the case of uniform time steps, preconditioners that exploit block Toeplitz structure [205,206]. A disadvantage of the all-at-once approach is that the sizes of all time steps need to be determined in advance. However, this can still be useful in certain settings, for example in predictor-corrector methods [207] or in optimization problems [208-210].

Problems with heterogeneous coefficients
Problems with heterogeneous coefficients are in general more difficult to precondition than those without. They occur frequently when modeling real-world phenomena, often arising in nonlinear problems. The simplest such problem is a diffusion problem with varying diffusivity, that is, (3) with a nonconstant diffusion coefficient. An elementary option that can be effective for such problems is to first scale the coefficient matrix by its diagonal [211], which can reduce the effect of the coefficient variation, and then apply a standard Poisson preconditioner. Alternatively, many of the techniques described above, such as multigrid [212,213] and domain decomposition [88,211,214], have been adapted to deal with problems with heterogeneous coefficients by, for example, using more sophisticated smoothers, scalings, or coarse grid solves. Recently, insights into the performance of simple Laplace preconditioners for (3) were obtained via careful eigenvalue analysis [67]; this, similarly to [211], showed that in the case of heterogeneous problems, looking at the condition number of the preconditioned coefficient matrix alone may not be indicative of the effectiveness of the preconditioner. Finally, we note that it may be advantageous to reformulate the system. For example, (3) can be rewritten as a system of first-order equations, for which preconditioners may be more readily constructed [168,215].

Other methods
A sparse approximate inverse (SPAI) preconditioner for A may be defined as P^{-1} = R, where R minimizes ||I − AR|| subject to R possessing some prescribed sparsity pattern. This is most commonly considered in the Frobenius norm (as initially investigated in [216]; see also [217-223]), for which the minimization problem reduces to separate least-squares problems for each column of R that may be solved in parallel. Among many important works on the subject, we refer to [4, chapter 5] for a detailed summary of such preconditioners, to [224] for a description of Grote and Huckle's successful SPAI algorithm for dynamically defining a sparsity pattern for R, to [225] for a discussion of nonsymmetric linear systems, and to [226-228] for comparative studies of different SPAI preconditioners and sparsity structures. In the context of PDEs, a key application is the use of Frobenius norm-based preconditioners for large dense systems resulting from boundary integral formulations of electromagnetic problems [217,218,229,230]. Other important applications of approximate inverse preconditioners arise in elasticity problems [219,222,223], magnetohydrodynamics [231], and acoustic simulations [220]. It is also possible to construct preconditioners based on fixed point iterations. Although simple approaches based on Jacobi and SOR are not often used on their own for PDE problems, splittings that are tailored to the PDE in question can form the basis of effective preconditioners. One example is a splitting according to the spatial dimension, which is the basis for the alternating direction implicit method [232] and related approaches like the dimension splitting method [152,153], and the Hermitian/skew-Hermitian splitting method (see Section 3.1.3).
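The column-by-column least-squares construction of a Frobenius-norm SPAI preconditioner can be sketched as follows (dense numpy for clarity; the function name spai, the tridiagonal test matrix, and the choice of the pattern of A itself are our illustrative assumptions):

```python
import numpy as np

def spai(A, pattern):
    """Minimize ||I - A R||_F over R with a fixed sparsity pattern:
    one independent least-squares problem per column of R."""
    n = A.shape[0]
    R = np.zeros((n, n))
    for j in range(n):
        J = np.flatnonzero(pattern[:, j])     # allowed nonzero rows in column j
        e = np.zeros(n)
        e[j] = 1.0
        R[J, j] = np.linalg.lstsq(A[:, J], e, rcond=None)[0]
    return R

n = 20
A = 3 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # diagonally dominant tridiagonal
R = spai(A, A != 0)                                     # sparsity pattern of A itself
fro_spai = np.linalg.norm(np.eye(n) - A @ R)
fro_jacobi = np.linalg.norm(np.eye(n) - A @ np.diag(1.0 / np.diag(A)))
```

Since the diagonal pattern is contained in the pattern of A, the SPAI residual is never worse than simple Jacobi scaling, and the column problems are embarrassingly parallel, which is the main attraction of the approach.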
A deflation preconditioner is a projection that aims to (approximately) remove a subspace responsible for slow convergence [19,86,233,234]. A common motivation is to remove the eigenspace associated with small eigenvalues, although in principle any subspace could be removed. Alternatively, it is possible to augment the Krylov subspaces in which a solution is sought by adding search directions that accelerate convergence [19,234]. Closely related to these ideas are polynomial preconditioners, that is, preconditioners of the form P^{-1} = p(A), for some polynomial p. These aim to incorporate spectral information, so that the properties of the preconditioned system P^{-1}Ax = P^{-1}b, where P^{-1}A = AP^{-1} = Ap(A), are more favorable [16, section 12.3].
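One classical instance of a polynomial preconditioner is a truncated Neumann series: if A = I − N with ||N|| < 1, then p(A) = Σ_{k≤d} N^k approximates A^{-1}, and I − p(A)A = N^{d+1} shrinks geometrically in d. A minimal sketch (the scaling of N is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 5
N = rng.standard_normal((n, n))
N *= 0.5 / np.linalg.norm(N, 2)     # scale so that ||N||_2 = 0.5
A = np.eye(n) - N                   # A = I - N with ||N|| < 1

# Truncated Neumann series: p(A) = sum_{k=0}^{d} N^k approximates A^{-1}
P_inv = sum(np.linalg.matrix_power(N, k) for k in range(d + 1))

# I - p(A) A = N^{d+1}, so the residual is bounded by ||N||^{d+1}
residual = np.linalg.norm(np.eye(n) - P_inv @ A, 2)
```

In practice p(A) is never formed; its action on a vector costs d matrix-vector products with A, which is why polynomial preconditioners suit matrix-free settings.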
We have already seen how dense Toeplitz problems may be solved by Krylov subspace methods. Iterative methods can also be applied to other dense, but data-sparse, matrices stored in a format that enables fast matrix-vector products, where a matrix is data-sparse if it can be stored in some representation using far fewer than n^2 degrees of freedom. For PDEs, the most common structure is the presence of low-rank off-diagonal blocks. Intuitively, this is because far-range interactions tend to be relatively weak, and so can be well approximated by low-rank structures. Data-sparse formats include ℋ-matrices [136,235,236], ℋ^2-matrices [237], hierarchically semi-separable matrices [238], the fast multipole method [239], and recursive skeletonization [240]. Often, by storing matrices in these formats, it is possible to accelerate matrix-vector products within Krylov subspace methods. However, it may also be possible to use these approaches for preconditioning [241,242].
We conclude this section by commenting on a technique that can be applied when a problem has two (or more) preconditioners that accelerate the convergence rate of a linear system. While it may be difficult to formulate a new preconditioner that captures the good behavior of each individual preconditioner, they can be combined within a multipreconditioned iterative solver [243,244]; this has been successfully used in the solution of parameter-dependent problems [245] and two-phase flow [246]. More generally, when dealing with sequences of problems it can make sense to reuse information from previously computed systems, including preconditioners [19, section 14] (see also Section 4.4).

PRECONDITIONERS FOR OPTIMIZATION PROBLEMS
Another major field of research which necessitates the solution of linear systems of equations describing a physical process is that of optimization. As an illustration, if one wishes to solve the constrained quadratic programming problem

$$\min_x \; \tfrac{1}{2} x^T H x - c^T x \quad \text{subject to} \quad Bx = d,$$

where H is symmetric positive semidefinite, one may construct the Lagrangian

$$\mathcal{L}(x, y) = \tfrac{1}{2} x^T H x - c^T x + y^T (Bx - d).$$

Finding a stationary point of ℒ then leads to a linear system exactly of the saddle point form (6), with A = H. Further matrices may arise within this system: for example, when solving problems with additional inequality constraints imposed on x using an interior point method, terms arising from a barrier function enforcing these constraints lead to an additional diagonal matrix which is added to A [247]. Further matrix terms can also arise in the (2, 2)-block of the saddle point system as a result of regularization terms applied to solve optimization problems. Often, for example in the case of linear programming problems [248-251], A is a diagonal matrix since H = 0. In this case it is helpful to reduce the system to the equivalent Schur complement system (7) (often referred to as the normal equations within the optimization literature), provided A is invertible.
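The passage from the Lagrangian to the saddle point system can be checked numerically. The following sketch uses random dense data with H taken positive definite so the minimizer is unique (an illustrative assumption); it solves the KKT system and verifies feasibility and optimality against another feasible point:

```python
import numpy as np

rng = np.random.default_rng(9)
n, m = 6, 2
G = rng.standard_normal((n, n))
H = G @ G.T                        # SPD here, so the constrained minimizer is unique
c = rng.standard_normal(n)
B = rng.standard_normal((m, n))
d = rng.standard_normal(m)

# KKT (saddle point) system from stationarity of the Lagrangian
K = np.block([[H, B.T], [B, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([c, d]))
x, y = sol[:n], sol[n:]

obj = lambda z: 0.5 * z @ H @ z - c @ z
constraint_err = np.linalg.norm(B @ x - d)

# Any other feasible point x + z (with B z = 0) has a larger objective
Z = np.linalg.svd(B)[2][m:].T      # orthonormal basis for the null space of B
z = Z @ rng.standard_normal(n - m)
feasible_gap = obj(x + z) - obj(x)
```

The multiplier block y is exactly the Lagrange multiplier vector, and the gap equals ½ zᵀHz > 0, confirming that the saddle point solve recovers the constrained minimizer.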
In optimization it is important to consider the relative advantages of solving the Schur complement system versus the saddle point system, especially if A^{-1} has a particularly convenient structure. The Schur complement is guaranteed to be symmetric positive semidefinite (and positive definite if B has full row rank); however, the reduced system may be significantly more ill-conditioned than the matrices A and B themselves, and is typically denser. Examining instead the saddle point system circumvents these issues; however, this system is indefinite and is (often significantly) larger than the Schur complement matrix.
Another important consideration is that, in optimization, frequently very little is known about the structure of the linear system, apart from the matrix being sparse in many applications (which in general we assume in this section).
This contrasts with preconditioning PDE problems, where some more descriptive properties of the system are generally known a priori, due to the features of the analogous infinite-dimensional problems for instance. Therefore, the strategies used to precondition systems in optimization often need to be more general, to account for varying structures arising in the sub-blocks of the linear systems.
In this section we give a representative summary of some classes of methods used to precondition linear systems arising from optimization problems in the literature. We wish to emphasize that many strategies arising from the two areas of PDEs and optimization can have considerable similarities and can sometimes be transferred between the fields, notwithstanding the particular challenges that optimization problems pose. For instance, by exploiting the saddle point structures of the matrices that often result, block preconditioners for PDEs like those in Section 3.2 can also be applied to linear systems arising from constrained optimization problems. It is also possible to apply many of the approaches outlined in this section to systems arising from PDEs, and we give examples in Section 4.6, which considers optimization problems with PDEs themselves acting as constraints. We present the methods in this way because we wish to provide a self-contained summary of some preconditioners within the field of optimization, which we believe cannot easily be found elsewhere in the literature, and which highlights in particular some methods that require relatively little knowledge about the precise structures of the matrices being tackled. Indeed, some of the preconditioners we discuss-notably incomplete factorizations-can be applied to any coefficient matrix A, not just those arising from the optimization problems considered in this section.

Incomplete factorizations
As discussed previously, solving large-scale problems using direct methods can rapidly become impractical as the dimension of the system increases. Computing an LU factorization of a matrix M = LU (or M = LDU) can result in L, U being significantly less sparse than M, a phenomenon known as fill-in, although there are strategies to mitigate this, including reorderings [1,252,253]. The same applies to computing a Cholesky factorization M = LL^T of a Hermitian positive definite matrix (or a factorization M = LDL^T of a symmetric matrix that need not be positive definite). We note that M could denote a matrix relating to the entire system being solved, or a sub-block of it. The idea of incomplete factorization is to form an approximation of such a factorization that is at the same time cheap to apply. As with preconditioning in general, this necessitates a trade-off between computational efficiency and accuracy of approximation, which in this case corresponds to how aggressively one drops fill-in versus how potent an approximation results. Essentially, algorithms are required that determine relevant criteria for dropping fill-in, based on position in the matrix or on the value/magnitude of the entry. An incomplete Cholesky factorization of a matrix M then leads to

$$M = \bar{L} \bar{L}^T + E,$$

where L̄ is the computed incomplete factor and E is the error induced by the dropped terms. The same applies to incomplete LU (ILU) factorization. There has been a substantial amount of research into devising incomplete LU and incomplete Cholesky factorizations, including within preconditioners [252,254-258].
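The dropping idea can be illustrated with a small dense sketch of a zero-fill incomplete Cholesky factorization (the function name ic0 and the tridiagonal test matrix are our illustrative choices; production codes work with sparse storage):

```python
import numpy as np

def ic0(A):
    """Zero-fill incomplete Cholesky: L keeps the sparsity pattern of tril(A),
    and any update that would create fill-in is simply dropped."""
    n = A.shape[0]
    pattern = np.tril(A) != 0
    L = np.tril(A).astype(float)
    for j in range(n):
        L[j, j] = np.sqrt(L[j, j])
        for i in range(j + 1, n):
            if pattern[i, j]:
                L[i, j] /= L[j, j]
        for k in range(j + 1, n):              # right-looking trailing update
            for i in range(k, n):
                if pattern[i, k]:              # drop updates outside the pattern
                    L[i, k] -= L[i, j] * L[k, j]
    return L

# Tridiagonal SPD test matrix: exact Cholesky produces no fill-in here,
# so IC(0) coincides with the exact factorization and the error term vanishes
n = 10
A = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L = ic0(A)
chol_err = np.linalg.norm(A - L @ L.T)
```

For matrices whose exact factors do fill in, the same routine returns a genuinely incomplete factor, and the size of the dropped terms governs the quality of the resulting preconditioner.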
One possible choice is to select sparsity patterns for the incomplete Cholesky or ILU factors which are the same as that of the matrix M itself. This leads to the ILU(0) or IC(0) factorizations. While these are comparatively cheap to work with, they may as a result contain rather little information about the matrix M. Nonetheless, this approach has been shown to be effective for preconditioning certain classes of matrices [13,259]. The methodology can be generalized to IC(ℓ) or ILU(ℓ) preconditioners (where ℓ is a positive integer) to achieve different levels of fill-in. We refer to [4, sections 3 and 4] for a thorough survey of such methods, as well as research on the existence and stability of block factorizations and preconditioners.
We highlight that triangular factorizations need not be computed by the Cholesky process for positive definite matrices: for example, in [260] Benzi and Tůma compute such a factorization by orthogonalization with respect to the inner product generated by M. Among much other research in the area of incomplete factorization preconditioners, there has also been a focus on multilevel and multifrontal preconditioners [257,258,261,262], some of which are well suited to parallel computation.
Within optimization, incomplete factorizations form a valuable tool, which can be used as an effective preconditioner for the normal equations, or as an approximation for one block of a saddle point system. There have been many such applications within the field, for example preconditioning linear systems arising from the solution of linear programming problems [248,263] (see also [249]), nonlinear optimization [264,265], least-squares problems (as in Section 4.5), and many others. Incomplete factorizations are also widely used in applications beyond optimization, since they can be computed using only matrix entries.

Constraint preconditioning
One key class of preconditioners for saddle point systems arising in optimization, of the form (6), is that of constraint preconditioners [266-268]:

$$\mathcal{P} = \begin{bmatrix} \hat{A} & B^T \\ B & 0 \end{bmatrix}. \qquad (9)$$

These preconditioners are so called because at each iteration of an appropriate Krylov method, the constraints (corresponding to the matrix B) are satisfied exactly in exact arithmetic [266,269]. If A is symmetric positive definite on the kernel of B, which for this discussion we assume has full row rank, then Â is typically selected to be another symmetric positive definite matrix. We highlight that 𝒫 is not a symmetric positive definite matrix, and yet it may be applied within Krylov methods such as CG and MINRES: this is equivalent to the projected CG and projected MINRES algorithms [269], as the preconditioner is positive definite on the constraint manifold [266]. Furthermore, all eigenvalues of the preconditioned matrix 𝒫^{-1}𝒜 that are not equal to one are equal to eigenvalues of Â^{-1}A [267]. An effective constraint preconditioner requires a method to construct a matrix of the form (9), ideally such that the resulting matrix Â is a potent approximation of A and such that its inverse is cheap to apply to a vector. One widely used approach for constructing such a preconditioner is to make use of a Schilders factorization [9,270,271], whereby one decomposes (possibly after reordering of the linear system) B = [B_1 B_2] such that B_1 is invertible, and then takes as a preconditioner a structured triple factorization built from blocks D_1, D_2, L_1, L_2, and G. This factorization is guaranteed to equate to a matrix of the form (9), for any D_1, L_1, G of the correct dimensions, and for nonsingular matrices D_2, L_2. The key task at this point is to construct matrices that well approximate the (1, 1)-block of the saddle point system (6). We note that alternative factorizations may also be derived, for instance based on null-space factorizations (see [272] for a summary of some such approaches, and [151] for a statement in the context of constraint preconditioning).
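A well-known spectral property of constraint preconditioners is that the preconditioned matrix has the eigenvalue one with multiplicity (at least) 2m [267]. The following sketch checks this on random dense data (both A and its approximation Â are random SPD matrices, an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 10, 4
M1 = rng.standard_normal((n, n))
A = M1 @ M1.T + np.eye(n)          # SPD (1,1)-block of the saddle point matrix
M2 = rng.standard_normal((n, n))
Ahat = M2 @ M2.T + np.eye(n)       # SPD approximation of A used in the preconditioner
B = rng.standard_normal((m, n))    # full row rank (generically)

Acal = np.block([[A, B.T], [B, np.zeros((m, m))]])
Pcal = np.block([[Ahat, B.T], [B, np.zeros((m, m))]])   # constraint preconditioner

evals = np.linalg.eigvals(np.linalg.solve(Pcal, Acal))
num_unit = int(np.sum(np.abs(evals - 1.0) < 1e-4))      # at least 2m eigenvalues are 1
```

The remaining n − m eigenvalues are governed by how well Â approximates A, which is why the quality of the (1,1)-block approximation drives the convergence of projected CG or MINRES.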
Constraint preconditioners may also be constructed for generalized saddle point systems, of the form (6) but with a nonzero (2, 2)-block. Such systems can arise from optimization problems with additional regularization, and associated constraint preconditioners are considered in, for example, [265,270,271,273].
Within the field of optimization, constraint preconditioners have been devised for linear and quadratic programming problems in [247,248,266,267,[273][274][275], and additionally for nonlinear optimization problems in [265,267,268,271,276,277].
We highlight a number of further studies into the field of constraint preconditioning. Theoretical properties and convergence results are studied in detail in [278], constraint preconditioners for nonsymmetric indefinite linear systems are analyzed in [279], and a new limited memory approach with the objective of improving the stability and robustness of constraint preconditioning is described in [280]. Such preconditioners have also been devised for optimization problems with PDE constraints [281], including within the field of topology optimization involving PDEs [282].

Augmented Lagrangian preconditioning
Another class of preconditioners for saddle point systems (6) from optimization are the so-called augmented Lagrangian preconditioners:

$$\mathcal{P} = \begin{bmatrix} A + B^T W^{-1} B & 0 \\ 0 & W \end{bmatrix}. \qquad (10)$$

This type of preconditioner is particularly useful if the (1, 1)-block A of (6) is singular, but is positive definite on the kernel of B. In this case many classical saddle point preconditioners cannot be straightforwardly applied, but (10) is invertible for any symmetric positive definite matrix W, and furthermore is symmetric positive definite, so it may be applied within the MINRES algorithm. The eigenvalues of the resulting preconditioned system are analyzed in [283], demonstrating many attractive properties, and much original work on such preconditioners was tailored to solving PDEs; see [143,144,284] for instance. Of course, the effectiveness of the preconditioner depends on a suitable choice of W. One frequent choice is W = γ^{-1} I, which is analyzed in detail in [285], especially in relation to the selection of the parameter γ > 0.
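The key invertibility claim can be checked directly. The following sketch builds a deliberately rank-deficient positive semidefinite A (a random low-rank Gram matrix, an illustrative assumption) together with W = γ^{-1}I, and verifies that the augmented (1,1)-block restores positive definiteness:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, gamma = 9, 3, 10.0
G = rng.standard_normal((n, n - 1))
A = G @ G.T                         # singular PSD (1,1)-block; PD on ker(B) generically
B = rng.standard_normal((m, n))     # full row rank (generically)

# Augmented Lagrangian preconditioner with W = gamma^{-1} I
P = np.block([[A + gamma * (B.T @ B), np.zeros((n, m))],
              [np.zeros((m, n)), (1.0 / gamma) * np.eye(m)]])

A_singular = np.linalg.matrix_rank(A) < n
P_spd = bool(np.min(np.linalg.eigvalsh(P)) > 1e-10)
```

Even though A alone cannot be inverted, A + γBᵀB is symmetric positive definite whenever ker(A) ∩ ker(B) = {0}, so the preconditioner is admissible for MINRES.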
One link with the field of optimization (relating in particular to the naming of the preconditioner) is through the augmented Lagrangian idea in optimization [9,286,287], particularly useful in cases where A is an ill-conditioned matrix, which notes that the system (6) is equivalent to the system

$$\begin{bmatrix} A + B^T W^{-1} B & B^T \\ B & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} f + B^T W^{-1} g \\ g \end{bmatrix}.$$
Indeed the augmented Lagrangian preconditioner has been applied widely in optimization: one of the first such applications was to primal-dual interior point methods for linear and quadratic programming problems [288], see also [285,289] for other applications in optimization. The augmented Lagrangian preconditioner idea has been extended to nonsymmetric saddle point systems in [143,290], and has also found applicability to regularized systems where the (2, 2)-block of the saddle point system is nonzero [289,291]. It is also possible to construct block triangular analogues of (10) for saddle point systems [9,143,288,290,291].

Low-rank updates of preconditioners for sequences of linear systems
A research area of active recent interest within the optimization community is the solution of sequences of linear systems, each of which may, for instance, be of saddle point form (6). These can arise, for example, through the solution of nonlinear equations F(x) = 0, or of an optimization problem, using a Newton- or Broyden-type method, yielding a sequence of subproblems of the form

$$J(x_k)\,(x_{k+1} - x_k) = -F(x_k),$$

where J(x_k) is the Jacobian evaluated at the current iterate x_k. It has been found that constructing low-rank updates of preconditioners may be highly effective. For instance, assuming one starts with an effective preconditioner 𝒫_0 for J_0 := J(x_0), one may successively "correct" the preconditioner at each iteration using rank-1 updates

$$\mathcal{P}_{k+1} = \mathcal{P}_k + u_k v_k^T,$$

and then use the Sherman-Morrison formula to apply 𝒫_{k+1}^{-1} to J_{k+1} := J(x_{k+1}) (see [292] for a discussion). This idea may of course be generalized to updated preconditioners of higher rank. In a similar spirit to that of SPAI preconditioners, for instance, such preconditioners may be constructed with the reasonable objective that ||I − 𝒫_k^{-1} J_k|| be bounded and small.
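Applying the updated inverse never requires forming the new preconditioner explicitly: the Sherman-Morrison formula only needs two solves with the previous preconditioner (or a reuse of its factorization). A minimal sketch, with all names and the random data our own illustrative choices:

```python
import numpy as np

def apply_updated_inverse(solveP, u, v, b):
    """Apply (P + u v^T)^{-1} b via the Sherman-Morrison formula,
    reusing an existing solver for P."""
    Pinv_b = solveP(b)
    Pinv_u = solveP(u)
    return Pinv_b - Pinv_u * (v @ Pinv_b) / (1.0 + v @ Pinv_u)

rng = np.random.default_rng(7)
n = 12
P0 = rng.standard_normal((n, n)) + n * np.eye(n)   # previous preconditioner (invertible)
u = 0.1 * rng.standard_normal(n)                   # rank-1 correction: P1 = P0 + u v^T
v = 0.1 * rng.standard_normal(n)
b = rng.standard_normal(n)

x = apply_updated_inverse(lambda y: np.linalg.solve(P0, y), u, v, b)
sm_err = np.linalg.norm((P0 + np.outer(u, v)) @ x - b)
```

In a practical code the lambda would be replaced by triangular solves with a stored factorization of 𝒫_0, so each update costs only O(n^2) (or less, for sparse factors) rather than a fresh factorization.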
Crucial early work in this direction includes the papers of Martínez [293,294], where improving a preconditioner through low-rank updating is suggested. The preconditioners derived in these works were termed secant preconditioners, because they were required to satisfy the "secant property"

$$\mathcal{P}_{k+1} (x_{k+1} - x_k) = F(x_{k+1}) - F(x_k).$$

The idea of updating a previously computed matrix to accelerate the CG method, by saving information in the form of a quasi-Newton matrix, is also examined in [295], based on earlier research in this area such as [296,297].
Preconditioners for the inexact Newton method were devised, for example, in [292,298], based on Broyden-type rank-1 updates [292] and BFGS rank-2 updates [298], respectively. Preconditioning of nonsymmetric systems using updates of incomplete factorizations is discussed in [299], and updated preconditioners which may be applied in a matrix-free setting are derived in [300].
Another research area of interest in optimization is the construction of low-rank updates of constraint preconditioners for quadratic programming problems [301][302][303]. In particular the paper [303] extends work in [304], where limited memory preconditioners for symmetric positive definite systems are considered. We also highlight the work [305], where low-rank updates of preconditioners are constructed for data assimilation, with application to the 4D-Var weather prediction model.
Finally, we note that low-rank updates of preconditioners can also be helpful when solving sequences of linear systems within applications other than optimization, for example the solution of PDEs (particularly time- or parameter-dependent problems), nonlinear problems, eigensolvers, and rational Krylov methods for matrix functions. In the particular, but important, case that these systems are all of the form A = M + σI, with σ changing, modifying a given preconditioner for such a sequence is considered in, for example, [306][307][308]. We refer to [309] for a recent survey of low-rank updates of preconditioners, which considers a number of the above applications.

Preconditioners for least-squares problems
Another specific class of problems that has gained much interest in recent years is that of least-squares problems:

min_x ||b − Ax||_2,  (11)

where A ∈ C^{m×n}. Differentiating the square of the functional in (11) with respect to x to find the stationary point leads to the normal equations:

A*A x = A*b,  (12)

which in turn may easily be shown to be equivalent to solving the augmented system:

[ I   A ] [ r ]   [ b ]
[ A*  0 ] [ x ] = [ 0 ],

where r = b − Ax is the residual. Similarly to when we consider the respective merits of solving saddle point systems of general form (6) versus Schur complement systems (7), there are advantages and disadvantages to solving the augmented system and the normal equations.
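The equivalence of the two formulations is easy to verify numerically; a small numpy sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 8
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Normal equations: A^T A x = A^T b.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Augmented (saddle point) system in the residual r = b - A x and x:
# [ I   A ] [r]   [b]
# [ A^T 0 ] [x] = [0]
K = np.block([[np.eye(m), A], [A.T, np.zeros((n, n))]])
rx = np.linalg.solve(K, np.concatenate([b, np.zeros(n)]))
r, x_aug = rx[:m], rx[m:]

assert np.allclose(x_ne, x_aug)       # same least-squares solution
assert np.allclose(r, b - A @ x_aug)  # first block recovers the residual
```

The augmented system is larger but symmetric indefinite and avoids the squaring of the condition number inherent in forming A*A explicitly.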
We first refer to two important articles summarizing developments in preconditioning for least-squares problems: [310] considers the solution of the normal equations using the conjugate gradient for least squares (CGLS) method before proposing new strategies, and [311] presents a thorough numerical study of preconditioners for the normal equations and the augmented system. We summarize the findings of these two papers very briefly here.
A number of highly effective strategies exist for preconditioning the normal equations (12). Apart from Jacobi (diagonal) preconditioning (see [311]), one may apply incomplete Cholesky methods (provided A has full column rank), including limited memory approaches [312][313][314]. One may also make use of incomplete orthogonalization methods [315,316], including incomplete QR (IQR) decompositions [317,318]; the variant based on modified Gram-Schmidt orthogonalization with no dropping in Q is termed compressed incomplete modified Gram-Schmidt. IQR methods have also been extended to multilevel IQR approaches [319]. Another valuable class of approaches for the normal equations is that of robust and balanced incomplete factorizations, see [260,310,320].
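To make the simplest of these, Jacobi preconditioning, concrete: diagonally preconditioning the normal equations amounts to symmetrically scaling A*A by its diagonal, which removes the ill-conditioning caused by badly scaled columns of A. A short numpy sketch with illustrative column scales:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 30, 10
# Columns with wildly different scales make A^T A very ill-conditioned.
A = rng.standard_normal((m, n)) * np.logspace(0, 6, n)
C = A.T @ A

d = np.sqrt(np.diag(C))        # Jacobi preconditioner D = diag(A^T A)
C_scaled = C / np.outer(d, d)  # D^{-1/2} (A^T A) D^{-1/2}

assert np.allclose(np.diag(C_scaled), 1.0)          # unit diagonal after scaling
assert np.linalg.cond(C_scaled) < np.linalg.cond(C)  # conditioning improves
```

This costs almost nothing to construct and is often a sensible baseline before trying incomplete factorizations.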
We also highlight a number of other developments regarding the iterative solution of least-squares problems, including the limited memory preconditioner for the normal equations developed in [2], preconditioners based on approximate inverses [323,324], the solution of such problems based on augmented matrix methods [325, chapter 7], and preconditioners based on LU factorizations for both the normal equations [326] and the augmented system [327]. Specialized preconditioners can also be devised for least-squares problems with more specific properties, such as weighted Toeplitz least-squares problems [328] (cf. Section 3.3).

4.6 Link to PDEs and PDE-constrained optimization
As above, although PDEs and optimization form two separate branches of mathematics, there is considerable scope for many of the preconditioning strategies presented in this section to be used to tackle the linear systems discussed in Section 3 that arise from discretizations of PDEs. For example, ILU factorizations have been extensively applied to PDE problems (see [329,330] for instance), and incomplete factorizations find considerable applicability as smoothers within multigrid [68]. Constraint preconditioners [278,[331][332][333] and augmented Lagrangian preconditioners [143,144,284,334] have also been widely used when solving PDEs. However, the natures of PDE and optimization problems can also be quite distinct, and so one must carefully tailor preconditioned iterative solvers to the problem at hand. For example, PDE problems typically offer more specific matrix structures which may be exploited by preconditioners, whereas optimization solvers frequently need to be more general. The differing philosophies behind these two classes of problems are one reason why the methods presented in Sections 3 and 4 are distinct.
We conclude this section with one particular class of problems in which PDEs and optimization problems intersect: that of PDE-constrained optimization. To give an illustration of iterative solvers for the resulting linear systems, consider the Poisson control problem: given a domain Ω ⊂ R^d, d ∈ {2, 3}, a target function û, and a positive regularization parameter β,

min_{u,f} (1/2) ||u − û||²_{L²(Ω)} + (β/2) ||f||²_{L²(Ω)}  subject to  −∇²u = f in Ω,  (14)

together with appropriate boundary conditions. The variables u and f are often referred to as the state variable and control variable, respectively. Discretizing this problem using finite elements leads to a system of the form [52,335]:

[ Q    0    K ] [ x_1 ]   [ b_1 ]
[ 0   βQ   −Q ] [ x_2 ] = [ b_2 ]
[ K   −Q    0 ] [  μ  ]   [ b_3 ],

where Q denotes a mass matrix and K a stiffness matrix, with x_1, x_2 corresponding to the discretizations of u, f, and μ to the discretized Lagrange multiplier. Poisson control problems of this type have been considered from both the infinite-dimensional [52,55,167,174,336] and finite-dimensional [281,335,337,338] points of view. In particular, by viewing the problem on the infinite-dimensional level, for instance by considering nonstandard norms, and then discretizing the resulting infinite-dimensional interpretation of the preconditioner, one may arrive at an optimal preconditioner [52,55]. This methodology may also be applied in the setting of constraint preconditioners [336], for instance. On the other hand, by considering the properties of the saddle point system purely on the discrete level, one may apply the block diagonal preconditioner [335]:

P = blkdiag( Q, βQ, (K + (1/√β)Q) Q^{-1} (K + (1/√β)Q) ),

which is also optimal. This approach may itself be extended to block triangular preconditioners [335,338], for instance. Both preconditioners lead to provably mesh- and β-independent solvers for the problem (14), and we note that the two preconditioning "routes" lead to different (if related) structures. Interestingly, each of the "infinite-dimensional" and "discrete" routes is known to deliver mesh- and parameter-robust solvers for particular PDE-constrained optimization problems for which the other approach, as yet, does not.
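The robustness of the discrete route can be checked numerically. The following sketch builds a hypothetical 1D analogue of the Poisson control system with piecewise linear finite elements (the matrix forms follow the standard formulation in the literature; all dimensions and parameter values are illustrative) and verifies that the generalized eigenvalues of the saddle point matrix with respect to the block diagonal preconditioner stay within a fixed interval as the mesh and β vary:

```python
import numpy as np
from scipy.linalg import eigh

def poisson_control_eigs(n, beta):
    """Generalized eigenvalues of the KKT matrix with respect to the
    block diagonal preconditioner, for a 1D Poisson control model problem."""
    h = 1.0 / (n + 1)
    # 1D piecewise linear FE stiffness and mass matrices (Dirichlet conditions).
    K = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h
    Q = h / 6 * (4 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1))
    Z = np.zeros((n, n))
    A = np.block([[Q, Z, K], [Z, beta * Q, -Q], [K, -Q, Z]])
    # Schur complement approximation S_hat = (K + Q/sqrt(beta)) Q^{-1} (K + Q/sqrt(beta)).
    KQ = K + Q / np.sqrt(beta)
    S_hat = KQ @ np.linalg.solve(Q, KQ)
    P = np.block([[Q, Z, Z], [Z, beta * Q, Z], [Z, Z, S_hat]])
    return eigh(A, P, eigvals_only=True)  # symmetric-definite pencil: real eigenvalues

# The preconditioned spectrum stays in a fixed interval as the mesh and beta vary,
# which underpins the observed mesh- and parameter-robust MINRES convergence.
for n in (16, 32):
    for beta in (1e-2, 1e-4):
        ev = poisson_control_eigs(n, beta)
        assert 0.3 < np.min(np.abs(ev)) < np.max(np.abs(ev)) < 1.7
```

The bounded interval observed here is exactly what MINRES convergence theory requires for iteration counts independent of the mesh size and of β.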
For instance, deriving preconditioners on the infinite-dimensional level and then discretizing these operators has led to potent solvers for the Stokes control problem [52], and a number of time-periodic problems [339]. On the other hand, considering the optimization problem using a purely discrete (saddle point) framework has similarly generated robust solvers for convection-diffusion control problems [340], and distributed time-dependent problems [341].
We highlight that there also exist a range of other methods for the Poisson control problem (14), for example modified Hermitian and skew-Hermitian splitting methods [342], nonsymmetric preconditioners which are not of block triangular form [343], and constraint preconditioners [281]. Of course, adjustments can be made to problems of the form (14) that lead to more complex structures. Preconditioning the linear systems arising from such problems is an active research area, with too many references to mention here. To give an illustration, there has been work on deriving preconditioners for problems with additional algebraic box constraints on the state and/or control variables [344][345][346][347][348], problems with limited observation [166], and parameter estimation problems [349], along with linear systems arising from a range of other scientific applications.

PRECONDITIONERS WITH "NONSTANDARD" GOALS
In the previous two sections we have summarized a range of problems for which preconditioned iterative methods have found considerable utility, and a number of solution strategies that have been employed. In each case the key objective was to achieve rapid and robust convergence of an iterative method for solving a linear system Ax = b, in a way which cannot be guaranteed by a direct method or an unpreconditioned iterative scheme. However, this is not the limit of the capability of preconditioners. For instance, one may also solve problems of a different general form using preconditioned solvers, such as eigenvalue problems, or alter the structure of an original problem statement. One may also have an additional motivation for implementing a preconditioner in the first place: for instance, reducing computer storage requirements, quantifying uncertainty in the solution, or implementing a method on high-performance computing (HPC) architectures. In this section we very briefly survey valuable research which has been undertaken on preconditioning strategies for achieving such "nonstandard" objectives.
• Eigenvalue problems: Consider the problem of finding a simple interior eigenvalue of a symmetric matrix A, and its associated eigenvector x. Algorithms for this include inverse iteration and Rayleigh quotient iteration (RQI), both of which require the solution of linear systems of the form

(A − σI) y_k = x_k,  (15)

where σ is a shift (fixed for inverse iteration, and updated to the current Rayleigh quotient for RQI), and x_k and x_{k+1} = y_k/||y_k|| are successive approximations to an eigenvector x.
When A is large, and matrix-vector products with A are easy to apply, it is attractive to solve (15) using a Krylov subspace method, such as MINRES. This leads to an inner-outer scheme, in which MINRES is the inner method, and inverse iteration or RQI forms the outer iteration. Although it is still possible to achieve the outer iteration convergence rate obtained when (15) is solved exactly, this generally necessitates that a more stringent MINRES tolerance be used as the outer iteration proceeds [350][351][352][353][354]. For a generic linear system, decreasing the tolerance would increase the number of MINRES iterations, but perhaps surprisingly this is not the case here: as the outer iteration proceeds, the right-hand side in (15) better approximates an eigenvector of A − σI, making the system easier to solve. As a result, the number of inner iterations remains bounded [350,355]. (This clearly illustrates the role the right-hand side plays in determining the convergence rate of a Krylov subspace method.) This property is usually destroyed when preconditioners are applied: although a good preconditioner will be effective at early outer iterations, decreasing the preconditioned MINRES tolerance typically increases the number of inner iterations. This is simply because the preconditioned right-hand side no longer approximates an eigenvector of the preconditioned matrix. To remedy this, it is necessary to choose the preconditioner so that the preconditioned right-hand side does approximate an eigenvector. This is a rather unorthodox view of preconditioning, but here it proves to be just what is needed.
To see how this is achieved, consider the idealized situation that (A − σI)y_k = x, that is, that the right-hand side is precisely the desired eigenvector. Then, for any positive definite preconditioner P, one may set u = (A − P)x and construct from it a "tuned" preconditioner P_tuned, a low-rank modification of P satisfying P_tuned x = Ax. It can then be shown that preconditioned MINRES applied to this system converges in exactly one step [355]. Although such a preconditioner is infeasible in general, replacing x by x_k and u by u_k := (A − P)x_k results in a preconditioner for which the number of inner iterations remains bounded as the outer iterations proceed. Moreover, by making use of the Sherman-Morrison formula, applying P_tuned is not significantly more challenging than applying P [180, section 2.1.4]. This idea has been analyzed for nonsymmetric generalized eigenvalue problems [356], and applied in inverse subspace iteration [357] and Arnoldi-type eigensolvers [358,359].
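A minimal numpy sketch of the tuning condition follows. The symmetric rank-1 formula below is one concrete choice satisfying P_tuned x = Ax, intended as an illustration of the algebra rather than as the precise construction of [355]; the toy matrices are illustrative, and definiteness of the tuned operator requires additional care in practice:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                          # symmetric test matrix
P = np.diag(np.diag(A)) + 5 * np.eye(n)    # a simple SPD preconditioner (toy choice)
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                     # plays the role of the eigenvector estimate

u = (A - P) @ x
P_tuned = P + np.outer(u, u) / (u @ x)     # symmetric rank-1 "tuning" of P

# The tuned preconditioner reproduces the action of A on x:
# P_tuned x = P x + u = P x + (A - P) x = A x.
assert np.allclose(P_tuned @ x, A @ x)
```

Since the modification is rank-1, solves with P_tuned reduce to solves with P plus a Sherman-Morrison correction, as noted above.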
We have focused here on preconditioning the inner solves within iterative eigenvalue solvers. However, for some Krylov-based eigenvalue solvers, for example certain Lanczos-based methods, it is possible to precondition the eigenvalue solver itself to accelerate convergence to the desired eigenvalues [360][361][362]. • Symmetrization: As we saw in Section 2, MINRES is only applicable when a real coefficient matrix is symmetric, while CG additionally requires positive definiteness. It is somewhat surprising, therefore, that at least in theory any linear system with a real nonsymmetric coefficient matrix A ∈ R^{n×n} can be solved using MINRES (or possibly CG) after appropriate preconditioning. This is because A can always be factored as A = Y^{-1}T with Y, T symmetric [363], which implies that YA must be symmetric. Here, the preconditioner Y serves not to improve the convergence rate but to enable MINRES or CG to be used, with their short-term recurrences, optimality properties, and comprehensive convergence theory. All is not rosy, unfortunately, since computing Y usually requires the Jordan normal form of A, which is prohibitively expensive to compute, and almost always results in a dense matrix Y that is too costly to apply. There are some exceptions, however: for certain structured matrices, a sparse Y is obtainable without recourse to the Jordan form. In these cases symmetrization is an attractive option. For example, if A is Toeplitz and Y is the reverse identity (exchange) matrix, then YA is a Hankel, hence symmetric, matrix. This well-known fact has been exploited to develop solvers that are typically faster than GMRES applied to the original system, thanks to the short-term recurrences of MINRES [364][365][366]. We stress that Y does not, in general, improve the convergence rate [367,368]; a second, symmetric positive definite, preconditioner is almost always required.
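For the Toeplitz case, the symmetrizer is simply a row reversal; a small numpy/scipy sketch with an arbitrary nonsymmetric Toeplitz matrix:

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(4)
n = 6
# A nonsymmetric Toeplitz matrix (independent first column and first row).
A = toeplitz(rng.standard_normal(n), rng.standard_normal(n))

Y = np.flipud(np.eye(n))  # reverse identity (exchange) matrix

YA = Y @ A
assert np.allclose(YA, YA.T)  # Y A is Hankel, hence symmetric
```

Since entry (i, j) of YA depends only on i + j, the product is Hankel by construction, and applying Y costs nothing beyond reindexing.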
However, in contrast to the nonsymmetric system, the preconditioner choice can then be guided by MINRES convergence theory, and mathematical guarantees of fast convergence can be obtained. More generally, any persymmetric matrix is symmetrized by the reverse identity matrix, while 2n × 2n Hamiltonian matrices are symmetrized by the matrix J = [0, I_n; −I_n, 0]. A matrix that is symmetrized by Y as described above is self-adjoint with respect to the symmetric bilinear form ⟨x, y⟩_Y = y^T Y x [369, chapter 1]. When this bilinear form is an inner product it can be used to define Krylov subspace methods with optimality properties, such as Y-CG or Y-MINRES [19, section 13]. Even for standard CG and MINRES the choice of preconditioner essentially fixes the inner product in which the method is defined [370]. By utilizing a different inner product, it may be possible to find a preconditioner P such that P^{-1}A is self-adjoint in the Y-inner product, even if A and/or P are not symmetric [371,372].
Combinations of preconditioner and inner product need not result only in symmetry to be advantageous. For example, when the symmetric part of A, namely Y = (A + A^T)/2, is positive definite, Y^{-1}A is the sum of the identity matrix and a matrix that is skew-adjoint with respect to the Y-inner product. In this case a short-term recurrence method can be applied [373,374].
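A small numpy sketch of this splitting, verifying that Y^{-1}A − I is skew-adjoint in the Y-inner product (equivalently, that Y(Y^{-1}A − I) = A − Y is skew-symmetric); the diagonally shifted random matrix simply guarantees a positive definite symmetric part:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
A = rng.standard_normal((n, n)) + n * np.eye(n)  # symmetric part positive definite
Y = (A + A.T) / 2                                # symmetric part of A
S = np.linalg.solve(Y, A) - np.eye(n)            # Y^{-1} A = I + S

# Skew-adjointness in the Y-inner product <x, y>_Y = y^T Y x
# is equivalent to Y S being skew-symmetric.
YS = Y @ S
assert np.allclose(YS, -YS.T)
```

Indeed YS = A − Y is exactly the skew-symmetric part of A, which is what the short-term recurrence methods of [373,374] exploit.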
• Ill-posed problems: For ill-posed problems Ax = b, the matrix A typically has singular values that decay to zero, and the right-hand side is affected by noise, that is, b = Ax_true + ε. For such problems, we would like to approximate x_true. However, the ill-conditioning of A means that the solution x of the perturbed problem may differ greatly from x_true, even when ||ε|| is small. Under certain conditions, iterative methods applied to ill-posed problems do approximate x_true at early stages, but thereafter the noise dominates, at which point the approximations to x_true deteriorate again. When this phenomenon, which is known as semi-convergence [375][376][377], occurs, the iterative method itself acts as a regularizer of the problem. A good preconditioner P will also make it easier to approximate x_true, but this does not mean that we should choose P ≈ A. Indeed, if P is a good representation of A it will almost certainly be ill-conditioned, and so its application may significantly amplify errors at early iterations. Instead, P should also regularize the problem somehow. Typically, one wants to damp the high-frequency components associated with small eigenvalues or singular values of A. A common option is to map these high-frequency components to one [378][379][380], but another option is to remove them completely via a singular preconditioner [381,382] or a projection [383]. We note that other types of regularization, for example Tikhonov, can also be enforced via preconditioning [384,385]. A different interpretation is to develop preconditioners for ill-posed problems using a Bayesian framework [386] (see the discussion on probabilistic solvers below). • Randomized preconditioning: An area of significant recent interest in the linear algebra community is that of randomized linear algebra. In particular, randomized preconditioners may be developed for least-squares problems of the form (11).
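Before surveying specific works, the basic "sketch-and-precondition" construction can be illustrated with a short numpy/scipy sketch: a Gaussian sketch of A is factored by QR, and the resulting R factor is used as a right preconditioner for LSQR. This is a simplified sketch in the spirit of the methods below (the cited works use structured random transforms for efficiency, and dimensions here are illustrative):

```python
import numpy as np
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(6)
m, n = 2000, 50
A = rng.standard_normal((m, n)) * np.logspace(0, 4, n)  # badly scaled columns
b = rng.standard_normal(m)

# Gaussian sketch with ~4n rows; the R factor of a QR decomposition of S A
# serves as a right preconditioner, so that A R^{-1} is well-conditioned.
s = 4 * n
S = rng.standard_normal((s, m)) / np.sqrt(s)
_, R = np.linalg.qr(S @ A)

# Solve min ||A R^{-1} y - b||_2 with LSQR, then recover x = R^{-1} y.
AR = A @ np.linalg.inv(R)  # formed densely here; apply R^{-1} implicitly in practice
y = lsqr(AR, b, atol=1e-12, btol=1e-12)[0]
x = np.linalg.solve(R, y)

x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x, x_ref, rtol=1e-5, atol=1e-8)
```

Because the sketched QR captures the column geometry of A with high probability, LSQR on the preconditioned problem converges in a small, condition-number-independent number of iterations.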
A key work is that of [387], where overdetermined problems are considered (that is, where A ∈ C^{m×n}, m ≥ n), although the authors' methods can be extended to underdetermined regression. Here the problem (11) is replaced by that of minimizing ||AP^{-1}y − b|| over vectors y, whereupon the solution is x = P^{-1}y. The solver incorporates a randomized transformation, with the preconditioner based on a permutation of the R factor from the QR decomposition of the transformed matrix A. Another important work is [388], where the randomized methods may be viewed as preconditioning the matrix A and vector b with a data-independent random matrix P. In [389], a high-precision preconditioned solver is presented, where the rows and columns of A are blended, at which point a random sampling of rows is taken. We also refer to least-squares solvers based on random normal projection in [390,391], with [391] deriving a parallel solver for strongly over- or underdetermined systems, and randomized preconditioners for overdetermined ℓ_p regression problems (which involve minimizing ||b − Ax||_p for 1 ≤ p ≤ ∞) applied within Stochastic Gradient Descent-like solvers in [392]. • Probabilistic linear solvers: A further concept of considerable recent interest has been a Bayesian interpretation of preconditioning. A key early work in this direction is [393], where preconditioners for ill-posed systems of the form Ax = b (where A is of ill-determined rank) are obtained from the covariance matrix of the solution, interpreted as a random variable. These were termed priorconditioners. Building on this, a statistical interpretation of left- and right-preconditioners was provided in [386], specifically connecting such preconditioners to covariance matrices of the noise and the prior, respectively, in certain settings. A priorconditioned CGLS method is applied to inverse problems from electrical impedance tomography in [394].
We also highlight research in [395,396] which provides novel derivations of CG methods, outputting probability measures which may be used to quantify uncertainty in the solution. In particular, [395] extends probabilistic interpretations of secant methods and mentions the potential for preconditioning the derived method, and [396] discusses in detail preconditioners for the authors' BayesCG method, which are thought of as priors. As outlined in [397], the conceptual difference is that in [395] a posterior is constructed on A^{-1}, whereas in [396] a posterior is constructed on the solution x. The work [397] unifies the two approaches presented in [395,396], provides an interpretation of left- and right-preconditioning, and extends the idea to the GMRES algorithm.
• Preconditioning and tensor methods: An important consideration when solving linear systems, for instance those arising from high-dimensional PDEs, is that of data storage on a computer, which can become prohibitive unless specialized schemes are devised in order to combat this. One area of interest is that of utilizing low-rank tensor decompositions alongside preconditioned solvers, for linear systems which possess some underlying tensor-product structure. Such techniques assume in general that the input data and solution vector are approximable within such a low-rank tensor format, for instance the hierarchical Tucker format (see [398,399]), the tensor train format (see [400]), and its extension to the quantized tensor train format (see [401,402]). Additionally, for linear systems with Kronecker product structure, for example arising from certain high-dimensional PDEs, it is also possible to use Krylov subspace methods designed for tensors or matrix equations [403,404]. Among many valuable works on tensor methods for the numerical solution of differential equations, coupled with preconditioners, we refer to [405][406][407][408] (see also [403] for the construction of Krylov subspace methods). Here, low-rank variants of iterative schemes are derived, where the iterates are replaced by low-rank tensors, and the potency of these schemes depends on the derivation of effective preconditioners. Additionally, recent research has been undertaken on tensor-based schemes for PDE-constrained optimization problems [409,410], as well as FDE-constrained optimization problems [411,412]. • High-performance computing: Some problems are so large and complex that they necessitate HPC and heterogeneous architectures. This brings new challenges for iterative methods: memory is usually limited and communication, particularly between nodes and processors, is slow relative to the speed of floating point operations. 
Since sparse matrix-vector products and vector operations are typically communication-bound, that is, their speed is limited by the speed of moving data, it is not straightforward to improve the efficiency of Krylov methods in HPC environments [413][414][415]. On the other hand, since Krylov methods rely on a few simple operations, there are opportunities to utilize, for example, graphics processing units (GPUs) to improve efficiency [416]. Although preconditioned iterative methods for HPC applications should also aim to solve the linear system as quickly as possible, rather than reducing the number of floating point operations the aim is typically to exploit parallelism and reduce communication costs. These goals can be achieved by improving data locality, thereby enabling more operations to be performed on data before communication is required, and/or by asynchronous communication and careful task scheduling, for example by using directed acyclic graphs to map data dependencies [417]. Unfortunately, some of the best preconditioners for serial computations, for example multigrid, are naturally communication-bound as a result of coarse grid computations and grid transfer operations. Consequently, much work has gone into making these methods more suitable for HPC [418,419]. On the other hand, other methods including domain decomposition [89,90,[92][93][94][95] and sparse approximate inverse preconditioners [420,421] have long been associated with HPC because they traditionally require relatively little communication and are highly parallelizable. However, perhaps counterintuitively, newer versions of these algorithms may increase communication and/or reduce parallelism to improve overall efficiency on cutting-edge computer architectures. For example, domain decomposition methods may also include a multilevel component that induces global coupling [422], while SPAI preconditioners may, for example, increase sequential computation to improve efficiency [423].
Additionally, it may be advantageous to tailor kernels to GPUs (see, e.g., [424]).
Though the construction and application of incomplete factorization preconditioners can be problematic on parallel architectures, performance can be improved by, for example, using an asynchronous iterative method to form the factors [425] and exploiting techniques from sparse approximate inverses [426]. Other preconditioners have been developed with extreme-scale computing in mind; these may require more computation than alternatives, and hence are slower for serial or moderately parallel computations, but are better able to utilize many-core architectures [242,427].

CONCLUSIONS
In this article we have given a short overview of the multitude of applications of preconditioning within scientific fields.
The key goal of preconditioned iterative solvers is to ensure the rapid and robust solution of linear systems of equations, particularly in cases where this cannot be achieved by a direct method or an unpreconditioned iterative solver; however, in a number of cases one may have additional objectives in mind, as in Section 5. Research has ranged from the design of preconditioners for matrices about which a number of specific properties are known, particularly in PDE applications (as in Section 3), to more general black-box solvers, especially in the study of optimization problems (as in Section 4). Of course, owing to our aim of providing a brief, overarching view of the vast field of preconditioning, we have omitted discussion of a number of other application areas, for example network problems and new developments in data science. There does not yet exist a "universal" preconditioner; indeed, well-constructed preconditioners tend to capture the properties arising from the particular application being considered, sometimes including the physics of the process itself. By employing this philosophy, many researchers have made significant contributions to the study of scientific processes through tackling the resulting linear systems.