Milestones in the Development of Iterative Solution Methods

Iterative methods for solving linear systems of equations were originally formulated as basic iteration methods of defect-correction type, commonly referred to as Richardson's iteration method. These developed further into various versions of splitting methods, including the successive overrelaxation (SOR) method. Later, immensely important developments included convergence acceleration methods, such as the Chebyshev and conjugate gradient iteration methods, and preconditioning methods of various forms. A major aim has been to find methods with a total computational complexity of optimal order, that is, proportional to the number of degrees of freedom involved in the equation. Methods that have turned out to be particularly important for the further development of linear equation solvers are surveyed. Some of them are presented in greater detail.


Introduction
In many applications of quite different types, appearing in various sciences, engineering, and finance, large-scale linear algebraic systems of equations arise. A particular type of problem appears in signal processing. This also includes nonlinear systems of equations, which are normally solved by linearization at each outer nonlinear iteration step, but they will not be further discussed in this paper.
Due to their high demands on computer memory and computing time, which can grow rapidly with increasing problem size, direct solution methods, such as Gaussian elimination, are in general not feasible unless the size of the problem is relatively small. In the early computer age, when the available size of computer central memories was very small and the speed of arithmetic operations slow, this was found to be the case even for quite modest-sized problems.
Even for modern computers with exceedingly large memories and very fast arithmetic, it is still an issue because nowadays one wants to solve much more involved problems of much larger sizes, for instance to enable a sufficient resolution of partial differential equation problems with highly varying (material) coefficients, such as are found in heterogeneous media. Presently, problems with up to billions of degrees of freedom (d.o.f.) are solved. For instance, if an elliptic equation of elasticity type is discretized and solved on a 3D mesh with 512 meshpoints in each coordinate direction, then an algebraic system of that size (512^3 ≈ 1.3 × 10^8 meshpoints) arises.
A basic iteration method to solve a linear system Ax = b, where A is nonsingular, has the following form. Given an initial approximation x_0, for k = 0, 1, . . . until convergence, let r_k = Ax_k − b, e_k = −τ r_k, x_{k+1} = x_k + e_k. Here, τ > 0 is a parameter to be chosen. This method can be described either as a defect (r_k) correction (e_k) method or, alternatively, as a method to compute the stationary solution of the evolution equation dx/dt = −(Ax − b) by timestepping with time-step τ. Such methods are commonly referred to as Richardson iteration methods (e.g., see [1][2][3][4]). However, already in 1823 Gauss [5] wrote, "Fast jeden Abend mache ich eine neue Auflage des Tableau, wo immer leicht nachzuhelfen ist. Bei der Einförmigkeit des Messungsgeschäfts gibt dies immer eine angenehme Unterhaltung; man sieht daran auch immer gleich, ob etwas Zweifelhaftes eingeschlichen ist, was noch wünschenswert bleibt usw. Ich empfehle Ihnen diesen Modus zur Nachahmung. Schwerlich werden Sie je wieder direct eliminieren, wenigstens nicht, wenn Sie mehr als zwei Unbekannte haben. Das indirecte Verfahren läßt sich halb im Schlafe ausführen oder man kann während desselben an andere Dingen denken." (Freely translated, "Almost every evening I make a new edition of the tableau, wherever it can easily be improved. Given the monotony of the surveying business, this always makes a pleasant diversion; one also sees immediately whether anything doubtful has crept in, what still remains to be improved, and so on. I recommend this modus operandi. You will hardly eliminate directly anymore, at least not when you have more than two unknowns. The indirect method can be pursued while half asleep or while thinking about other things.") It holds that e_{k+1} = (I − τA)e_k or, equivalently, r_{k+1} = (I − τA)r_k, where e_k = x − x_k is the iteration error and x is the solution of (1).
Hence, e_k = (I − τA)^k e_0, k = 0, 1, . . . . (6) For convergence of the method, that is, e_k → 0, the parameter τ must in general be chosen such that ρ := ‖I − τA‖ < 1, where ‖·‖ is a matrix norm, subordinate to the chosen vector norm. (We remark here that this is not possible if A is indefinite.) Let ρ(·) denote the spectral radius of a matrix, that is, the maximal absolute value of the eigenvalues of the matrix.
If A is self-adjoint, then it can be shown that ρ(A) = ‖A‖_2 = ρ(A*A)^{1/2}, where ‖·‖_2 denotes the matrix norm subordinate to the Euclidean vector norm. For general, nonsymmetric matrices it has been shown (e.g., see [6, page 162]) that there exist matrix norms that are arbitrarily close to the spectral radius. These can, however, correspond to an unnatural scaling of the matrix.
The rate of convergence is determined by the convergence factor ρ. For symmetric positive definite matrices, the optimal value of τ to minimize ρ is τ = 2/(λ_1 + λ_n), where λ_1, λ_n are the extreme eigenvalues of A. Normally, however, the eigenvalues are not available.
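As an illustration, the basic defect-correction iteration with the optimal parameter can be sketched as follows. The small SPD test matrix is made up for the example; its eigenvalues are 1 and 3, so the optimal parameter is τ = 2/(1 + 3) = 0.5.

```python
# A minimal sketch of the basic (Richardson) defect-correction iteration
#   r_k = A x_k - b,   x_{k+1} = x_k - tau * r_k,
# with tau = 2/(lambda_1 + lambda_n). The SPD test matrix is invented for
# the example; its eigenvalues are 1 and 3, hence the optimal tau is 0.5.

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def richardson(A, b, tau, tol=1e-10, maxit=10_000):
    x = [0.0] * len(b)
    for k in range(maxit):
        r = [axi - bi for axi, bi in zip(matvec(A, x), b)]  # defect r_k
        if max(map(abs, r)) < tol:
            return x, k
        x = [xi - tau * ri for xi, ri in zip(x, r)]          # correction step
    return x, maxit

A = [[2.0, 1.0], [1.0, 2.0]]
b = [3.0, 3.0]                       # exact solution x = [1, 1]
x, its = richardson(A, b, tau=0.5)   # tau = 2/(1 + 3)
```

With this τ the convergence factor is ρ = 0.5, so a few dozen iterations suffice for ten digits.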
As an example, for second-order elliptic diffusion-type problems in Ω ⊂ R^d (d = 2, 3), using a standard central difference or a finite element method, the spectral condition number is λ_n/λ_1 = O(h^{−2}), where h is the (constant) meshsize parameter. Hence, the number of iterations to reach a relative accuracy ε is of order O(h^{−2}|log ε|), h → 0.
Since each iteration uses O(h^{−d}) elementary arithmetic operations, this shows that the total number of operations needed to reduce the error to a given tolerance is of order O(h^{−d−2}). This is in general smaller than for a direct solution method when d ≥ 2, but still far from the optimal order, O(h^{−d}), that we aim at.
To improve on this, often a splitting of the matrix A is used. It is readily shown that for any initial vector, the number of iterations required to get a relative residual, ‖r_k‖/‖r_0‖ < ε, for some ε, 0 < ε < 1, is at most k_it = ⌊ln(1/ε)/ln(1/ρ)⌋ + 1, where ⌊·⌋ denotes the integer part. Frequently, ρ = 1 − cδ^r, where c is a constant, r is a positive integer, often r = 2, and δ is a small number, typically δ = 1/n, which decreases with increasing problem size n. This implies that the number of iterations is proportional to (1/δ)^r, a number which increases rapidly when δ → 0.
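The count k_it is easy to tabulate; the short sketch below (the values of ρ are chosen only for illustration) shows how it blows up as ρ approaches 1.

```python
import math

# Iteration count k_it = floor( ln(1/eps) / ln(1/rho) ) + 1 for a given
# convergence factor rho and tolerance eps; the rho values are illustrative.

def iterations_needed(rho, eps):
    return math.floor(math.log(1.0 / eps) / math.log(1.0 / rho)) + 1

print(iterations_needed(0.9, 1e-6))    # rho = 1 - 0.1  -> 132 iterations
print(iterations_needed(0.99, 1e-6))   # rho = 1 - 0.01 -> 1375 iterations
```

Shrinking δ by a factor of 10 (here with r = 1) inflates the count by roughly the same factor, as the formula predicts.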
Let B = C^{−1}R. If ‖B‖ is known and ‖B‖ < 1, we can use the following theorem to get a test when the iteration error is small enough, that is, when to stop the iterations.

Theorem 1. Let ‖B‖ < 1, B = C^{−1}R, and let x_m be defined by (7). Then,
‖x − x_m‖ ≤ (‖B‖/(1 − ‖B‖)) ‖x_m − x_{m−1}‖. (8)

Proof. From (7) follows x_{m+1} − x_m = B(x_m − x_{m−1}) and, by recursion,
x_{m+k+1} − x_{m+k} = B^{k+1}(x_m − x_{m−1}), k = 0, 1, . . . . (9)
Note now that x_{m+p} − x_m = Σ_{k=0}^{p−1}(x_{m+k+1} − x_{m+k}). Hence, by the triangle inequality and (9),
‖x_{m+p} − x_m‖ ≤ (‖B‖(1 − ‖B‖^p)/(1 − ‖B‖)) ‖x_m − x_{m−1}‖.
Letting p → ∞ and noting that x_{m+p} → x, (8) follows.
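A numerical sanity check of the bound in Theorem 1 can be sketched as follows; the contraction B (with ‖B‖_∞ = 0.6) and the vector c are made up for the illustration.

```python
# Illustrative check of the stopping bound of Theorem 1,
#   ||x - x_m|| <= ||B|| / (1 - ||B||) * ||x_m - x_{m-1}||,
# for the fixed-point iteration x_{m+1} = B x_m + c. B and c are invented.

B = [[0.5, 0.1], [0.0, 0.4]]
c = [1.0, 1.0]
normB = max(sum(abs(v) for v in row) for row in B)   # infinity norm = 0.6

def step(x):
    return [sum(bij * xj for bij, xj in zip(row, x)) + ci
            for row, ci in zip(B, c)]

# Fixed point of this upper triangular 2x2 case, solved directly:
x2 = c[1] / (1.0 - B[1][1])
x1 = (c[0] + B[0][1] * x2) / (1.0 - B[0][0])
x_exact = [x1, x2]

x_prev = [0.0, 0.0]
x_cur = step(x_prev)
for _ in range(10):
    err = max(abs(u - v) for u, v in zip(x_exact, x_cur))
    diff = max(abs(u - v) for u, v in zip(x_cur, x_prev))
    assert err <= normB / (1.0 - normB) * diff + 1e-12   # Theorem 1 holds
    x_prev, x_cur = x_cur, step(x_cur)
```

The bound is checked at every step; since ‖B‖/(1 − ‖B‖) = 1.5 here, the observable difference of successive iterates controls the unknown error.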
The basic iteration method (4), or the splitting methods in Section 2, can be improved in various ways. This will be the major topic of this paper.
Note first that application of the splitting in (7) requires in general that the matrix R is given in explicit form, which can make the method less viable.
The most natural way to improve (4) is to introduce an approximation C of A, to be used when the correction e_k in (4) is computed. The relation e_k = −τ r_k is then replaced by Ce_k = −τ r_k. Such a matrix is mostly called a preconditioner since, by a proper choice, it can significantly improve the spectral condition number, now that of the preconditioned matrix, κ(C^{−1}A) = λ_max(C^{−1}A)/λ_min(C^{−1}A). Clearly, in practice, the matrix C must be chosen such that linear systems with C can be solved with relatively little expense compared to a solution method for A; in particular, the expense for solving with C must be much smaller than that for solving with A using a direct solution method.
For badly scaled matrices A, a simple, but often practically useful, choice of C is the (block) diagonal part D of A. Much more efficient choices will be discussed later in the paper.
Early suggestions to use such a matrix C can be found in papers by D'Yakonov [7] and Gunn [8].For an automatic scaling procedure, see [9] and references therein.
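The preconditioned defect-correction step Ce_k = −τ r_k with the simple choice C = D can be sketched as follows; the badly scaled SPD matrix is invented for the illustration.

```python
# Preconditioned basic iteration: solve C e_k = -tau r_k with C = diag(A).
# The badly scaled SPD test matrix is made up for the example; diagonal
# scaling brings the spectrum of C^{-1}A close to 1.

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def diag_preconditioned(A, b, tau=1.0, tol=1e-10, maxit=10_000):
    d = [A[i][i] for i in range(len(b))]     # C = diag(A)
    x = [0.0] * len(b)
    for k in range(maxit):
        r = [axi - bi for axi, bi in zip(matvec(A, x), b)]
        if max(map(abs, r)) < tol:
            return x, k
        x = [xi - tau * ri / di for xi, ri, di in zip(x, r, d)]
    return x, maxit

A = [[100.0, 1.0], [1.0, 0.1]]   # kappa(A) ~ 10^3, kappa(C^{-1}A) ~ 1.9
b = [101.0, 1.1]                 # exact solution [1, 1]
x, its = diag_preconditioned(A, b)
```

Without the diagonal scaling, τ would have to be tiny to keep ρ < 1 and convergence would be extremely slow; with it, the convergence factor is about 0.32.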
In the present paper, we survey various choices of C which have proven to be useful in practice. The paper attempts to give a more personal account of the development of iterative solution methods. It is not our ambition to present the state of the art but rather to describe the unfolding of the field.
In the remainder of the paper, we discuss, in order, methods based on splitting of the given matrix, the accelerated iterative methods of the Chebyshev and (generalized) conjugate gradient types, pointwise and block incomplete factorization preconditioning methods, symmetrized preconditioners of SSOR and ADI type, approximate inverses, and augmented subspace preconditioning methods. If space had allowed it, this would have been followed by a presentation of geometric and algebraic multigrid methods, two-level and multilevel methods, elementwise preconditioning methods, and domain decomposition methods. Also, iteration error estimates and the influence of rounding errors, and preconditioners for matrices of saddle point type would have been included. The paper ends with some concluding remarks.
The following notations will be used: if A = [a_ij], then A^T = [a_ji] denotes the transpose of A, while A* = [ā_ji] denotes the Hermitian transpose.

Splitting Methods
A comprehensive, early presentation of splitting methods, and much more on iterative solution methods, is found in Varga [10]. Often there is a natural splitting of the given matrix, A = C − R, where C is nonsingular. This can be used to formulate an iterative solution method in the form Cx_{k+1} = Rx_k + b, k = 0, 1, . . . . (13) This method converges if ρ(C^{−1}R) < 1.

Definition 1. Let A = C − R, where C is nonsingular, and let inequalities between matrices be understood elementwise. Then
(a) the matrix C is said to be monotone if C is nonsingular and C^{−1} ≥ 0;
(b) A = C − R is called a regular splitting if C is monotone and R ≥ 0;
(c) a weak regular splitting [11], if C is monotone and C^{−1}R ≥ 0;
(d) a nonnegative splitting [12], if C is nonsingular and C^{−1}R ≥ 0.
The following holds; see, for example, [6].
A splitting method that became popular in the fifties is the SOR method. Here, A = D − L − U, where D is the (block) diagonal and L, U are the (block) lower and upper triangular parts of A, respectively. The successive relaxation method takes the form
(D − ωL)x_{k+1} = (ωU + (1 − ω)D)x_k + ωb, k = 0, 1, . . . ,
where ω ≠ 0 is a parameter, called the relaxation parameter. For ω = 1 one gets the familiar Gauss-Seidel method (Gauss 1823 [5] and Seidel 1874 [13]) and for ω > 1 the successive overrelaxation (SOR) method (Frankel 1950 [14] and Young 1950 [15]).
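A minimal SOR sweep can be sketched as follows; the 1D Poisson-type test matrix tridiag(−1, 2, −1) is a standard example for which the classical optimal parameter is known in closed form.

```python
import math

# A minimal SOR sweep for A = D - L - U; omega = 1 gives Gauss-Seidel.
# For tridiag(-1, 2, -1), omega_opt = 2/(1 + sin(pi/(n+1))) is the
# classical optimal relaxation parameter.

def sor_solve(A, b, omega, tol=1e-10, maxit=10_000):
    n = len(b)
    x = [0.0] * n
    for k in range(maxit):
        diff = 0.0
        for i in range(n):
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new = (1.0 - omega) * x[i] + omega * (b[i] - sigma) / A[i][i]
            diff = max(diff, abs(x_new - x[i]))
            x[i] = x_new
        if diff < tol:
            return x, k
    return x, maxit

n = 20
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
b = [sum(row) for row in A]                   # exact solution: all ones
omega_opt = 2.0 / (1.0 + math.sin(math.pi / (n + 1)))
x_sor, its_sor = sor_solve(A, b, omega_opt)
x_gs, its_gs = sor_solve(A, b, 1.0)           # Gauss-Seidel for comparison
```

For this problem the optimally relaxed sweep needs roughly an order of magnitude fewer iterations than Gauss-Seidel, illustrating how critically the efficiency depends on ω.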
An optimal value of ω can be determined as follows. Assume that A has property (A_π), that is, there exists a permutation matrix P such that PAP^T is a block tridiagonal matrix. The following lemma holds.
Lemma 1 (see [15]). Assume that A has property (A_π), let μ be an eigenvalue of the (block) Jacobi iteration matrix B = D^{−1}(L + U), and let λ satisfy (λ + ω − 1)^2 = λω^2 μ^2. Then, λ is an eigenvalue of L_ω, the iteration matrix of the SOR method.
Proof. For a short proof, see [6].

Theorem 3. Let A have property (A_π) and let the Jacobi iteration matrix B have real eigenvalues. Then, the SOR method converges for any initial vector if and only if ρ(B) < 1 and 0 < ω < 2. Further, the optimal value of the relaxation parameter is ω_opt = 2/(1 + √(1 − ρ(B)^2)), for which the asymptotic convergence factor is given as ρ(L_{ω_opt}) = ω_opt − 1.

Proof. For a short proof, see [6].
The eigenvalues of C^{−1}A are in general complex, and for ω = ω_opt it can be shown that they are distributed around a circle in the complex plane. This implies that the method cannot be polynomially accelerated. (See Section 3 for a presentation of polynomial acceleration methods.) Further, the efficiency of the SOR method turns out to be critically dependent on the choice of ω.
A result similar to that in Theorem 3 has been shown in [6], see also [17]; it holds even if A does not have property (A_π), but is Hermitian.

Theorem 4. Let A be Hermitian and positive definite, write L̄ = D^{−1/2}LD^{−1/2}, and let 0 < ω < 2. Then the SOR method converges; the value of ω that minimizes the upper bound of ρ(L_ω) and the corresponding bound on the convergence factor can be expressed in terms of ρ(L̄). For a proof, see [6].
In Section 4, we present a symmetric version of the SOR method where acceleration is possible.

Accelerated Iterative Methods
In this section, the important Chebyshev and conjugate gradient iteration methods are presented.
Let e_k = x − x_k, the iteration error. Then, it follows from (25) that e_{k+1} = (I − τ_k C^{−1}A)e_k; hence e_m = P_m(C^{−1}A)e_0, where P_m is a polynomial of degree m having zeros at 1/τ_k, k = 0, 1, . . . , m − 1, and satisfying P_m(0) = 1.
We want to choose the parameters {τ_k} such that ‖e_m‖ is minimized. However, this would mean that in general the parameters would depend on e_0, which is not known. Also, the eigenvalues of C^{−1}A are not known. We then take the approach of minimizing ‖e_m‖/‖e_0‖ for all e_0; that is, we want to minimize ‖P_m(C^{−1}A)‖.

The Chebyshev Iterative Method.
In case the eigenvalues of C^{−1}A are real and positive and if a positive lower bound a and an upper bound b of the spectrum are known, then we see that {τ_k} should be chosen such that max_{a≤λ≤b} |P_m(λ)| is minimized over all P_m ∈ π_m^0, that is, over the set of polynomials of degree m satisfying P_m(0) = 1.
The solution to this min-max problem is well known: the optimal polynomial is the shifted and normalized Chebyshev polynomial P_m(λ) = T_m((b + a − 2λ)/(b − a))/T_m((b + a)/(b − a)), and the parameters are given by 1/τ_k = (b + a)/2 + ((b − a)/2)cos((2k − 1)π/(2m)), k = 1, . . . , m, which are the zeros of the polynomial. The corresponding method is referred to as the Chebyshev (one-step) acceleration method, see, for example, [10,18]. It is an easy matter to show that max_{a≤λ≤b} |P_m(λ)| = 1/T_m((b + a)/(b − a)) ≤ 2ρ^m, where ρ = (1 − √(a/b))/(1 + √(a/b)). This implies that if the number of iterations satisfies m ≥ ln(2/ε)/ln(ρ^{−1}), that is, in particular if m ≥ (1/2)√(b/a) ln(2/ε), (29) then ‖e_m‖/‖e_0‖ ≤ ε.
The disadvantage with this method is that to make it effective one needs accurate estimates of a and b, and we need to determine m beforehand (which, however, can be done by (29)). The method cannot utilize any special distribution of the eigenvalues in the spectrum (as opposed to the conjugate gradient method, see below). More important, however, is that this method is actually numerically unstable (similarly to an explicit time-stepping method for initial value problems when the time steps are too large). This is due to the fact that ‖I − τ_k C^{−1}A‖ is much larger than unity for several of the values τ_k. However, one may prove that with some particular permutation of the parameters, their instability effect can be avoided.
There is an alternative to the choice (25): there exists a three-term form of the Chebyshev acceleration method, in which each new iterate x_{k+1} is formed from x_k, x_{k−1}, and the current residual, with parameters defined recursively in terms of the ratio (b + a)/(b − a). Hence, we do not have to determine the number of steps beforehand. More importantly, it has been shown in [18] that this method is numerically stable. (For some related remarks, see [6].) A similar form of the method was proposed a long time ago; see Golub and Varga [19] and the references cited therein.
It is interesting to note that the parameters approach stationary values as k → ∞; the stationary value is recognized as the parameter ω_opt of the optimal SOR method (see Section 2). Young [20] has proven that the asymptotic rate of convergence is retained even if one uses the stationary values throughout the iterations.
For the case of complex eigenvalues of C^{−1}A with positive real parts and contained in an ellipse, one may choose parameters similarly. See [6,21,22] for details. For comments on the optimality of the method, see [23]. For application of the method to nonsymmetric problems, see [6,24].
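The three-term Chebyshev iteration can be sketched as follows. The particular parameter recursion below follows a standard textbook formulation (not necessarily the exact form referenced above); the spectrum is assumed contained in [a, b] with a > 0, and the tridiag(−1, 2, −1) test matrix with its exact extreme eigenvalues is used for illustration.

```python
import math

# One common organization of the three-term Chebyshev iteration for a
# spectrum in [a, b], a > 0; the parameter recursion is a standard
# textbook formulation, assumed here for illustration.

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def chebyshev(A, rhs, a, b, tol=1e-10, maxit=500):
    theta = 0.5 * (b + a)            # center of the interval
    delta = 0.5 * (b - a)            # half-length of the interval
    sigma1 = theta / delta
    x = [0.0] * len(rhs)
    r = [bi - axi for bi, axi in zip(rhs, matvec(A, x))]
    d = [ri / theta for ri in r]
    rho_prev = 1.0 / sigma1
    for k in range(maxit):
        x = [xi + di for xi, di in zip(x, d)]
        r = [bi - axi for bi, axi in zip(rhs, matvec(A, x))]
        if max(map(abs, r)) < tol:
            return x, k + 1
        rho = 1.0 / (2.0 * sigma1 - rho_prev)
        d = [rho * rho_prev * di + (2.0 * rho / delta) * ri
             for di, ri in zip(d, r)]
        rho_prev = rho
    return x, maxit

n = 10
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
rhs = [sum(row) for row in A]                        # exact solution: ones
a_min = 2.0 * (1.0 - math.cos(math.pi / (n + 1)))    # exact extreme
b_max = 2.0 * (1.0 + math.cos(math.pi / (n + 1)))    # eigenvalues
x, its = chebyshev(A, rhs, a_min, b_max)
```

No iteration count has to be fixed beforehand: the recursion generates the Chebyshev-optimal parameters on the fly.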
Perhaps the main thrust during the 1970s has been in using the conjugate gradient method as an acceleration method. Already much has been written on the subject; we refer to [25][26][27] for a historical account, to [18,[28][29][30][31][32][33] for an exposition of the preconditioned conjugate gradient (PCG) method, and to [18,34] for a survey of generalized and truncated gradient methods for nonsymmetric and indefinite matrix problems.
The advantage with conjugate gradient methods is that they are self-adaptive; the optimal parameters are calculated by the algorithm so that the error in the energy norm, ‖e_l‖_A = {(e_l)^T A e_l}^{1/2}, is minimized. This applies to a problem where C and A are symmetric and positive definite (SPD) or, more generally, if C^{−1}A is similar to an SPD matrix. Hence, there is no need to know any bounds for the spectrum. Since the method converges at least as fast as the Chebyshev method, the iteration bound (29) holds for it as well. We describe now the conjugate gradient method, following the presentations in [29,35].

The Preconditioned Conjugate Gradient Method.
During the past 40 years or so, the preconditioned conjugate gradient method has become the major iterative solution method for linear systems of algebraic equations, in particular those arising in science and engineering. The author of these notes became interested in the method at the beginning of the 1970s (cf. [18]).
The conjugate gradient algorithm to solve a system of linear equations Ax = b, where A (n × n) is symmetric and positive definite, was originally introduced by Hestenes and Stiefel [25] in 1950. Before we discuss the properties of the CG method, we describe its implementation: the algorithm consists of the steps in Algorithm 1.
What one sees at a first glance is that the CG algorithm is quite simple. Each iteration consists of one matrix-vector multiplication, two vector updates, and two scalar products. Apart from the initial guess x^{(0)} (which can be taken to be the zero vector) and the stopping tolerance, there are no other method parameters to be determined or tuned by the user. Thus, the method is easily programmable, cheap in terms of arithmetic operations, and performs as a black box.
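Since Algorithm 1 itself is not reproduced here, the following is a sketch of the standard CG steps as just described: per iteration one matrix-vector multiplication, two vector updates, and two scalar products.

```python
# A sketch of the standard CG steps (one matvec, two vector updates, two
# scalar products per iteration); the test matrix is tridiag(-1, 2, -1).

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def cg(A, b, tol=1e-10, maxit=1000):
    x = [0.0] * len(b)                  # x(0) = 0 is an admissible guess
    g = [axi - bi for axi, bi in zip(matvec(A, x), b)]   # gradient/residual
    d = [-gi for gi in g]               # first search direction
    gg = dot(g, g)
    for k in range(maxit):
        if gg ** 0.5 < tol:
            return x, k
        Ad = matvec(A, d)
        tau = gg / dot(d, Ad)           # step length minimizing f along d
        x = [xi + tau * di for xi, di in zip(x, d)]
        g = [gi + tau * adi for gi, adi in zip(g, Ad)]
        gg_new = dot(g, g)
        d = [-gi + (gg_new / gg) * di for gi, di in zip(g, d)]
        gg = gg_new
    return x, maxit

n = 10
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
b = [sum(row) for row in A]             # exact solution: all ones
x, its = cg(A, b)
```

Note the finite termination: for this 10 × 10 system the method reaches the stopping tolerance in no more than about n iterations.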
For some problems the standard (unpreconditioned) CG method performs impressively well and this can be explained by some particular properties of this powerful algorithm.
The CG method is best described as a method to minimize the quadratic functional f(x) = (1/2)x^T Ax − b^T x over a set of vectors. If A is nonsingular, then we can rewrite f in the form f(x) = (1/2)(x − A^{−1}b)^T A(x − A^{−1}b) − (1/2)b^T A^{−1}b, so minimizing the quadratic functional is equivalent to solving the system Ax = b. If A is singular and A^{−1} in (35) is replaced by a generalized inverse of A, then the above equivalence still holds if the minimization takes place on a subspace in the orthogonal complement to the null-space of A.
Given an initial approximation x^{(0)} and the corresponding residual r^{(0)} = Ax^{(0)} − b, the minimization in the conjugate gradient method takes place successively on a subspace of growing dimension. This subspace is referred to as the Krylov set.
In the derivation of the algorithm, the next approximate solution is constructed as x^{(k+1)} = x^{(k)} + τ_k d^{(k)}, where the step length τ_k is chosen so as to minimize the function f along the search direction d^{(k)}. Thereby, the gradient of f at x^{(k+1)} is made orthogonal to the search direction d^{(k)}. As in Fourier-type minimization methods, it turns out to be efficient to work with A-orthogonal search directions d^{(k)} which, since A is symmetric, can be determined from a three-term recursion. This recursive choice of search directions is done so that at each step the solution has the smallest error in the A-norm, ‖x − x^{(k)}‖_A, where e^{(k)} = x − x^{(k)} is the iteration error. As mentioned, the minimization takes place over the set of (Krylov) vectors K_k and, as is readily seen,
K_k = span{x^{(1)} − x^{(0)}, x^{(2)} − x^{(0)}, . . . , x^{(k)} − x^{(0)}} = span{g^{(0)}, g^{(1)}, . . . , g^{(k−1)}} = span{d^{(0)}, d^{(1)}, . . . , d^{(k−1)}}. (42)
To summarize, the CG method possesses the following remarkable properties.

Theorem 5. Let the CG Algorithm 1 be applied to a symmetric positive definite matrix A. Then, in exact arithmetic, the following properties hold:
(1) the iteratively constructed residuals g are mutually orthogonal, that is, g^{(k)T} g^{(j)} = 0, j < k;
(2) the search directions d are A-orthogonal (or conjugate), that is, d^{(k)T} A d^{(j)} = 0, j < k;
(3) as long as the method has not converged, that is, g^{(k)} ≠ 0, the algorithm proceeds with no breakdown and (42) holds;
(4) as long as the method has not converged, the newly constructed approximation x^{(k)} is the unique point in x^{(0)} + K_k that minimizes ‖e^{(k)}‖_A = ‖x − x^{(k)}‖_A;
(5) the convergence is monotone in the A-norm, that is, ‖e^{(k)}‖_A < ‖e^{(k−1)}‖_A, and e^{(m)} = 0 will be achieved for some m ≤ n.
For a proof of the above theorem, consult, for instance, [29] or [6].
Since the method is optimal, that is, it gives the smallest error on a subspace of growing dimension, it terminates with the exact solution (ignoring round-off errors) in at most n steps (the dimension of the whole vector space, x ∈ R^n). In fact, it can readily be seen that the CG algorithm terminates after m steps, where m is the degree of the minimal polynomial Q_m of A with respect to the initial residual vector; in other words, Q_m has the smallest degree of all polynomials Q for which Q(A)r^{(0)} = 0. Therefore, the CG method can be viewed also as a direct solution method. However, in practice we want convergence to occur to an acceptable accuracy in much fewer steps than n or m. Thus, we use CG as an iterative method.
When one experiments with CG to solve systems with various matrices one observes some phenomena which need special attention.This can be illustrated by a simple example.
Consider the solution of Ax = b by the standard conjugate gradient method. The exact solution is x = [1, 1, . . ., 1]^T. Starting with x^{(0)} = [0, 0, . . ., 0]^T, one finds that after k iterations, 1 ≤ k ≤ n − 1, only the first k components of x^{(k)} have changed, and x^{(n)} = x. Hence, the information travels one step at a time from left to right and it takes n steps before the last component has changed at all. The algorithm converges exactly in n steps and terminates due to the finite termination property of the method.
Another detail one observes is that the norm of the error, ‖x − x^{(k)}‖, can be much larger than the norm of the iteratively computed residual.
These examples illustrate the fact that although the method has an optimal order of convergence rate in the energy norm, its actual convergence rate in the spectral norm can be different and depends both on the distribution of the eigenvalues of the (preconditioned) system matrix and on the initial approximation (or residual). (For comparison, we note that the rate of convergence for steepest descent depends only on the ratio of the extremal eigenvalues of A.) Faster convergence for the CG method is expected when the eigenvalues are clustered.
One way to get a better eigenvalue distribution is to precondition A by a proper preconditioner B. Hence, in order to achieve a better eigenvalue distribution, it is crucial in practice to use some form of preconditioning, that is, a matrix B which approximates A in some sense, with which it is relatively cheap to solve systems, and for which the spectrum of B^{−1}A is more favorable for the convergence of the CG method. As it turns out, if B is symmetric and positive definite, the corresponding preconditioned version, the PCG method, is best derived by replacing the inner product with (u, v) = u^T Bv. It takes the form shown in Algorithm 2.
Here, [B]^{−1} denotes the action of B^{−1}; that is, one does not multiply with the inverse matrix B^{−1}, but normally solves a linear system with the matrix B.
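Since Algorithm 2 itself is not reproduced here, the following is a sketch of the PCG steps with [B]^{−1} realized by a diagonal (Jacobi) solve; any SPD preconditioner that is cheap to solve with could take its place, and the badly scaled test matrix is made up for the example.

```python
# A sketch of PCG with the B-inner product; [B]^{-1} is realized here by a
# diagonal (Jacobi) solve. The badly scaled SPD test matrix is invented.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def pcg(A, b, apply_Binv, tol=1e-10, maxit=1000):
    x = [0.0] * len(b)
    g = [axi - bi for axi, bi in zip(matvec(A, x), b)]
    h = apply_Binv(g)                    # solve B h = g; never form B^{-1}
    d = [-hi for hi in h]
    gh = dot(g, h)
    for k in range(maxit):
        if dot(g, g) ** 0.5 < tol:
            return x, k
        Ad = matvec(A, d)
        tau = gh / dot(d, Ad)
        x = [xi + tau * di for xi, di in zip(x, d)]
        g = [gi + tau * adi for gi, adi in zip(g, Ad)]
        h = apply_Binv(g)
        gh_new = dot(g, h)
        d = [-hi + (gh_new / gh) * di for hi, di in zip(h, d)]
        gh = gh_new
    return x, maxit

A = [[100.0, 1.0], [1.0, 0.1]]
b = [101.0, 1.1]                         # exact solution [1, 1]
x, its = pcg(A, b, lambda r: [r[i] / A[i][i] for i in range(len(r))])
```

Structurally, the only change from the unpreconditioned algorithm is the extra solve with B and the use of the B-inner product in the coefficients.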
In order to understand what is wanted of a good preconditioning matrix, we first discuss some issues of major importance related to the rate of convergence of the CG method. Thereby it becomes clear that the standard spectral condition number is often too simple to explain the detailed convergence behaviour. In particular, we discuss the sub- and superlinear convergence phases frequently observed in the convergence history of the conjugate gradient method.
A preconditioner can be applied in two different manners, namely, as B^{−1}A or BA. The first form implies the necessity to solve a system with B at each iteration step, while the second form implies a matrix-vector multiplication with B (a multiplicative preconditioner). In the latter case, B can be seen as an approximate inverse of A. One can also use a hybrid form αB_1^{−1} + βB_2. The presentation here is limited to symmetric positive semidefinite matrices. It is based mainly on the articles [29,31].

On the Rate of Convergence Estimates of the Conjugate Gradient Method.
Let A be symmetric, positive semidefinite, and consider the solution of Ax = b by a preconditioned conjugate gradient method. In order to understand how an efficient preconditioner to A should be chosen, we must first understand some general properties of the rate of convergence of conjugate gradient methods.

Rate of Convergence Estimates Based on Minimax Approximation.
As is well known (see, e.g., [6,30]), the conjugate gradient method is a norm-minimizing method. For the preconditioned standard CG method, we have
‖e_k‖_A = min_{P_k ∈ π_k^1} ‖P_k(C^{−1}A)e_0‖_A, (45)
where ‖u‖_A = {u^T Au}^{1/2}, e_k = x − x_k is the iteration error, and π_k^1 denotes the set of polynomials of degree k which are normalized at the origin, that is, P_k(0) = 1. This is a norm on the subspace orthogonal to the nullspace of A, that is, on the whole space if A is nonsingular. Consider the C-inner product (u, v) = u^T Cv, and note that B = C^{−1}A is symmetric with respect to this inner product; let v_1, v_2, . . ., v_n be orthonormal eigenvectors and let λ_i, i = 1, . . ., n, be the corresponding eigenvalues of B. Let e_0 = Σ_{j=1}^n α_j v_j be the eigenvector expansion of the initial error, where α_j = (e_0, v_j), j = 1, . . ., n. Note further that the eigenvectors are both A- and C-orthogonal. Then, by the construction of the CG method, it follows that e_k = P_k(B)e_0 and, using the nonnegativity of the eigenvalues, we find
‖e_k‖_A ≤ max_i |P_k(λ_i)| ‖e_0‖_A. (48)
Here we have used the A-orthogonality of the eigenvectors.
Similarly, using the C-orthogonality, a corresponding bound holds in the C-norm. Due to the minimization property (45), there follows from (48) the familiar bound
‖e_k‖_A/‖e_0‖_A ≤ min_{P_k ∈ π_k^1} max_i |P_k(λ_i)|. (50)
Estimate (50) is sharp in the respect that for every k there exists an initial vector for which equality is attained. In fact, for such a vector we necessarily have that α_j ≠ 0 if and only if λ_j belongs to a set of k + 1 points (the so-called Haar condition) where max_i |P_k(λ_i)| is attained. For such an initial vector, (49) shows that, if the eigenvalues are positive, equality is attained in the C-norm as well. The rate of convergence of the iteration error ‖e_k‖_A is measured by the average convergence factor (‖e_k‖_A/‖e_0‖_A)^{1/k}. Inequality (50) shows that this can be majorized by an estimate of the rate of convergence of a best polynomial approximation problem (namely, the best approximation of the function ≡ 0 by polynomials in π_k^1) in the maximum norm on the discrete set formed by the spectrum of B. Clearly, multiple eigenvalues are treated as single, so the actual approximation problem is
min_{P_k ∈ π_k^1} max_{1≤j≤m} |P_k(λ_j)|, (53)
where the distinct positive eigenvalues λ_j have been ordered increasingly, 0 < λ_1 < λ_2 < · · · < λ_m, and m is the number of such eigenvalues. However, the solution of this problem requires knowledge of the spectrum, which is not available in general. Even if it is known, the estimate (53) can be troublesome in practice, since it involves approximation on a general discrete set of points.
Besides being costly to apply, such estimates do not give any qualitative insight in the behaviour of the conjugate gradient method for various typical eigenvalue distributions.
That is why we make some further assumptions on the spectrum in order to simplify the approximation problem and, at the same time, present estimates which can be used both to estimate the number of iterations and to give some insight into the qualitative behaviour of the iteration method.

Standard Condition Number: Linear Convergence. If the eigenvalues are (densely) located in an interval [a, b]
where a > 0, we can majorize the best approximation problem on the discrete set by the best approximation problem on the interval and frequently still get a good estimate. We have
min_{P_k ∈ π_k^1} max_{1≤j≤m} |P_k(λ_j)| ≤ min_{P_k ∈ π_k^1} max_{a≤λ≤b} |P_k(λ)|.
The solution to this min-max problem is well known and uses Chebyshev polynomials. One finds that
min_{P_k ∈ π_k^1} max_{a≤λ≤b} |P_k(λ)| = 1/T_k((b + a)/(b − a)) ≤ 2σ^k/(1 + σ^{2k}), (55)
where σ = (1 − √(a/b))/(1 + √(a/b)). Hence, the average rate of convergence of the upper bound approaches σ as k → ∞. Also, it is readily found (see [35]) that the relative iteration error satisfies ‖e_k‖_A/‖e_0‖_A ≤ ε if
k ≥ ⌈ln(2/ε)/ln(σ^{−1})⌉. (56)
Here ⌈ξ⌉ denotes the smallest integer not less than ξ. It turns out that the above holds more generally if A is nonsymmetric but the eigenvalues are contained in an ellipse with foci a, b, where b ≥ a > 0, if one replaces σ with σ̃ = σ(1 + δ)/(1 − δ), where δ, the eccentricity of the ellipse (i.e., the ratio of its semiaxes), is sufficiently small. Also, in a similar way, the case of eigenvalues contained in two separate intervals or ellipses can be analysed; see, for example, [35] for further details.
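The condition-number-only bound is easy to evaluate; the sketch below computes the estimated iteration count from the formula above (the condition numbers are chosen only for illustration).

```python
import math

# The bound in terms of the condition number only: with kappa = b/a and
# sigma = (sqrt(kappa) - 1)/(sqrt(kappa) + 1), the relative A-norm error is
# below eps once k >= ceil( ln(2/eps) / ln(1/sigma) ); for large kappa this
# behaves like (1/2) sqrt(kappa) ln(2/eps).

def cg_iteration_bound(kappa, eps):
    sigma = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    return math.ceil(math.log(2.0 / eps) / math.log(1.0 / sigma))

print(cg_iteration_bound(100.0, 1e-6))      # condition number 100
print(cg_iteration_bound(10_000.0, 1e-6))   # 100 times larger kappa
```

Note the square-root dependence: increasing the condition number by a factor of 100 increases the estimated iteration count only by about a factor of 10.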
When b/a → ∞, δ = 0, and ε → 0, the following upper bound becomes an increasingly accurate replacement of (56):
k ≥ ⌈(1/2)√(b/a) ln(2/ε)⌉ + 1.
The above estimates of the rate of convergence and of the number of iterations show that they depend only on the condition number b/a and on the eccentricity of the ellipse containing the eigenvalues. Therefore, except in special cases, this estimate is not very accurate. When we use more detailed information on the spectrum and on the initial error vector, sometimes substantially better estimates can be derived. This holds, for instance, when there are well-separated small and/or large eigenvalues. Before we consider this important case, we mention briefly another similar minimax result which holds when we use different norms for the iteration error vector and for the initial vector. By (48), we have
‖e_k‖_A ≤ max_i |λ_i^s P_k(λ_i)| ‖e_0‖_{A^{1−2s}}.
If the initial vector is such that the Fourier coefficients for the highest eigenvalue modes are dominating, then ‖e_0‖_{A^{1−2s}} may exist and take not too large values even for some s ≥ 1/2. We consider the interesting case where s ≥ 1/2, for which the following theorem holds (see [6,36]).
Theorem 6. Let π_k^1 denote the set of polynomials of degree k such that P_k(0) = 1. Then, for k = 1, 2, . . . and for any s ≥ 1/2 such that 2s is an integer, the minimax value
min_{P_k ∈ π_k^1} max_{0≤λ≤b} λ^s |P_k(λ)| (59)
can be expressed exactly in terms of T_k(x) and U_k(x), the Chebyshev polynomials of kth degree of the first and second kind, respectively.
For other values of s, (59) is an upper bound only, that is, not sharp. At any rate, it shows that the error ‖e_k‖_A converges (initially) at least as fast as (s/(k + s))^{2s}, that is, as 1/(2k + 1) for s = 1/2 and as (1/(k + 1))^2 for s = 1.
Note that this convergence rate does not depend on the eigenvalues, in particular not on the spectral condition number.

Conclusion 1.
By computing the initial approximation vector from a coarse mesh, the components of e_0 for the first Fourier modes will be small and ‖e_0‖_{A^{1−2s}} may take on values that are not very large even when s = 1 or bigger. Therefore, there is an initial decay of the residual as O(k^{−2s}), independent of the condition number. Note, however, that the actual errors may not have decayed sufficiently even if the residual has.
We consider now upper bound estimates which show how the convergence history may enter a superlinear rate.

An Estimate to Show a Superlinear Convergence Rate Based on the K-Condition Number.
A somewhat rough, but simple and illustrative, superlinear convergence estimate can be obtained in terms of the so-called K-condition number,
K = K(B) = ((1/n) tr(B))^n / det(B),
(see [37,38]), where we assume that B is s.p.d. Note that K^{1/n} equals the quotient between the arithmetic and geometric averages of the eigenvalues. This quantity is similar to the spectral condition number κ(B) in that it is never smaller than 1, and is equal to 1 if and only if B = αI, α > 0 (recall that B is symmetrizable).
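The quantity is straightforward to compute from the eigenvalues; the eigenvalue lists below are made up for the illustration.

```python
import math

# K-condition number K(B) = ((1/n) tr(B))^n / det(B), computed here from
# the eigenvalues: K^(1/n) is the quotient of the arithmetic and geometric
# means, so K = 1 exactly for B = alpha * I. Eigenvalue lists are invented.

def k_condition(eigs):
    n = len(eigs)
    arith = sum(eigs) / n
    geom = math.exp(sum(math.log(t) for t in eigs) / n)
    return (arith / geom) ** n

print(k_condition([2.0, 2.0, 2.0]))     # B = 2I: perfectly clustered
print(k_condition([1.0, 2.0, 4.0]))     # spread eigenvalues: K > 1
```

Only the trace and the determinant enter, which is what makes K computable in practice even when individual eigenvalues are not.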
Based on the K-condition number, a superlinear convergence result can be obtained as follows.
Theorem 7. Let k < n be even and k ≥ 3 ln K. Then, ‖e_k‖_A/‖e_0‖_A is bounded by a factor which decreases with increasing k.

Proof. Let k = 2m and let the polynomial P_k be of a simplest possible form, that is, let it vanish at the m smallest and m largest eigenvalues of B. As follows from (48), the bound is then determined by the remaining, middle part of the spectrum. Using now twice the inequality between the arithmetic and geometric mean values, the stated estimate follows.

A somewhat better result of the same type exists, namely an estimate of ‖r_k‖/‖r_0‖, where r_k = Ae_k, which was obtained using more complicated techniques; see [6,37] and the references quoted therein. Note that as k/ln K → ∞, the simpler upper bound in (63) is asymptotically worse than this bound (albeit in a different norm). The upper bounds in the above estimates involve a convergence factor which decreases with increasing iteration number and hence show a superlinear rate of convergence. Note, however, that K^{1/n} ≤ κ(B) ≤ 4K (see [6]), where κ(B) = λ_n/λ_1 is the spectral condition number, so the K-condition number may take very large values when κ(B) is large.
The estimates based on the K-condition number involve only "integral" characteristics of the preconditioned matrix (the trace and the determinant).Sometimes, it is possible to obtain a practical estimate of K(B) which can be useful for the a priori construction of good preconditioners and for the a posteriori assessment of their quality, see Section 5 for further details.
The estimate shows that the iteration error is reduced below a relative accuracy ε once k is sufficiently large. Hence, when K ≫ ε^{-2}, the estimated number of iterations depends essentially only on log_2 K, that is, it depends little on the relative accuracy ε, which indicates a fast superlinear convergence when k > log_2 K.
When actually estimating the number of iterations, Theorem 6 gives a useful result only when k > O(ln K) = n ln(K^{1/n}), that is, the quotient between the arithmetic and geometric averages of the eigenvalues, which equals K^{1/n}, must be close to unity and the eigenvalues must be very well clustered, so that K^{1/n} = 1 + O(n^{-ε}) for some ε > 0; otherwise the estimated number of iterations will be O(n), which is normally a useless result. The next example illustrates this further.
Example 1. Consider a geometric distribution of the eigenvalues of A, λ_j = j^s, j = 1, 2, ..., n, for some positive s. Here, the quantity K^{1/n} can be estimated asymptotically using Stirling's formula. Hence, s must be sufficiently small for the estimate in Theorem 6 to be useful. On the other hand, the spectral condition number is κ(A) = n^s, and the simple estimate based on κ(A) leads to k ∼ O(n^{s/2}) and hence gives, asymptotically, a smaller upper bound when s < 2. For further discussions of the superlinear rate of convergence, see [39].
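As a concrete illustration (our own sketch, not part of the original example), the K-condition number can be computed directly from the eigenvalues via the arithmetic/geometric mean ratio K^{1/n}; the function name and the choice s = 1 below are ours:

```python
import numpy as np

# K(B) = ((arithmetic mean)/(geometric mean))^n of the eigenvalues of B;
# equivalently K = (trace(B)/n)^n / det(B).  Illustrative sketch only.
def k_condition(eigs):
    n = len(eigs)
    arith = np.mean(eigs)
    geom = np.exp(np.mean(np.log(eigs)))
    return (arith / geom) ** n

# Example 1 with s = 1: eigenvalues lambda_j = j, j = 1, ..., n
n = 100
eigs = np.arange(1, n + 1, dtype=float)
K = k_condition(eigs)
kappa = eigs[-1] / eigs[0]        # spectral condition number

# K^(1/n) close to 1 signals well-clustered eigenvalues
print(K ** (1.0 / n), kappa)
```

Note that K itself is astronomically large here even though K^{1/n} is moderate, illustrating the remark above that the K-condition number may take very large values when κ is large.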

Generalized Conjugate Gradient Methods.
The rate of convergence estimates given above hold for a restricted class of matrices: symmetric or, more generally, normal matrices.
To handle more general classes of problems for which optimal rate of convergence results such as (45) hold, one needs more involved methods. Much work has been devoted to this problem. This includes methods like the generalized minimum residual (GMRES) method, see Saad and Schultz [40], the generalized conjugate residual (GCR) method, and the generalized conjugate gradient (GCG) method, see [6] and, for further details, [41]. As opposed to the standard conjugate gradient method, they require a long version of updates for the search directions, as the newest search direction at each stage is, in general, not automatically (in exact precision) orthogonal to the previous search directions, but must be orthogonalized at each step. This makes the computational expense per step grow linearly, and the total expense grows quadratically with the iteration index. In addition, due to finite precision, there is a tendency of loss of orthogonality, even for symmetric problems, when many iterations are required. One remedy which has been suggested is to use the method only for a few steps, say 10, and then restart the method with the current approximation as initial approximation.
Clearly, however, in this way, the optimal convergence property over the whole Krylov set of vectors is lost. For this and other possibilities, see, for example, [42].
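The restart remedy mentioned above can be sketched with SciPy's GMRES implementation (an illustrative set-up of our own; the matrix and tolerances are arbitrary choices):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import gmres

# Restarting every 10 steps bounds the growing orthogonalization cost per
# step, at the price of losing optimality over the whole Krylov space.
n = 200
A = diags([-1.0, 2.05, -1.0], offsets=[-1, 0, 1], shape=(n, n)).tocsr()
b = np.ones(n)

x, info = gmres(A, b, restart=10, maxiter=500)   # GMRES(10)
print(info, np.linalg.norm(A @ x - b))           # info == 0 on convergence
```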
Another important version of the generalized conjugate gradient methods occurs when one uses variable preconditioners.Variable preconditioners, that is, a preconditioner changed from one iteration to the next iteration step, are used in many contexts.
For instance, one can use a variable drop tolerance, computed adaptively, in an incomplete factorization method (see Section 4). When the given matrix is partitioned in a two-by-two block form, it can be efficient to use inner iterations when solving the arising systems for one, or both, of the diagonal block matrices; see, for example, [43], and the flexible conjugate gradient method in Saad [44, 45].
Due to space limitations, the above topics will not be further discussed in this paper.

Incomplete Factorization Methods
There exist two classes of preconditioning methods that are closely related to direct solution methods. In this paper, we survey only their main ingredients and omit many of the particular aspects.
The first method is based on incomplete factorization, where some entries arising during a triangular factorization are neglected to save memory. The deletion can be based on some drop-tolerance criterion or on a, normally a priori chosen, sparsity pattern. The factorization based on a drop tolerance takes the following form. During the elimination (or, equivalently, triangular factorization), the off-diagonal entries are accepted only if they are not too small. Here ε, 0 < ε ≪ 1, is the drop-tolerance parameter. Such methods may lead to too much fill-in (i.e., a_ij ≠ 0 in positions where the original entry was a zero), because, to be robust, they may require near machine-precision drop tolerances. Furthermore, like direct solution methods, they are difficult to parallelize efficiently.
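A drop-tolerance incomplete factorization of this kind is available in SciPy as `spilu`; the following sketch (our construction, with arbitrary parameter values) uses it as a preconditioner for the conjugate gradient method:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spilu, LinearOperator, cg

# Entries below drop_tol are discarded during factorization, trading the
# accuracy of the preconditioner for sparsity of the triangular factors.
n = 400
A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n)).tocsc()
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-4, fill_factor=10)
M = LinearOperator((n, n), matvec=ilu.solve)     # action of the preconditioner

x, info = cg(A, b, M=M, maxiter=1000)
print(info, np.linalg.norm(A @ x - b))
```

Tightening drop_tol toward machine precision makes the factors approach the exact ones, at the cost of the fill-in discussed above.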
The incomplete factorization method can readily be extended to matrices partitioned in block form. Often, instead of a drop tolerance, one prescribes the sparsity pattern of the triangular factors of the computed preconditioner, that is, entries arising outside the chosen pattern are ignored. An early presentation of such incomplete factorization methods was given by Meijerink and van der Vorst [46]. One can make a diagonal compensation for the neglected entries, that is, add them to the diagonal entry in the same row, possibly first multiplied by some scalar Θ, 0 < Θ ≤ 1. For discussions of such approaches, see [29, 30, 47, 48]. This frequently moves the small eigenvalues, corresponding to the smoother harmonics, away from their cluster near the origin, in this way sometimes improving the spectral condition number by an order of magnitude (see [6, 47]).
The other class of methods is based on approximate inverses G, for instance one that minimizes a Frobenius norm of the error matrix I − GA; see Section 5 for further details. To be sufficiently accurate, these methods frequently lead to nearly full matrices. This can be understood since the matrices we want to approximate are often sparse discretizations of diffusion problems. The inverse of such an operator is a discrete Green's function which, as is well known, often has a significantly sized support on nearly the whole domain of definition.
However, we can use an additive approximation of the inverse involving two, or more, terms, each approximating the inverse on a different vector subspace. By defining in this way the preconditioner recursively on a sequence of lower-dimensional subspaces, it may preserve the accurate approximation property of the full inverse method while still needing only actions of sparse operators.
Frequently, the given matrices are partitioned in a natural way in a two-by-two block form. For such matrices, it can be seen that the two approaches are similar. Consider, namely, such a two-by-two block matrix A, where we assume that the pivot block A_1 and the Schur complement matrix S = A_2 − A_21 A_1^{-1} A_12 are nonsingular. (This holds, in particular, if A is symmetric and positive definite.) We can construct either a block approximate factorization of A or an approximation of the inverse of A in additive form. As the following shows, the approaches are related. First, A admits a block matrix factorization, where I_1, I_2 denote the unit matrices of proper order, and a corresponding additive expression holds for its inverse. A straightforward computation reveals that A_V ≡ V^T A V = S. Let M_1 ≈ A_1 be an approximation of A_1 (for which linear systems are simpler to solve than for A_1) and let G_1 ≈ A_1^{-1} be a sparse approximate inverse; possibly G_1 = M_1^{-1}. Then, one obtains both a preconditioner to A and an approximate inverse. Hence, a convergence estimate for one method can be directly applied to the other method as well. For further discussions of block matrix preconditioners, see, for example, [49-52]. As can be seen from the above, Schur complement matrices play a major role in both matrix factorizations.
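The block matrix factorization referred to above can be checked numerically; the following small sketch (our own construction, with random nonsingular blocks) verifies that A equals the product of the block unit lower triangular factor and the block upper triangular factor containing A_1 and the Schur complement S:

```python
import numpy as np

# A = [[A1, A12], [A21, A2]] factors as
#     [[I, 0], [A21 A1^{-1}, I]] @ [[A1, A12], [0, S]],
# with S = A2 - A21 A1^{-1} A12 the Schur complement.
rng = np.random.default_rng(3)
n1, n2 = 4, 3
A1 = rng.standard_normal((n1, n1)) + 5.0 * np.eye(n1)   # kept nonsingular
A12 = rng.standard_normal((n1, n2))
A21 = rng.standard_normal((n2, n1))
A2 = rng.standard_normal((n2, n2)) + 5.0 * np.eye(n2)
A = np.block([[A1, A12], [A21, A2]])

S = A2 - A21 @ np.linalg.solve(A1, A12)
Lfac = np.block([[np.eye(n1), np.zeros((n1, n2))],
                 [A21 @ np.linalg.inv(A1), np.eye(n2)]])
Ufac = np.block([[A1, A12], [np.zeros((n2, n1)), S]])

print(np.max(np.abs(Lfac @ Ufac - A)))   # ~ machine precision
```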
We consider now multilevel extensions of the additive approximate inverse subspace correction method.It is illustrative to consider first the exact inverse and its relation to Gaussian (block matrix) elimination.

The Exact Inverse on Additive Form.
Let A^{(0)} = A and consider a sequence of matrices A^{(k)}, where each pivot block A^{(k)}_1 is nonsingular, being a block diagonal part of a symmetric positive definite matrix. Hence, a recursion holds in which the Schur complement formed on level k becomes the matrix A^{(k+1)} on the next level, and I^{(k+1)} is the identity matrix on level k + 1. Note that in this example the dimensions decrease with increasing level number, and the final matrix (i.e., Schur complement) in the sequence has the smallest dimension. The above recursion can be rewritten in compact form, where the kth column of the block upper triangular matrix U is determined by the off-diagonal blocks on level k, and L = U^T.
Hence, this is the (block matrix) Gaussian elimination method applied directly to form the inverse matrix. In this way, there is no need to first form the factorization A = LDU and then A^{-1} = U^{-1}D^{-1}L^{-1}. As is well known and readily seen, the columns of U^{-1} and L^{-1} are formed directly, with no additional computation, from those of U and L, respectively. Note that U^{-1} is upper (block) triangular and L^{-1} is lower (block) triangular.
The matrix D contains the (block) pivot matrices which arise during the factorization. Permutations to increase the stability by finding dominating pivots can be done by replacing A^{(k+1)} with the permuted matrix P^{(k)} A^{(k+1)} P^{(k)T}, on which the next elimination step takes place.
An incomplete factorization method for approximate inverses can be defined by approximating each arising Schur complement matrix with some sparse matrix B^{(k+1)}, and possibly also approximating A^{(k)}_1, to yield the approximate inverse. In forming the approximate Schur complement, one can use a simpler matrix D^{(k)}_1 than B^{(k)}_1; often a diagonal matrix suffices. The intermediate Schur complement matrix can possibly be further approximated by deleting certain off-diagonal entries to preserve sparsity. These entries can be compensated for by modifying the diagonal of S^{(k)}_2 to form the final approximation B^{(k+1)}. Thereby, it can be important to make the approximate Schur complement exact on some particular vector or vector space. (We do not go into these aspects further here; see [43, 56] for details.) The eigenvectors for the smallest eigenvalues of A provide efficient column vectors for the matrix V to reduce significantly the condition number of BA as compared to that of A. However, in general the eigenvectors are not known, and even if they were known, it would be costly to apply the corresponding preconditioner, as V would be a full matrix. A more viable choice is to let V be defined by the basis functions {ϕ^{(H)}_i} of a coarse mesh (or coarsened matrix graph), so that V and V^T act then, respectively, as prolongation and restriction operators, where J_12 is the interpolation matrix from the coarse mesh (Ω_H) to the fine mesh (Ω_h), and we assume Ω_H ⊂ Ω_h.
Further, letting the matrices be variationally defined, as in a finite element method, we have A_H = V^T A V, where A is the finite element matrix on the fine mesh. Now, the eigenvectors for the smaller eigenvalues of A_H are normally accurate approximations of the corresponding eigenvectors of A. Furthermore, the eigenvectors of A_H are members of Im V. Therefore, the matrix V in (94) acts nearly as well as the eigenvector matrix but, in addition, is sparse. Hence, the approximate inverse takes an additive form in which the projection matrix P projects vectors onto the subspace Im V, which normally contains good approximations of the eigenvectors for the smallest eigenvalues of A, that is, those which may cause severe ill-conditioning. Clearly, the approximation is more accurate the closer Ω_H is to Ω_h. However, the cost of the action of P (mainly the coarse mesh solver for the action of A_H^{-1}) increases when Ω_H expands. One can balance Ω_H to Ω_h in order to let the action of P involve the same order of computational complexity as an action of A, that is, O(h^{-2}) for a sparse matrix A. Assuming such a balance in a 2D diffusion problem (e.g., using a modified incomplete factorization method as preconditioner for the conjugate gradient method), we find H = h^{4/5}. The number of outer iterations with preconditioner B depends also on the choice of G. We refer the discussion of how G can be chosen properly to [56].
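The additive two-level preconditioner B^{-1} = G + V A_H^{-1} V^T described above can be sketched as follows for a 1D model problem; the Jacobi choice G = D^{-1} and the piecewise-linear interpolation matrix V are our own illustrative assumptions:

```python
import numpy as np

# Additive two-level approximate inverse: a sparse smoother G plus a
# coarse-grid correction V A_H^{-1} V^T with A_H = V^T A V (variational).
n = 63                                  # fine mesh, h = 1/(n+1)
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

nc = 31                                 # coarse mesh, H = 2h
V = np.zeros((n, nc))
for j in range(nc):                     # piecewise-linear interpolation
    V[2 * j + 1, j] = 1.0
    V[2 * j, j] = 0.5
    V[2 * j + 2, j] = 0.5

AH = V.T @ A @ V                        # Galerkin coarse matrix
G = np.diag(1.0 / np.diag(A))           # Jacobi smoother (our choice)
Binv = G + V @ np.linalg.solve(AH, V.T)

lam = np.sort(np.linalg.eigvals(Binv @ A).real)
kappa_A = np.linalg.cond(A)
print(lam[-1] / lam[0], kappa_A)        # preconditioned vs unpreconditioned
```

The coarse term handles the smooth eigenvectors that the diagonal smoother alone leaves nearly untouched, which is why the preconditioned eigenvalue spread is far smaller than κ(A).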
As an example, for a model diffusion problem with constant coefficients on a regular mesh, say for the Laplacian operator (−Δ) on the unit square with Dirichlet boundary conditions, the eigenvectors on Ω_h are v^{(h)}_{k,l}, k, l = 1, 2, ..., h^{-1} − 1, with corresponding known eigenvalues. For the coarse mesh, the analogous relations hold with k, l = 1, 2, ..., H^{-1} − 1, and here the vectors V v^{(H)}_{k,l} are good approximations (interpolants) of the eigenvectors v^{(h)}_{k,l} on Ω_h for the smallest eigenvalues.
An alternative choice of matrix V is to take eigenvectors from a nearby problem, normally defined by taking limit values of some problem parameter, see [56].
Multigrid, algebraic multilevel, and algebraic multigrid methods have been presented thoroughly in, for example, [29, 43, 57, 58]. Because of space limitations, they cannot be presented here.

Symmetrization of Preconditioners; the SSOR and ADI
Methods. As we have seen, the incomplete factorization methods require first a factorization step. There exist simpler preconditioning methods that require no factorization but have a form similar to the incomplete factorization methods. We will present two methods of this type. As an introduction, consider first an iterative method of the form M(x_{k+1} − x_k) = b − Ax_k to solve Ax = b, where A and M are nonsingular. As we saw in Section 2, the asymptotic rate of convergence is determined by the spectral radius of the iteration matrix B = I − M^{-1}A. For a method such as the SOR method (which also requires no factorization), with optimal overrelaxation parameter ω (assuming that A has property A^π or A is s.p.d.; see Section 2), the eigenvalues of the corresponding iteration matrix B are situated on a circle. No further acceleration is then possible. There is, however, a simple remedy to this, based on taking a step in the forward direction of the chosen ordering, followed by a backward step, that is, a step in the opposite order through the vector components. This method is said to have its origin in the early days of computers, when programs were stored on tapes that had to be rewound before a new forward SOR step could begin. It was found that this otherwise useless computer time for the rewinding could be better used for a backward SOR sweep! As we will see, for symmetric and positive definite matrices the combined forward and backward sweeps correspond to an s.p.d. matrix which, contrary to the SOR method, has the advantage that it can be used as a preconditioning matrix in an iterative acceleration method. This method, called the SSOR method, will be defined later.
For an early discussion of the SSOR method, used as a preconditioner, see [59]. For discussions about symmetrization of preconditioners, see [6, 60, 61]. More generally, if A is s.p.d., we consider the symmetrization of an iterative method of the form (103). For the analysis only, we consider the transformed form of (103). If M is unsymmetric, the iteration matrix is also unsymmetric. We will now consider a method using M and another preconditioner chosen so that the iteration matrix for the combined method becomes symmetric. We call this the symmetrization of the method.
Let M_1, M_2 be two such preconditioning matrices, and consider the combined iteration matrix B_2 B_1. As we will now see, it arises as the iteration matrix for the combined method in which a step with M_1 is followed by a step with M_2. For the analysis only, we transform this to a symmetrized form. For the following, we need a lemma.

Lemma 2. If A, B, and C are Hermitian positive definite and each pair of them commute, then ABC is Hermitian positive definite.
Proof. We have (ABC)* = CBA, and using the commutativity, CBA = ABC. Hence, ABC is Hermitian. Next, we show that the product of two Hermitian positive definite matrices that commute is positive definite. We have A^{1/2}(AB)A^{-1/2} = A^{1/2}BA^{1/2}, which is Hermitian positive definite. Hence, by similarity, the eigenvalues of AB are positive and, since AB is Hermitian, AB is Hermitian positive definite. In the same way, (AB)C is Hermitian positive definite.

Lemma 3. Let A be s.p.d. and assume either of the following additional conditions:

(a) M_2 = M_1^T, or (b) M_1 and M_2 are symmetric and the pair of matrices M_1, M_2 commutes.

Then, the combined iteration method (107) converges, and the eigenvalues of B_2 B_1 are contained in the interval (0, 1).

Proof. It is readily seen that y = b̃ + B_2 B_1 y (i.e., the iteration method is consistent with Ax = b), where y = A^{1/2}x and x = A^{-1}b. Hence, the iteration method (112), and therefore (107), converges for any initial vector if and only if ρ(B_2 B_1) < 1, where ρ(·) denotes the spectral radius. It is readily seen that, under either of the given conditions (a) or (b), B_2 B_1 is symmetric. Further, a similarity transformation shows that B_2 B_1 is similar to a congruence transformation of a symmetric positive definite matrix, so B_2 B_1 has positive eigenvalues. Hence, the eigenvalues of B_2 B_1 are contained in the interval (0, 1) and, in particular, ρ(B_2 B_1) < 1.
The proof of Lemma 3 shows that B 2 B 1 is symmetric, so the combined iteration method is a symmetrized version of either of the simple methods.
Let us now consider a special class of symmetrized methods. We let A be split as A = D + L + U, where we assume that D is s.p.d., and let M_1 = D/ω + L, M_2 = D/ω + U, where ω is a parameter, 0 < ω < 2. (Here, L and U are not necessarily the lower and upper triangular parts of A.) Note that M_1 + M_2 = A + (2/ω − 1)D, so this is also a splitting of A. As an example of a combined, or symmetrized, iteration method, we consider the preconditioning matrix C = (D/ω + L)((2/ω − 1)D)^{-1}(D/ω + U) and show that this leads to a convergent iteration method. This corresponds to choosing M_1 = D/ω + L, M_2 = D/ω + U, and it can be seen that the conditions of Lemma 3 hold if the conditions in the next theorem hold.

Proof. This can be shown either by verifying the conditions in Lemma 3 or, more directly, as follows. As in the proof of Lemma 3, it follows that C is s.p.d. Hence, the eigenvalues of C^{-1}A are positive. Further, by (121), C − A is positive semidefinite under either condition (a) or (b), so x*Cx ≥ x*Ax for all x, and the eigenvalues of C^{-1}A are bounded above by 1.
We will now show that the matrix C can also be used efficiently as a preconditioning matrix, which for a proper value of the parameter ω, and under an additional condition, can even reduce the order of magnitude of the condition number. In this respect, note that when C is used as a preconditioning matrix for the Chebyshev iterative method, it is not necessary to have C scaled so that λ(C^{-1}A) ≤ 1, because it suffices then that 0 < m ≤ λ(C^{-1}A) ≤ M for some numbers m, M. Hence, the factor 2/ω − 1 in D^{-1} can be neglected. Further, if there exists a vector x for which x^T(L + U)x ≤ 0, then γ ≥ −1/4, with γ ≤ 0 under additional conditions.

Proof. It is readily seen that the lower bound in (127) holds; the upper bound follows by Theorem 8. By choosing a vector for which x^T(L + U)x ≤ 0, it follows that γ ≥ −1/4. The remainder of the theorem is immediate.

The Condition Number.
Theorem 9 shows that the optimal value of ω, minimizing the upper bound of the condition number of C^{-1}A, is the value that minimizes the real-valued function f(ω). It is readily seen (see Axelsson and Barker, 1984 [30]) that f(ω) is minimized for a value of ω depending on δ. In general, δ is not known, but we may know that δ = O(h²) for some problem parameter h → 0 (such as the step length in second-order elliptic problems). Then, if γ = O(1) as h → 0, we let ω = 2/(1 + ξh) for some ξ > 0, in which case the condition number of C^{-1}A becomes O(h^{-1}) = O(δ^{-1/2}). This means that C^{-1}A has an order of magnitude smaller condition number than A itself, which is O(δ^{-1}).
We consider now two applications of Theorem 9.

The SSOR Method.
In the first case, L is the lower triangular part of A, or the lower block triangular part if A is partitioned in block matrix form, and U = L*. Then, C is a symmetrized version of the SOR method, and the resulting method is called the SSOR (symmetric successive overrelaxation) method.
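A minimal numerical sketch of the SSOR preconditioner follows (our own construction; the model matrix and a near-optimal ω of the form ω = 2/(1 + ξh) discussed above, here with ξh = sin(πh), are illustrative assumptions):

```python
import numpy as np

# SSOR preconditioner C = (D/w + L) ((2/w - 1) D)^{-1} (D/w + U), U = L^T,
# for the 1D model matrix A = tridiag(-1, 2, -1).
def ssor_matrix(A, w):
    D = np.diag(np.diag(A))
    L = np.tril(A, -1)
    U = np.triu(A, 1)
    Dinv = np.diag(1.0 / np.diag(A))
    return (D / w + L) @ (Dinv / (2.0 / w - 1.0)) @ (D / w + U)

n = 50
h = 1.0 / (n + 1)
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

w = 2.0 / (1.0 + np.sin(np.pi * h))      # near-optimal omega (our choice)
C = ssor_matrix(A, w)
lam = np.sort(np.linalg.eigvals(np.linalg.solve(C, A)).real)

kappa_A = np.linalg.cond(A)              # O(h^-2)
kappa_prec = lam[-1] / lam[0]            # reduced to O(h^-1) near optimal omega
print(kappa_A, kappa_prec)
```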
As an example, for an elliptic differential equation of second order, it can be seen that the condition ρ(L̃L̃^T) ≤ 1/4 holds for problems with Dirichlet boundary conditions and constant coefficients. For extensions of this, see Axelsson and Barker [30]. For the model difference equation on a square domain with side π, the quantities δ and γ can be computed explicitly.

4.3. The ADI Method. In the second case of methods of the form (101), we let L denote the off-diagonal part of the difference operator working in the x-direction, and U the off-diagonal part of the difference operator working in the y-direction; D is the diagonal part. Then, the matrix C is called an alternating direction preconditioning matrix, and the corresponding iteration method is called the ADI (alternating direction iteration) method. In this method, we solve alternately one-dimensional difference equations in the x- and y-directions. Much has been written on the ADI method, which was originally presented in Peaceman and Rachford [62]; see Varga [10] for an early influential presentation, and Birkhoff et al. [63] and Wachspress [64], for instance. As we will see, for the model difference equations we get the same optimal value of ω as in (138). The condition γ = O(1) may be less restrictive for the ADI method, but the condition of commutativity is much more restrictive, as the following lemma shows.

Lemma 4. Let A, B be two Hermitian matrices of order n.
Then AB = BA if and only if A and B have a common set of orthonormal eigenvectors.
Proof. If such a common set of eigenvectors {v_i} exists, with Av_i = λ_i v_i and Bv_i = μ_i v_i, then ABv_i = λ_i μ_i v_i = BAv_i for all i. Since the eigenvector space of an Hermitian matrix is complete, we therefore have AB = BA. Conversely, suppose that AB = BA. As A is Hermitian, take U to be a unitary matrix that diagonalizes A, that is, Ã = U*AU = diag(γ_1 I_1, ..., γ_r I_r), where γ_1 < γ_2 < ··· < γ_r are the distinct eigenvalues of A and I_j is the identity matrix of order n_j, the multiplicity of γ_j. (Here, A is possibly permuted accordingly.) Let B̃ = U*BU and partition B̃ corresponding to the partitioning of Ã. Since AB = BA, we have ÃB̃ = B̃Ã. Carrying out the block multiplication ÃB̃ = B̃Ã, we find that γ_i B̃_ij = γ_j B̃_ij, which, in turn, implies B̃_ij = 0, i ≠ j, since γ_i ≠ γ_j for i ≠ j. Simply stated, a (block) matrix commutes with a (block) diagonal matrix if and only if it is itself (block) diagonal. Hence, B̃ is block diagonal, and each Hermitian submatrix B̃_ii has n_i orthonormal eigenvectors that are also eigenvectors of the submatrix γ_i I_i of Ã. Since Σ_{i=1}^r n_i = n and all eigenvectors are orthonormal, A and B must have a common set of orthonormal eigenvectors.
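A quick numerical illustration of the "if" direction of Lemma 4 (our own example): two Hermitian matrices built on a common orthonormal eigenvector basis commute.

```python
import numpy as np

# Real symmetric (hence Hermitian) matrices sharing eigenvectors commute.
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # common eigenvectors
A = U @ np.diag([1.0, 2.0, 3.0, 4.0, 5.0]) @ U.T
B = U @ np.diag([9.0, 7.0, 5.0, 3.0, 1.0]) @ U.T

print(np.max(np.abs(A @ B - B @ A)))   # ~ machine precision
```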
For the second-order elliptic difference equation in two space dimensions, it turns out that the commutativity of L and U essentially corresponds to the property that the original problem is separable, that is, that solutions of Lu = f can be written in the form u = ϕ(x)ψ(y). This means that the coefficients a(x, y) and b(x, y) in the differential operator ∂/∂x[a(x, y)∂u/∂x] + ∂/∂y[b(x, y)∂u/∂y] + c(x, y)u must satisfy a(x, y) = a(x), b(x, y) = b(y), and c(x, y) = c, a constant. Hence, if a(x, y) = b(x, y), then a(x, y) = b(x, y) = a, a constant. Furthermore, the convex closure of the meshpoints must be a rectangle with sides parallel to the coordinate axes (Varga [10], 1962). If A = A_1 + A_2, the ADI method can be written as a two-step iteration with parameters τ_1, τ_2; this is the Peaceman-Rachford [62] iteration method, with iteration matrix M. When A_1, A_2 are Hermitian positive definite, their eigenvalues λ^{(1)}_i, λ^{(2)}_i are positive, and the spectral radius of M can be bounded by a factor μ(τ_1, τ_2). Note that for τ_1 = τ_2 = τ > 0, μ(τ_1, τ_2) = μ(τ, τ) < 1, so we have ρ(M) < 1, that is, convergence for any τ > 0. This holds even if A_1, A_2 do not commute. Note also that when A_1 and A_2 commute, the bound is sharp. Let us continue the analysis for the general case where A_1, A_2 do not necessarily commute. We want to compute the optimal values of τ_1 and τ_2 such that μ(τ_1, τ_2) is minimized. For simplicity, we assume that α, β are the same lower and upper bounds for the eigenvalues of A_1 and A_2. We want to choose τ_1, τ_2 > 0 such that this bound is as small as possible. Note, then, that for such values of τ_1, τ_2 we must have 1 − τ_i α > 0 and 1 − τ_i β < 0. Next, note that each factor in the bound (150) is minimized when the corresponding endpoint values are equal in modulus. Thus, both factors are simultaneously minimized when τ_1 = τ_2, and then τ_1 = τ_2 = 1/√(αβ).

Theorem 10. Let
Remark 2. For a model difference equation for a second-order elliptic differential equation problem on a square with side π, we have, with stepsize h, α = O(h²) and β = O(1). Then, the convergence factor behaves as 1 − O(h). Note that this is just the convergence factor we get for the SOR method with an optimal overrelaxation parameter.
Since ρ(M) ≤ μ(τ_1, τ_2), this means that the ADI method with parameters chosen as above converges at least as fast as the SOR method. Note, however, that in the ADI method we must solve two systems of equations with tridiagonal coefficient matrices (I + τA_i) on each step, while the pointwise SOR method requires no solution of such systems.
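The Peaceman-Rachford sweep with the single parameter τ = 1/√(αβ) derived above can be sketched as follows (a small dense set-up of our own; the commuting operators A_1, A_2 are the 1D parts of the 2D model Laplacian):

```python
import numpy as np

# One ADI sweep: solve with (I + tau*A1) in the x-direction, then with
# (I + tau*A2) in the y-direction (Peaceman-Rachford form).
m = 15
T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A1 = np.kron(T, np.eye(m))            # x-direction part
A2 = np.kron(np.eye(m), T)            # y-direction part
A = A1 + A2
n = m * m
I = np.eye(n)

h = np.pi / (m + 1)
alpha = 2.0 - 2.0 * np.cos(h)         # extreme eigenvalues of T
beta = 2.0 + 2.0 * np.cos(h)
tau = 1.0 / np.sqrt(alpha * beta)     # optimal single parameter

b = A @ np.ones(n)                    # manufactured solution x* = 1
x = np.zeros(n)
for _ in range(60):
    x_half = np.linalg.solve(I + tau * A1, (I - tau * A2) @ x + tau * b)
    x = np.linalg.solve(I + tau * A2, (I - tau * A1) @ x_half + tau * b)

print(np.max(np.abs(x - 1.0)))        # error after 60 sweeps
```

In practice the shifted systems are tridiagonal after reordering, so each half-sweep is a set of cheap one-dimensional solves, as noted above.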

The Commutative Case. Assume now that A_1 and A_2 commute.
Then, as we have seen, M is symmetric and has real eigenvalues, and we can apply the Chebyshev acceleration method. The eigenvalues of the corresponding preconditioned matrix C are related to the eigenvalues of M, from which the asymptotic rate of convergence of the Chebyshev accelerated method follows; for the model difference equation, the asymptotic rate of convergence can be computed explicitly.

4.5. The Cyclically Repeated ADI Method. The real power of the ADI method is brought forth when we use a sequence of parameters τ_l. Assume that A_1, A_2 commute; then we choose the parameters τ_l cyclically. With a cycle of q parameters, we get the iteration matrix M^{(q)}, whose eigenvalues are ∏_{p=1}^{q} (1 − τ_p λ^{(1)}_i)(1 − τ_p λ^{(2)}_j) / ((1 + τ_p λ^{(1)}_i)(1 + τ_p λ^{(2)}_j)). In the same way as above, ρ(M^{(q)}) is minimized by an appropriate choice of the parameters τ_p.

A Preconditioning Method for Complex Valued Matrices.
Complex valued systems of equations arise in many applications. A commonly occurring case is the solution of a matrix polynomial equation Q_m(A)x = b, where A is a real square matrix and Q_m is a polynomial of degree m that has no zeros at the eigenvalues of A. Here, Q_m can be factored into a product of second-degree, and possibly some first-degree, polynomial factors with real coefficients.
The second-degree polynomials can be factored into products of first-degree polynomials with complex coefficients.
Consider then a linear system of the form (R + iS)(x + iy) = c + id, where R, S are real matrices of order n and x, y, c, d ∈ R^n.
The system can be solved in complex arithmetic.However, complex arithmetic leads to heavier computational complexity and it is in general difficult to precondition complex valued matrices, as the eigenvalues of the given matrix or the preconditioned matrix can be spread in the whole complex plane and the iterative solution method may then converge too slowly.
One can alternatively apply a preconditioned conjugate gradient method to the Hermitian positive definite normal-matrix system A^H Au = A^H b, for which the eigenvalues are real. At any rate, this involves complex arithmetic, which costs typically three to four times as much as the corresponding real arithmetic.
Complex arithmetic can be avoided by rewriting (168) in real valued form, such as the forms A^{(1)} and A^{(2)} below. The block matrices are here real but, in general, nonsymmetric and/or indefinite. For the solution, one can use a generalized conjugate gradient method such as GMRES [40] or GCG [65].
For A^{(1)}, any eigenvalue λ appears in complex conjugate pairs λ, λ̄. For A^{(2)}, which is real symmetric, for any eigenvalue λ ≠ 0, −λ is also an eigenvalue. Thus, the spectrum σ(A^{(1)}) is symmetric with respect to the real axis and the spectrum σ(A^{(2)}) is symmetric with respect to the imaginary axis, that is, in both cases the spectrum embraces the origin. From best polynomial approximation properties, it is known that such eigenvalue distributions lead to polynomials of essentially squared degree for the same approximation accuracy, compared to the case with a one-sided spectrum.
In [21], one finds further explanations of why Krylov subspace methods can be inefficient for solving complex valued systems represented in the above real forms. Several iterative solution methods, such as the QMR method [22], have been developed and proven to be efficient for these types of problems. However, it is difficult to precondition complex valued matrices, and unpreconditioned methods converge in general very slowly.
Following [66], we consider here instead an approach based on rewriting the equation in the form (170).
Instead of solving the full block matrix system, we apply a Schur complement approach by the elimination of one component, which results in the reduced system (171). As an introduction, assume first that R is symmetric and positive definite and S is symmetric and positive semidefinite. As a preconditioner to the matrix C in (171), we take R + S.
For the generalized eigenvalue problem λRz = Sz, z ≠ 0, or, equivalently, Hy = λy, y ≠ 0, where μ = R^{1/2}z, it follows from (174) that the eigenvalues of the preconditioned matrix can be expressed in terms of λ. Since, by assumption, λ ≥ 0, the condition number satisfies the bound (178). The correspondingly preconditioned conjugate gradient method to solve (174) therefore converges rapidly. There exists an even more efficient form of the iteration matrix, which also shows that we can weaken the assumptions made on R and S. Hence, consider the system again and assume that α is a real parameter such that R + αS is nonsingular. Such a parameter exists if ker(R) ∩ ker(S) = {0}.
Multiplying the first equation by −α(R + αS)^{-1}, the second by (R + αS)^{-1}, and adding, yields an equation for y. Now multiplying this equation by S, using Sy = Rx − c, and rewriting the equation properly, we find (179). When solving the system by iteration, such as by Chebyshev iteration, r will be the residual, and we observe that r can be written in a form (see (179)) in which there is no need to compute the right-hand side vector f initially, as when (168) is used, and the vector y is found during the iteration process. This saves two solutions with the matrix R + αS. Since we need few iterations, such a saving can be important to decrease the total expense of the method.
To solve (179), we use R + αS as a preconditioner. The resulting preconditioned matrix takes the form (182). This form can also be used to derive eigenvalue estimates in more general cases than was done above. If R is nonsingular, the preconditioned matrix is a rational function of the matrix R^{-1}S. It follows that the eigenvalues μ of E_α satisfy μ = (1 + λ²)/(1 + αλ)², where λ is an eigenvalue of R^{-1}S. We want to choose α to minimize the spectral condition number.

Theorem 11. Assume that R is s.p.d. and S is s.p.s-d. Then, the extreme eigenvalues of the preconditioned matrix M_α, defined in (182), satisfy bounds expressed in terms of λ̄, the maximal eigenvalue of R^{-1}S. The spectral condition number is minimized when α = α̃, in which case the stated bound holds.

Proof. The bounds of the extreme eigenvalues follow by elementary computations on μ = (1 + λ²)/(1 + αλ)², 0 ≤ λ ≤ λ̄. Similarly, it is readily seen that μ_max/μ_min is minimized for some α in the interval α̃ ≤ α ≤ λ̄, where μ_max = 1. Hence, it is minimized for α = arg max_{α̃ ≤ α} (1 + α²)^{-1}, that is, for α = α̃.
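The eigenvalue relation μ = (1 + λ²)/(1 + αλ)² can be checked numerically. In the sketch below (our own reconstruction, since the displayed formulas are not reproduced here), the reduced matrix is taken as C = R + S R^{-1} S, obtained by eliminating one component from the real-valued system, and the preconditioner as (R + αS) R^{-1} (R + αS), whose application costs two solves with R + αS:

```python
import numpy as np

# Both C = R (I + X^2) and the preconditioner R (I + alpha*X)^2, with
# X = R^{-1} S, are rational functions of X, so the eigenvalues of the
# preconditioned matrix are mu = (1 + lambda^2)/(1 + alpha*lambda)^2.
rng = np.random.default_rng(1)
n = 20
Q = rng.standard_normal((n, n))
R = Q @ Q.T + n * np.eye(n)             # s.p.d.
P = rng.standard_normal((n, n // 2))
S = P @ P.T                              # s.p.s-d. (rank deficient)

C = R + S @ np.linalg.solve(R, S)        # reduced (Schur complement) matrix
lam = np.linalg.eigvals(np.linalg.solve(R, S)).real
alpha = 1.0                              # any alpha with R + alpha*S nonsingular

B = R + alpha * S
E = np.linalg.solve(B, R @ np.linalg.solve(B, C))   # preconditioned matrix
mu = np.sort(np.linalg.eigvals(E).real)
mu_pred = np.sort((1.0 + lam ** 2) / (1.0 + alpha * lam) ** 2)
print(np.max(np.abs(mu - mu_pred)))
```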
For applications, see [66]. An important application arises when one uses Padé-type approximations, and related implicit Runge-Kutta methods (see [67]), to solve initial value problems.

Historical Remarks.
Because incomplete factorization methods have had a strong influence on the development of preconditioning methods, we give here some historical remarks.
The idea of an incomplete factorization method goes back to early papers by Buleev [68], Varga [10], Oliphant [69], Dupont et al. [70], Dupont [71], and Woźnicki [72], where it was presented for matrices of a type arising from difference approximations of elliptic problems. The first more general form (unmodified methods for pointwise matrices) was studied for M-matrices by Meijerink and van der Vorst [46]. For a review and general formalism for describing such methods, see Axelsson [18], Birkhoff et al. [63], Beauwens [12], and Il'in [60]. For a similar but more involved type of method for difference matrices, which allowed for variable parameters from one iteration to the next, see Stone [73].
A modified form of the method, where a certain row sum criterion was imposed, was studied by Gustafsson [47].Actually, as is readily seen, the method of Dupont et al. [70] and as further discussed in Axelsson [59], using a perturbation technique, can be seen as a modified version of the general incomplete factorization method when applied to the five-point elliptic difference matrices, assuming that no fill-in is accepted outside the sparsity structure of A itself and assuming a natural ordering of the grid points.The advantage of modified versions is that they can give condition numbers of the iteration matrices that are of an order of magnitude smaller than for the original matrix.
The incomplete factorization method can readily be generalized to matrices partitioned in block matrix form. This was done first for matrices partitioned in block tridiagonal form in Axelsson et al. [49] and Concus et al. [50], the latter being based on earlier work by Underwood [74]. A general form was presented in Axelsson [67] and Beauwens and Ben Bouzid [52], where existence of the method was proven for M-matrices.
The existence, that is, the existence of nonzero pivot entries, of pointwise incomplete factorization methods for M-matrices was first shown by Meijerink and van der Vorst [46] and, for pointwise H-matrices, by Varga et al. [75]. The existence of incomplete factorization methods for M-matrices in block form was shown in Axelsson et al. [49] and Concus et al. [50] for block tridiagonal matrices; in Axelsson [76] and Beauwens and Ben Bouzid [52] for general block matrices; and in Axelsson and Polman [51] for relaxed versions of such methods.
Kolotilina [77] shows the existence of convergent splittings for block H-matrices, and Axelsson [78] shows the existence of general incomplete factorizations for block H-matrices.

Approximate Inverse Methods
In many applications, it is of interest to compute approximations of the inverse A^(-1) of a given matrix A, such that these approximations can readily be used in various iterative methods.
Let G denote an approximation of A^(-1). Following [6], we first present an example of an explicit and of an implicit method, followed by a general framework for computing approximate inverses. At the end, we present an efficient way to construct symmetric and positive definite approximate inverses.
An approximate inverse of a given operator may be constructed in several ways. The simplest is to use a Neumann expansion: let D^(-1)A = I − E, where D is, for instance, the diagonal of A.
Assuming that ‖E‖ < 1, the expansion A^(-1) = (Σ_{j≥0} E^j) D^(-1) is convergent, and any truncated part of this series provides an approximate inverse. However, this will normally give poor approximations. As we will see, more accurate approximate inverses can be constructed as best, possibly weighted, Frobenius norm approximations. In many applications the matrix A is sparse, but the exact inverse is in general a full matrix. A natural condition on G then arises: we can impose that G has some a priori chosen sparsity pattern (the same as A or different), which makes the calculations with G easy and cheap while still providing sufficient accuracy.
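The truncated Neumann expansion is easy to sketch. The following Python fragment is an illustration only; the small diagonally dominant test matrix is an assumption chosen so that ‖E‖ < 1 and the series converges.

```python
import numpy as np

def neumann_inverse(A, k):
    """Truncated Neumann series approximate inverse (sketch).
    With D = diag(A) and E = I - D^{-1} A, we have
    A^{-1} = (sum_{j>=0} E^j) D^{-1}; truncation after k terms gives G_k."""
    n = A.shape[0]
    Dinv = np.diag(1.0 / np.diag(A))
    E = np.eye(n) - Dinv @ A
    S, P = np.eye(n), np.eye(n)
    for _ in range(k):
        P = P @ E              # P = E^j
        S = S + P              # partial sum of the series
    return S @ Dinv

# Diagonally dominant test matrix (an assumption), so ||E|| < 1.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
for k in (1, 3, 7):
    G = neumann_inverse(A, k)
    print(k, np.linalg.norm(np.eye(3) - G @ A))   # error decreases with k
```

Since I − G_k A = E^(k+1), the error decays only geometrically with the truncation index, which illustrates why such approximations are normally poor unless ‖E‖ is small.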
Let A have order n and let S̄ = {(i, j) : 1 ≤ i ≤ n, 1 ≤ j ≤ n}. Any subset S ⊂ S̄ will be referred to as a sparsity pattern. S_L denotes the corresponding sparsity pattern restricted to the lower triangular part, and similarly a pattern restricted to the strictly lower triangular part will be used.
For simplicity, we use the same notation S for matrices having sparsity pattern S. Thus, A ∈ S if a_ij ≠ 0 ⇔ (i, j) ∈ S.
5.1. Explicit Methods. In these methods, an approximation of the inverse A^(-1) of a given nonsingular matrix A is computed explicitly, that is, without solving a globally coupled linear system of equations.
Let S be a sparsity pattern. We want to compute G ∈ S such that (GA)_ij = δ_ij, (i, j) ∈ S. (191) Some observations can be made from (191): (i) the elements in each row of G can be computed independently; (ii) even if A is symmetric, G is not necessarily symmetric, because g_ij, j ≠ i, and g_ji are, in general, not equal.
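Observation (i) makes the explicit method easy to sketch: each row of G is obtained from a small local linear system with A restricted to that row's sparsity pattern. The following Python fragment is a minimal illustration, under the assumption that the pattern of G is taken to be that of A itself.

```python
import numpy as np

def explicit_approx_inverse(A):
    """Explicit approximate inverse (sketch): for each row i, enforce
    (GA)_ij = delta_ij for the columns j in the pattern J_i, which is a
    small local solve with A restricted to J_i x J_i. The pattern of G
    is here taken to be that of A (an assumption)."""
    n = A.shape[0]
    G = np.zeros_like(A, dtype=float)
    for i in range(n):
        J = np.nonzero(A[i, :])[0]               # allowed positions in row i
        e = np.zeros(len(J))
        e[int(np.where(J == i)[0][0])] = 1.0     # local unit vector at column i
        G[i, J] = np.linalg.solve(A[np.ix_(J, J)].T, e)
    return G

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
G = explicit_approx_inverse(A)
print(np.round(G @ A, 3))    # equals the identity on the pattern of A
```

Each row is computed independently, as in observation (i); entries of GA outside the pattern are not controlled, which is why the quality depends on the dropped part of A.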

Implicit Methods. These methods require that A be factored first. In practice, they are used mainly for band or "envelope" matrices. The algorithm was presented in [79]. It is based on an idea in [80]; see also [81].
Suppose that A = LD^(-1)U is a triangular matrix factorization of A. If A is a band matrix, then L and U are also band matrices. Let L̃ and Ũ denote the corresponding strictly lower and strictly upper triangular parts, respectively.
The following lemma holds.

Lemma 5.
Using the above notation, the identities (i) and (ii) referred to below hold.

Journal of Electrical and Computer Engineering
Proof. Consider (193). Also, since DL^(-1) is lower triangular and U is upper triangular, using (i) we can compute entries in the upper triangular part of A^(-1) with no need to use entries of L^(-1). Similarly, using (ii) we can compute entries of the lower triangular part of A^(-1) without computing U^(-1).
Suppose now that A is a block banded matrix with semibandwidth p, and we want to form A^(-1) as block banded with semibandwidth q, q ≥ p. The identities (i) and (ii) can then be used for the computation of the upper and lower parts of A^(-1).

Remark 3. (i) The algorithm involves only matrix × matrix operations.
(ii) There is no need to compute any entries outside the bands.
(iii) If A is symmetric, then it suffices to execute only (i) or (ii).
(iv) It can be seen that (A^(-1))_nn = D^(-1)_nn.
There are two drawbacks with the above algorithm. It requires first the factorization A = LD^(-1)U, and even if A is s.p.d., the band matrix part of A^(-1) which is computed need not be s.p.d. The next example illustrates this.
Example 2. Consider an s.p.d. matrix for which the computed band part of the inverse is indefinite.

A General Framework for Computing Approximate Inverses. It turns out that both the explicit and the implicit method can be characterized as methods that compute the best approximation of A^(-1), in some norm, among all matrices having a given sparsity pattern. The basic idea is due to Kolotilina and Yeremin [38, 79]; see [6]. Recall that the trace function is defined by tr(A) = Σ_{i=1}^n a_ii, which also equals Σ_{i=1}^n λ_i(A). Let a sparsity pattern S be given, and consider the functional F_W(G) = tr((I − GA)W(I − GA)^T), where the weight matrix W is s.p.d. If W ≡ I, then F_W(G) is the squared Frobenius norm of I − GA.
Clearly F_W(G) ≥ 0, and if G = A^(-1) then F_W(G) = 0. We want to compute the entries of G so as to minimize F_W(G), that is, to find the minimizer G ∈ S. Some properties of the trace function are used to evaluate F_W. Further, since we are interested in minimizing F_W with respect to G ∈ S, we consider the entries g_ij, (i, j) ∈ S, as variables. The necessary conditions for a minimizing point are then (GAWA^T)_ij = (WA^T)_ij, (i, j) ∈ S. (202) Depending on the particular matrix A and the choice of S and W, (202) may or may not have a solution. We give some examples where a solution exists.
Example 3. Let A be s.p.d. and choose W = A^(-1), which is also s.p.d. Then, (202) implies (GA)_ij = δ_ij, (i, j) ∈ S, which is the formula for the previously presented explicit method; hence, that method is a special case of the more general framework for computing approximate inverses using a weighted Frobenius norm.
Example 4. Let W = (A^T A)^(-1), so that F_W(G) = ‖A^(-1) − G‖_F². Then (202) yields the relation for the previously presented implicit method. In this case, the entries of G are the corresponding entries of the exact inverse.
Example 5. Let W = I. This method is also explicit. We can expect such methods to be accurate only if all elements of A that are not used in the computations are zero or relatively small. In some cases, the quality of the computed approximation G of A^(-1) can be significantly improved using diagonal compensation of the entries of A outside S. The best approximation of A^(-1) in a (weighted) Frobenius norm is in general not symmetric and, as we have seen, not always positive definite. For this reason, the following alternative method is considered.

Constructing a Symmetric and Positive Definite Approximate Inverse. For some methods (such as the preconditioned Chebyshev and conjugate gradient iteration methods) it is important to use s.p.d. preconditioners. As we have seen, the methods described so far do not guarantee that G will be such a matrix.
In order to compute an s.p.d. approximate inverse of an s.p.d. matrix, we can proceed as follows. It will be shown that this approximation is a best approximation in the sense of minimizing the K-condition number of the correspondingly preconditioned matrix.

A Symmetric and Positive Definite Approximate Factorized Inverse. We seek an approximate inverse in the form G = LL^T, where L ∈ S_L; that is, S_L denotes the lower triangular part of the sparsity set S, and a corresponding notation is used for the strictly lower triangular part.
Theorem 12. Let A be s.p.d. and consider matrices L with sparsity pattern S_L. Let the matrix L̃ be computed by the following steps: (i) compute the entries L̃_ij for (i, j) ∈ S_L and set L̃_ij = 0 for (i, j) ∈ S_L^c (the complement set); (ii) let L = D(I − L̃), with D a diagonal scaling matrix. Then L ∈ S_L and L minimizes the K-condition number of LAL^T; the second factor in the bound (211) does not depend on d_i, so this factor can be minimized independently of d_i.
Consider then the general weighted Frobenius norm minimization problem min_{G∈S} tr((I − GB)W(I − GB)^T). (213) As we have seen, its solution G satisfies the relation (202), which here takes an explicit form. Since the minimization is done rowwise, it follows from (213), with the chosen matrices G, B, and W, that each row is minimized separately. By construction, L̃ satisfies (216), so the minimization problem has the corresponding solution. Consider next the first factor in (211). Here, a geometric average is less than or equal to an arithmetic average, with equality if and only if all α_j are equal; without limitation we can take α_j = 1, j = 1, . . ., n. Hence the bound follows, which completes the proof.
The method above provides a simple and cheap way to compute approximate inverses in factorized form. The proof of the theorem shows that the K-condition number is reduced as described in the next corollary.
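A construction in the spirit of Theorem 12, a factorized sparse approximate inverse of Kolotilina-Yeremin type, can be sketched as follows. The Python fragment is an illustration under two assumptions: the sparsity pattern S_L is taken as the lower triangular pattern of A, and rows are scaled so that diag(LAL^T) = I.

```python
import numpy as np

def fsai(A):
    """Factorized sparse approximate inverse G = L L^T (sketch).
    Pattern assumption: the lower triangular nonzero pattern of A.
    Row i of L solves a small local system with A restricted to the
    pattern and is then scaled so that diag(L A L^T) = I."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for i in range(n):
        J = [j for j in range(i + 1) if A[i, j] != 0.0 or j == i]
        e = np.zeros(len(J))
        e[-1] = 1.0                           # unit vector at position i
        y = np.linalg.solve(A[np.ix_(J, J)], e)
        L[i, J] = y / np.sqrt(y[-1])          # y[-1] > 0 since A is s.p.d.
    return L

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
L = fsai(A)
M = L @ A @ L.T
print(np.diag(M))                             # unit diagonal, by construction
print(np.linalg.cond(M), np.linalg.cond(A))   # conditioning is improved
```

Only small local systems of the size of each row's pattern are solved, so the construction is cheap and, like the explicit method, fully row-parallel.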
Corollary 2. Let L, D be defined as in Theorem 12. Then K(LAL^T) satisfies the corresponding bound; in effect, the trace is replaced by a product, that is, the nth power of an arithmetic average is replaced by the nth power of a geometric average. This is illustrated in the next example.
Example 6. Let S_L = {(1, 1), (2, 2), . . ., (n, n)}, that is, let L be a diagonal matrix. Then we find d_i² = a_ii, and Corollary 2 shows how K(LAL^T) compares with K(A): the nth power of the arithmetic average ā of the diagonal entries is replaced by the nth power of their geometric average g. Hence, the K-condition number K(LAL^T) of the diagonally scaled matrix LAL^T is substantially smaller than K(A) if the geometric average g of the diagonal entries a_ii of A is much smaller than their arithmetic average ā. This holds when the entries a_ii vary significantly. Note that it always holds that g ≤ ā.
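The effect in Example 6 is easy to check numerically. The sketch below is illustrative; the s.p.d. test matrix with strongly varying diagonal is an assumption. It uses the definition K(B) = (tr(B)/n)^n / det(B) and compares K before and after symmetric diagonal scaling.

```python
import numpy as np

def K(B):
    """K-condition number K(B) = (trace(B)/n)^n / det(B) for s.p.d. B."""
    n = B.shape[0]
    return (np.trace(B) / n) ** n / np.linalg.det(B)

# s.p.d. test matrix with strongly varying diagonal entries (illustrative)
A = np.array([[1.0,  0.1,   0.0],
              [0.1, 25.0,   0.5],
              [0.0,  0.5, 400.0]])
d = 1.0 / np.sqrt(np.diag(A))
As = A * np.outer(d, d)        # diagonally scaled matrix with unit diagonal
print(K(A), K(As))             # the scaled matrix has K close to 1
```

For this unit-diagonal scaling, K(A) = ā^n / det(A) while K(As) = (Π a_ii) / det(A) = g^n / det(A), so the ratio K(A)/K(As) equals (ā/g)^n, as the corollary indicates.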
We conclude this section by mentioning that the K-condition number can take large values even for seemingly harmless eigenvalue distributions.
Example 7 (arithmetic distribution). Let 0 < λ_1 < λ_2 < • • • < λ_n, and let the eigenvalues of B be distributed uniformly as an arithmetic sequence in the interval [a, b], a = λ_1, b = λ_n. For simplicity, assume that n/2 is even. Then the K-condition number grows with n. On the other hand, if b/a does not depend on n, then we have k ≤ O(ln 1/ε). Therefore, the estimate in Theorem 6 is here inferior even to the simple estimate in (56). For other distributions, however, Theorem 6 can give a smaller upper bound.

Augmented Subspace Preconditioning Method

Introduction; Preconditioners for Very Ill-Conditioned Problems. In this section, we consider the solution of systems Ax = b, where A is an n × n matrix which is symmetric and positive definite (s.p.d.) and can have a very large condition number, that is, be ill-conditioned. Such systems arise typically for near-limit values of some problem parameter (ratio of material coefficients, aspect ratio of the domain, nearly incompressible materials in elasticity theory, etc.). The condition number can additionally be very large due to the size of the matrix A (a small value of the discretization parameter) and also due to an irregular mesh and/or large aspect ratios of the mesh in partial differential equation (PDE) problems.
If the size of the system is not too large one can use direct solution methods, possibly coupled with an iterative refinement method.
Let B = LDL^T (or B = L̃L̃^T) be a triangular or Cholesky factorization of A. Due to finite precision computations (say, in single precision), B is in general only an approximation of A. The iterative refinement method takes the following form: (i) compute the residual r^(k) = b − Ax^(k); (ii) solve Bd^(k) = r^(k); (iii) update x^(k+1) = x^(k) + d^(k).
Frequently, one iterative refinement step suffices. The ability of iterative refinement to produce a more accurate solution vector depends crucially on how the computation of the residual vector r^(k) in (i) is implemented. A safe way is to use double precision for this computation, possibly with single precision in (ii) and (iii). However, as described in [30], if one rewrites the computation of Ax^(k) as a sum of differences, in some cases it suffices to use single precision in (i) also.
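A minimal sketch of the refinement loop: the factorization surrogate B is held in single precision, while the residual in step (i) is accumulated in double precision. The test matrix and the number of refinement steps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # s.p.d., moderately conditioned
x_true = rng.standard_normal(n)
b = A @ x_true

# Cholesky factor computed in single precision: B = L L^T approximates A
L = np.linalg.cholesky(A.astype(np.float32))

def solve_B(r):                            # apply B^{-1} using the float32 factor
    y = np.linalg.solve(L, r.astype(np.float32))
    return np.linalg.solve(L.T, y).astype(np.float64)

x = solve_B(b)                             # initial single-precision solution
for _ in range(3):
    r = b - A @ x                          # (i) residual in double precision
    d = solve_B(r)                         # (ii) solve B d = r
    x = x + d                              # (iii) update
print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

The final relative error is far below single-precision accuracy: each step contracts the error by roughly the condition number times the single-precision unit roundoff, down to the double-precision level of the residual computation.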
The computational labor is normally dominated by the initial factorization of A. For large systems this cost can become too big, since it grows rapidly with problem size. (For an elliptic difference problem on a 3D N × N × N mesh it grows as O(N^7) for certain band-matrix orderings. Furthermore, the memory needed to store the factor L grows as O(N^5). For certain nested dissection and other orderings, the complexity is somewhat reduced, however.) Therefore, iterative solution methods become the methods of choice. As we have seen in Section 1, the basic idea behind the iterative solution technique is to use a cheaper (incomplete) factorization or other approximation B of A and to compensate for this approximation by repeating the steps of the iterative refinement method until the residual is sufficiently small. In addition, to speed up the convergence of the method, one or more acceleration parameters are introduced; for instance, (iii) becomes x^(k+1) = x^(k) + α_k d^(k) for certain parameters α_k.
By a proper choice of B, the number of required iterations need not be too big, while the expense of solving systems with the matrix B need not be larger than a few "work units", for instance, a few actions (matrix-vector multiplications) of A on a vector. In this way, one can gain significantly in computational labor and in memory demand as compared with a direct solver. Actually, the direct solver can be viewed as an approximate factorization with the full amount of fill-in allowed, while, as we have seen, in an incomplete factorization method one controls the amount of fill-in either by using a predetermined sparsity pattern in L or by allowing a variable pattern, which depends on some relative drop tolerance. (Such a drop tolerance deletes a fill-in entry a_ij if |a_ij| ≤ ε √(a_ii a_jj), j ≠ i, for some ε, 0 < ε < 1. More details can be found in [82].)
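The drop-tolerance rule can be illustrated by a toy incomplete Cholesky factorization. The Python sketch below applies the criterion |a_ij| ≤ ε √(a_ii a_jj) to the entries of the reduced matrix, with the diagonal taken from the original A; this is one of several possible conventions, and ε = 0 recovers the exact factor.

```python
import numpy as np

def ichol_droptol(A, eps):
    """Incomplete Cholesky with a relative drop tolerance (sketch):
    an entry of the reduced matrix is dropped when
    |a_ij| <= eps * sqrt(a_ii * a_jj), with a_ii from the original
    diagonal (a convention). eps = 0 gives the exact Cholesky factor."""
    n = A.shape[0]
    a = np.diag(A).copy()
    L = np.tril(A).astype(float)
    for k in range(n):
        # drop small entries in column k of the reduced matrix
        for i in range(k + 1, n):
            if abs(L[i, k]) <= eps * np.sqrt(a[i] * a[k]):
                L[i, k] = 0.0
        L[k, k] = np.sqrt(L[k, k])
        L[k+1:, k] /= L[k, k]
        for j in range(k + 1, n):          # right-looking update
            L[j:, j] -= L[j:, k] * L[j, k]
    return L

# 1D Laplacian-type test matrix (illustrative)
n = 6
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L0 = ichol_droptol(A, 0.0)
print(np.allclose(L0 @ L0.T, A))           # exact factorization for eps = 0
Ld = ichol_droptol(A, 0.5)                 # larger eps gives a sparser factor
print(np.count_nonzero(Ld), np.count_nonzero(L0))
```

Note that for matrices that are not M-matrices, dropping entries can destroy positive definiteness of the reduced matrix; existence results of the kind surveyed above address exactly this question.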
A problem with iterative solution methods for ill-conditioned systems is that they may stagnate, that is, show no further improvement as the method proceeds. This occurs typically for minimum residual or minimum A-norm methods. For other types of methods even divergence may be observed. Another problematic issue is that a small residual norm does not necessarily mean that the error norm is sufficiently small, since ‖x − x^(k)‖ = ‖A^(-1)r^(k)‖ ≤ λ_min(A)^(-1) ‖r^(k)‖, and λ_min(A) takes very small values for ill-conditioned systems. Hence, even if ‖r^(k)‖ is small, ‖x − x^(k)‖ may still be large. For ill-conditioned systems one then typically sees a reduction of the residual to some limit value while the errors hardly decay at all. This was illustrated in Section 3.
For studies on the influence of inexact arithmetic, see, for example, [83][84][85]. The situation can be significantly improved by using a proper preconditioner. Then the corresponding bound involves the so-called preconditioned or pseudo-residual, and λ_min(B^(-1)A) ≫ λ_min(A) for a proper preconditioner. Therefore, the importance of choosing a proper preconditioner is twofold: (1) to increase the rate of convergence while keeping the expense of solving systems with B low, and (2) to enable a small error norm when the pseudo-residual is small.
Preconditioning methods, such as the modified incomplete factorization method and multigrid and multilevel methods, aim at reducing error components corresponding both to the large eigenvalues, with rapidly oscillating eigenvector components, and to the smaller eigenvalues, with smoother eigenfunctions. In the modified method, this is partly achieved by letting the preconditioner be exact for a particular smooth component of the solution, such as the constant component vector. It has been shown (see [6, 47]), when applied to elliptic difference problems, that under certain conditions the spectral condition number is reduced from O(h^(-2)) to O(h^(-1)). In multigrid methods, one works on two or more levels of meshes, where the finer grid should smooth out the fast, oscillating components of the iteration error, while the coarser mesh should handle the smooth components. Under certain conditions, such methods may reduce the above condition number to optimal order, O(1), as h → 0.
As it turns out, such standard preconditioning methods, namely (modified) incomplete factorization ((M)ILU) [46, 47], multigrid (MG) [90], and algebraic multilevel iteration (AMLI) [95][96][97] methods, may not be efficient with respect to both, and in particular the second, of the above-mentioned requirements. This might be because the smallest eigenvalue of the preconditioned system is caused by some problem parameter which these methods leave unaffected. Therefore, there is a demand for new types of preconditioners (or new combinations of already known preconditioners). To satisfy this need, two types of preconditioners have been constructed: (a) deflation methods and (b) augmented matrix methods, which we now describe.
Since normally λ_max(A) and λ_max(B_V^(-1)A_V) are readily estimated, the given choice of σ is viable. It may increase the maximal eigenvalue by a factor of 2, which is acceptable.

Corollary 4. If κ(B_V^(-1)A_V) ≤ λ_max(A)/λ_{m+1}, then the upper bound coincides with the bound obtained for B_V = A_V. In particular, if κ(A_V) ≤ λ_max(A)/λ_{m+1}, then one can simply let B_V = I. Hence, it is seen that the matrix A_V in the preconditioner can be replaced with a simpler matrix B_V, for which the action of B_V^(-1) is cheap, without deteriorating the eigenvalue bounds.
It still remains to weaken the assumption, as it is not easy to satisfy in practice. Due to space limitations this will not be presented here; instead, we refer to [56].

Krylov Subspace Methods for Singular Systems
Singular systems, that is, systems with a nontrivial kernel, arise in various applications, such as boundary value problems with pure Neumann boundary conditions imposed on the whole boundary. Nullspaces of large dimension may arise in finite element methods using edge elements (see, e.g., [104]) and in the analysis of Markov chains, when stationary probability vectors of stochastic matrices are computed (see [105, 106]; see also [107]).
A singular system does not always have a solution, and it is then more appropriate to consider the least squares problem: find x ∈ R^n such that ‖b − Ax‖ ≤ ‖b − Ax̃‖ for all x̃ ∈ R^n. We recall that a basic iterative solution method to solve a linear system either has the form x^(k+1) = x^(k) + τ_k r^(k) or the preconditioned form: solve Bδ^(k) = τ_k r^(k) (254) and update x^(k+1) = x^(k) + δ^(k). Here, B is a preconditioner to A. Similarly, as we have seen, more involved methods, such as (generalized) CG methods (GCG, GMRES, GCR, etc.), are based on approximations taken from a Krylov subspace K(A, r^0, k) = span{r^0, Ar^0, . . ., A^k r^0} or K(B^(-1)A, r^0, k). (256) In general, they are based on a minimum residual approach where, at each iteration step, we compute an updated solution that satisfies the best approximation property min_{x∈K(A,r^0,k)} ‖b − Ax‖ or min_{x∈K(B^(-1)A,r^0,k)} ‖b − Ax‖. (257) We will show that the convergence of such iterative solution methods can stall, or suffer a breakdown, when applied to certain singular systems. For the analysis, we will use the following properties of the relevant subspaces of a matrix A. In this generality, they can be stated for rectangular matrices.
Definition 2. Let A ∈ R^(m×n). Then R(A), of dimension ≤ n, called the range of A, is the subspace spanned by the column vectors a_.j; that is, y ∈ R(A) ⇔ y = Σ_{j=1}^n α_j a_.j. (258)
Definition 3. N(A), of dimension ≤ n, is the nullspace of A, that is, the subspace of vectors v ∈ R^n such that Av = 0. By a classical result (see, e.g., [6]), it holds that R(A)^⊥ = N(A^T).
Definition 4. A linear system Ax = b is called consistent if b ∈ R(A) and inconsistent otherwise. If Ax = b is consistent, there exists a solution. Clearly, any system with a nonsingular matrix A is consistent.
To provide a general method, applicable to all types of systems, we will use a least squares type of method, that is, determine an approximate solution x minimizing ‖b − Ax‖ (or, similarly, a preconditioned form).
In practice, the approximations are mostly computed by a Krylov subspace method, where at each step a solution x^k is computed that satisfies (257), or a preconditioned form of the method. As the next theorem shows, even if the system is consistent, a breakdown can occur.
Theorem 17. A minimum residual Krylov subspace method may suffer a breakdown for some initial vector if and only if R(A) ∩ N(A) contains a nontrivial vector.
Proof. The sufficiency follows since, if R(A) ∩ N(A) ≠ {0}, there exists a nonzero vector x ∈ N(A) which is also in R(A). Then, at some stage k, there may exist a residual r^k = y with Ay = x, x ∈ N(A). Hence, Ar^k = Ay = x, which implies A²r^k = Ax = 0, so a zero vector arises in the Krylov subspace at some stage. This means that convergence stalls. On the other hand, R(A) ∩ N(A) = {0} implies the existence of nonzero vectors A^k r^0 of any order in the Krylov subspace, which implies that there is an improved approximate solution at each higher stage k + 1.
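The breakdown mechanism can be seen on the smallest possible case: a 2 × 2 nilpotent matrix, for which R(A) = N(A) ≠ {0}. The Python sketch below is an illustration (not the example used later in the text); the minimum residual step over the Krylov subspace cannot reduce the residual.

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])     # A^2 = 0; R(A) = N(A) = span{e1}
b = np.zeros(2)
x0 = np.array([0.0, 1.0])      # initial guess; r0 = b - A x0 = -e1 lies in N(A)

r0 = b - A @ x0
K = np.column_stack([r0, A @ r0])   # Krylov basis: A r0 = 0, the space stops growing
# minimum residual step: minimize ||b - A (x0 + K z)|| over z
z, *_ = np.linalg.lstsq(A @ K, r0, rcond=None)
r1 = r0 - (A @ K) @ z
print(np.linalg.norm(r0), np.linalg.norm(r1))  # the residual cannot decrease
```

Here A r^0 = 0, so the Krylov subspace never grows beyond span{r^0}, and since Ar^0 = 0 the residual stays at its initial value: convergence stalls exactly as the theorem predicts.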

Example 8. Consider a matrix A for which Ae = 0 with e^T = (1, 1, . . ., 1, 0), that is, e ∈ N(A), and for which there is a solution y of Ay = e. Hence, e ∈ R(A) ∩ N(A), where e ≠ 0. Since A²r^0 = 0, with y as initial vector the Krylov subspace method for the system Ax = 0 stalls at the second step.
Remark 5. Corollary 5 can be extended to H-normal matrices, that is, matrices A which commute with their H-adjoint for some Hermitian matrix H; see, for example, [6].
A remedy to avoid breakdowns for matrices A for which the subspace V = R(A) ∩ N(A) is nontrivial is to work in a subspace orthogonal to V. This can be achieved by use of the augmented subspace projection method in Section 6. This method also helps to avoid situations where R(A) ∩ N(A) contains eigenvectors of A corresponding to nearly zero eigenvalues, causing a near breakdown or, in finite precision computations, an actual breakdown. For further comments on near breakdowns, see, for example, [83][84][85].

Concluding Remarks
Some milestones in the development of iterative solution methods have been presented. Through the combination of improved methods and developments in computer hardware, one can presently solve problems with a number of degrees of freedom nearly a billionfold larger than in the early computer age.
There still remain, however, very difficult problems, such as multiphysics and heterogeneous media problems and various forms of inverse problems, which need further improvements of the solution methods.
Some problems, such as those arising in constrained optimization and mixed finite element methods, lead to matrices of saddle point form. Due to space limitations, they have not been discussed in this paper; see, for example, [108]. In the last decades, much work has been devoted to multigrid, algebraic multigrid, and multilevel iteration methods, which have shown optimal order of performance for many types of problems; see, for example, [58]. Also, domain decomposition methods, which go back to the Schwarz alternating method, have seen major developments; see, for example, [109][110][111][112]. For an early survey of domain decomposition methods, see [113]. For the same reason, they could not be discussed in this paper. Much work has also been devoted to parallelization aspects of solution methods. This topic deserves a separate survey article and has also not been discussed in this paper.

Theorem 3 .
Assume that (a) A has property (A^π), and (b) the block matrix B = I − D^(-1)A has only real eigenvalues.

Theorem 8 .
Let A = D + L + U, where D is s.p.d. Let V, H, D be defined by (120), and assume that either (a) or (b) holds, where (a) U = L*; (b) L, U are s.p.d. and each pair of the matrices L, U, D commute. Then the eigenvalues λ of the matrix C^(-1)A, where C is defined in (122), are contained in the interval 0 < λ ≤ 1.

Theorem 9 .
Let A = D + L + U be a splitting of A, where A and D are s.p.d., and assume that either (a) U = L* or (b) L, U are s.p.d. and each pair of D, L, U commute. Then the eigenvalues of the matrix C^(-1)A are contained in the interval 0 < λ ≤ 1.

Corollary 5.
(a) If N(A) = N(A^T), in particular if A is symmetric, then minimal residual type methods based on the Krylov subspace converge. (b) More generally, this holds if A is an (H-)normal matrix.
Proof. (a) Since R(A)^⊥ = N(A^T) = N(A), it holds that R(A) ∩ N(A) = R(A) ∩ R(A)^⊥ = {0}. (262)
(b) For a normal matrix, there exists a unitary matrix U that diagonalizes A, that is, U^T A U = D. (263) Hence, U^T A^T U = D^T = D. (264) Therefore, AUx = 0 if and only if Dx = 0, and then also A^T Ux = 0. Since U is nonsingular, this implies N(A) = N(A^T), and the result follows from (a).
Finally, we recall the classical result that if A = M − N is a weak regular splitting, then the splitting is convergent if and only if A is monotone.