Linear Algebra and its Applications Taylor’s theorem for matrix functions with applications to condition number estimation

We derive an explicit formula for the remainder term of a Taylor polynomial of a matrix function. This formula generalizes a known result for the remainder of the Taylor polynomial for an analytic function of a complex scalar. We investigate some consequences of this result, which culminate in new upper bounds for the level-1 and level-2 condition numbers of a matrix function in terms of the pseudospectrum of the matrix. Numerical experiments show that, although the bounds can be pessimistic, they can be computed much faster than the standard methods. This makes the upper bounds ideal for a quick estimation of the condition number whilst a more accurate (and expensive) method can be used if further accuracy is required. They are also easily applicable to more complicated matrix functions for which no specialized condition number estimators are currently available. © 2016 The Authors. Published by Elsevier Inc. This is an

We derive an explicit formula for the remainder term of a Taylor polynomial of a matrix function. This formula generalizes a known result for the remainder of the Taylor polynomial for an analytic function of a complex scalar. We investigate some consequences of this result, which culminate in new upper bounds for the level-1 and level-2 condition numbers of a matrix function in terms of the pseudospectrum of the matrix. Numerical experiments show that, although the bounds can be pessimistic, they can be computed much faster than the standard methods. This makes the upper bounds ideal for a quick estimation of the condition number whilst a more accurate (and expensive) method can be used if further accuracy is required. They are also easily applicable to more complicated matrix functions for which no specialized condition number estimators are currently available.

Introduction
Taylor's theorem is a standard result in elementary calculus (see e.g. [17]). If f : R → R is k times continuously differentiable at a ∈ R, then the theorem states that there exists R k : R → R such that and R k (x) = o(|x − a| k ) as x → a. Depending on any additional assumptions on f , various precise formulae for the remainder term R k (x) are available. For example, if f is k + 1 times continuously differentiable on the closed interval between a and x, then for some c between a and x. This is known as the Lagrange form of the remainder.
Alternative expressions, such as the Cauchy form or the integral form for the remainder are well known [17].
Taylor's theorem generalizes to analytic functions in the complex plane: the remainder must now be expressed in terms of a contour integral. If f (z) is complex analytic in an open subset D ⊂ C of the complex plane, the kth-degree Taylor polynomial of f at a ∈ D satisfies and Γ is a circle, centred at a, such that Γ ⊂ D. See [1, Chap. 5, Sec. 1.2] for a proof of this result. The first goal of this paper is to generalize (2) to matrices, thereby providing an explicit expression for the remainder term for the kth-degree Taylor polynomial of a matrix function. Note that it will not be possible to obtain an expression similar to (1) because its derivation relies on the mean value theorem which does not have an exact analogue for matrix-valued functions. Our second goal is to investigate applications of this result in bounding the derivatives and condition numbers of matrix functions via pseudospectra.
Convergence results for Taylor polynomials of matrix functions have been known since the work of Hensel [8], Turnbull [20], and Weyr [21] (see [11,Thm. 4.7] for a more recent exposition). Mathias [15] also obtains a normwise truncation error bound for matrix function Taylor polynomials which form part of the Schur-Parlett algorithm [4]. There are also a number of remainder theorems within the operator theory literature which can be applied to matrix functions. However, to our knowledge, this paper represents the first time an explicit remainder term (as opposed to a bound) has been specifically obtained for the Taylor polynomial of a matrix function.
The remaining sections of this paper are organized as follows. In section 2 we state and prove the remainder term for the kth-degree Taylor polynomial of a matrix function. In section 3 we investigate some applications of this result by bounding the first order remainder term using pseudospectral techniques and relating it to the condition number of f (A). In section 4 we extend these results to the level-2 condition number of a matrix function, introduced in [13]. In section 5 we examine the behaviour of the pseudospectral bounds on some test problems and show that they can be computed efficiently. Finally in section 6 we present our conclusions and discuss some potential extensions of this work.

Remainder term for Taylor polynomials
The Taylor series theorems found in Higham's monograph [11] primarily involve expanding f (A) about a multiple of the identity matrix I: Our starting point is the more general Taylor series expansion in terms of Fréchet derivatives, obtained by Al-Mohy and Higham [2,Thm. 1]. Suppose that f has a power series expansion ∞ j=0 a j x j with radius of convergence r > 0 centered at the origin. The interior of the circle |x| < r defines a simply connected open set D. Then, given A, E ∈ C n×n with Λ(A), Λ(A +E) ⊂ D (where Λ(X) denotes the spectrum of the matrix X), Al-Mohy and Higham proved that where D [j] They called the D [j] f (A, E) terms Fréchet derivatives. More precisely, the terms D [j] f (A, E) are a special case of the jth order Fréchet derivatives described by Higham and Relton [13], in which the perturbations in the j directions are all E. The first of these terms, D [1] f (A, E), coincides with the "standard" Fréchet derivative L f (A, E). Additionally, if A and E commute then we have D Before writing down the remainder term obtained by truncating the Taylor series in (3), we first recall the standard result that, for any invertible A and B, We will also need the following lemma.
where X denotes the derivative of X, and that, since higher derivatives of X vanish, The result then follows by substituting X = A − tB and setting t = 0. 2 Furthermore, we note that by the Cauchy-Hadamard theorem any power series in the complex plane converging to a function f must converge on a circular domain with radius of convergence r (which can be infinite). In the following results, for the purpose of maximizing generality, we say that f has a power series expansion which converges on a simply connected open set D. Clearly D must be a subset of this circular domain, but need not be circular itself. The reason for this distinction is that the -pseudospectra of A, used in section 3, give rise to sets of differing shape.
We now state and prove the main result of this paper, which gives an explicit form of the remainder term when truncating (3).
and Γ is a closed contour in D enclosing Λ(A) and Λ(A + E).
Proof. The result is proved by induction on k. For the case using the Cauchy integral definition of a matrix function. It follows from (5) that For the inductive step, we assume that f (A + E) = T k (A, E) + R k (A, E). The remainder for the (k + 1)st degree Taylor polynomial is given by Substituting the inductive hypothesis for R k (A, E) and the Cauchy integral form for f (A + tE), assuming that t is sufficiently small, gives By the continuity of f we can apply the Leibniz integral rule to differentiate the integrand in the second term and simplify it using Lemma 2.1. We obtain where (5) has been used once more. This completes the proof. 2 We end this section by briefly describing how Theorem 2.2 also allows us to obtain a remainder term for Padé approximants (this was first done in the scalar case by Elliot [5]).
Suppose that we approximate f (z) using a rational function p m (z)/q n (z), where p m (z) and q n (z) are polynomials of degree m and n respectively. The Padé approximant is the unique choice (up to scalar multiples) of p m (z) and q n (z) such that f (z) −p m (z)/q n (z) = O(z m+n+1 ). Therefore, using the same rational function to approximate the corresponding matrix function, we have q n (X)f (X) −p m (X) = O( X m+n+1 ). We introduce the truncation error term S m,n (X) to the Padé approximant such that Then, by rearranging the above, The term q n (X)S m,n (X) is then the remainder term if we consider p m (X) to be a power series expansion of q n (X)f (X) when we set A = 0 and E = X. The remainder has degree at least m + n and so, by applying (7) with k = m + n, we obtain where the closed contour Γ encloses Λ(X) and the origin.

Application to condition numbers and pseudospectra
In this section we use Theorem 2.2 to study the behaviour of the condition number of a matrix function, which measures the sensitivity of f (A) to small perturbations in A. The results in this section are applicable for any induced matrix norm. Our approach requires borrowing a number of techniques from the analysis of pseudospectra. Recall that the -pseudospectrum of a matrix X is the set To begin, the following lemma provides some pseudospectral bounds on the size of the remainder terms.

Lemma 3.1. Let f and D satisfy the criteria of Theorem 2.2. Furthermore let > 0 be such that Λ (A) ⊂ D and Λ (A + E) ⊂ D, and take Γ ⊂ D to be a closed contour that encloses both Λ (A) and Λ (A + E). Then the remainder term R k (A, E) is bounded by
where L is the length of Γ . In particular, when a circular contour centered at 0 is used, where ρ = max{|z| : z ∈ Λ (A + E) ∩ Λ (A)} is the radius of the circle.
(Note that tildes on L , Γ , and ρ are used because, for this result only, the contour needs to enclose Λ (A + E) in addition to Λ (A). For subsequent results, the contour need only enclose Λ (A) and the tildes are dropped.) Proof. The proof is analogous to that of the bound obtained by Trefethen and Embree [19,Ch. 14]. We bound the norm of R k (A, E) by noting that The first part of the lemma follows immediately. For the second part, take Γ to be a circle with center 0 and radius ρ = max{|z| : We can also use this result to bound the absolute condition number of a matrix function. Recall that the absolute condition number measures the first order sensitivity of f (A) to small perturbations in A and is given by [11,Chap. 3] cond abs (f, Lemma 3.1 provides us with the following bound on the absolute condition number.

Corollary 3.2. Let f and D satisfy the criteria of Theorem 2.2. Let
> 0 be such that Λ (A) ⊂ D, and let Γ ⊂ D be a closed contour of length L that encloses the -pseudospectrum. Then In particular, when a circular contour centered at 0 is used, where ρ = max{|z| : z ∈ Λ (A)} is the pseudospectral radius of A.
Proof. Set k = 0 in (9). Consider E = α < so that, by an equivalent definition of the -pseudospectrum, we have Λ We divide by α and take the supremum over all E such that E ≤ α to obtain The proof of (12) is completed by taking the limit α → 0 and recalling that the absolute condition number of a matrix function is given by the operator norm of the Fréchet derivative (11). The proof of (13) is essentially the same, except that (10) is taken as the starting point rather than (9).
Note that an alternative proof of the corollary can be obtained by starting with the integral representation of the Fréchet derivative and bounding it above using the techniques from the proof of Lemma 3.1. 2 Assuming that these bounds can be computed efficiently they are of considerable interest as most existing results regarding the estimation of the condition number provide only lower bounds [11,Chap. 3]. Indeed, this is particularly interesting when combined with a bound on the size of the -pseudospectrum given by the following result. Since the numerical radius, r(A) := sup z∈W (A) |z|, is equal to A 2 we know that the -pseudospectral radius is no larger than A 2 + . Thus we obtain the following corollary.
Proof. The circle of radius A 2 + around the origin encloses Λ (A) and is of length 2π( A 2 + ). Using this contour in (13) gives the desired result. 2 One potential application of this result is in the design and analysis of algorithms for computing matrix functions. Many such algorithms work by rescaling A to be of small norm, applying the function to this scaled matrix (via a Padé approximant or Taylor series), and then undoing the effect of the scaling. This corollary may allow us to better understand the numerical effect of applying the matrix function to the scaled matrix, since such analysis is typically done only in exact arithmetic.
We end this section by briefly mentioning a related theorem due to Lui [14,Thm. 3.1], concerning the relationship between the pseudospectra of A and f (A). The theorem is restated here in our notation. Recall that R k (A, E) was defined in Theorem 2.2 and that This result shows that, to first order in , the -pseudospectrum of A is related to the

Application to higher order condition numbers
Higham and Relton [13] introduce the level-q condition number for matrix functions, which is defined recursively by cond (q) where cond (1) abs (f, A) := cond abs (f, A). In section 3 we focused on the first order remainder term, R 0 (A, E), and results concerning the condition number cond abs (f, A) but -by choosing k > 0 in Lemma 3.1 -we can attempt to extend results such as Corollary 3.2 to these higher order condition numbers.
Before proceeding, we must first investigate the relationship between the D [j] f (A, E) defined in (4) and higher order Fréchet derivatives. Recall that D [j] f (A, E) is a special case of the jth order Fréchet derivative in which the perturbation in each direction is E.
In [13] a definition of the jth order Fréchet derivative, assuming it is continuous in A, is given in terms of the mixed partial derivative: The following theorem expresses this jth order Fréchet derivative in terms of a contour integral.
where S j is the set of permutations of {1, 2, . . . , k}. In particular the derivative D [j] Proof. For any choice of s i (in some neighbourhood of 0) and E i , we can write f (A + s 1 E 1 + · · · + s j E j ) as a Cauchy integral by using the standard Cauchy integral definition of a matrix function and choosing a contour Γ that encloses Γ and some neighbourhood of Λ(A). Then (16) becomes Using the Leibniz integral rule, the differential operator can be brought inside the integral sign. The required integrand is then obtained by using the identity The result (17) follows by then restricting the contour to the closed curve Γ containing Λ(A). The second part of the theorem, (18), follows by setting E 1 = · · · = E j . 2 Theorem 4.1 shows that, to first order, the kth remainder term in the Taylor series is simply the (k + 1)st derivative, as we might expect. Specifically, comparing (18) with (7) we find In addition, Theorem 4.1 allows us to prove the following theorem, which uses the pseudospectrum of A to bound the norm of the jth order Fréchet derivative.

Theorem 4.2. Let f satisfy the criteria of Theorem 4.1 and let Γ be a closed contour enclosing Λ (A) such that f is analytic inside and on Γ . Then the jth order Fréchet derivative can be bounded by
where L is the length of Γ .
Proof. In (17) use the contour Γ , take norms, and note that (zI − A) −1 ≤ −1 on Γ . 2 It would be desirable to obtain a bound on the level-q condition number, by first bounding it in terms of the norm of the qth Fréchet derivative and then applying Theorem 4.2. However, in the general case such bounds prove to be far too weak to be of any interest. Instead we restrict ourselves to the case q = 2 and the level-2 condition number.
When a circular contour centered at 0 is used, where ρ , the pseudospectral radius, is the radius of the circle.
Proof. Higham and Relton [13,Sec. 5] give an upper bound for the level-2 absolute condition number in terms of the norm of the second Fréchet derivative cond (2) Substituting the bound from (19) into (20) gives the required results. 2

Numerical experiments
In this section we show how our pseudospectral bounds on the condition number, (12) and (13), can be used to estimate the condition number of matrix functions in practice. We also find that they are cheaper than alternative approaches and, therefore, one might use the pseudospectral bound as a quick estimate of the condition number. If this estimate is unsatisfactorily large we can use existing methods to estimate it more accurately. The term "unsatisfactorily large" can be made precise in the following manner: many applications only require the first few digits of the result to be correct so that a relative error of, for example, 1e-4 is perfectly acceptable. When using a backward stable algorithm the relative error is approximately bounded above by the condition number multiplied by the unit roundoff (u = 2 −53 in IEEE double precision arithmetic).
Throughout this section, to compute our bound on the condition number, we will be using (12) cond where Γ is a closed contour of length L that encloses the pseudospectrum of A and lies within the region where f has a convergent power series. Recall also that the relative condition number, cond rel (f, A), is given by Combining these two results allows us to bound the relative condition number from above. This bound will be cheap to compute provided that the cost of computing L and max z∈Γ |f (z)| is sufficiently small. In order to use this bound in practice we must choose which matrix norm to consider, the value of , and the contour Γ . We will use the Frobenius norm since, in this norm, there is an explicit formula for the condition number which can be computed using [11,Alg. 3.17]. However, the pseudospectrum is not defined in the Frobenius norm, since it requires the use of an induced norm. To resolve this, one can easily show that the absolute condition number in the Frobenius norm is bounded above by √ n times the condition number in the 2-norm, where n is the size of the matrix. Hence we have The right-hand side of this equation is what we will compute, where the cond abs (·) term is bounded above by (12). It remains to choose and Γ . Looking at (12) we see that, heuristically, in order to minimize the upper bound we would like to be reasonably far from 0. Some of our test functions will have power series that are convergent in a circle of radius 1 around the point z = 0; for these cases we choose Γ to be a circle centered at 0 with radius 0.99 and find the largest such that the -pseudospectral radius lies inside this circle. This is computed using the nonlinear optimization routine fminbnd in MATLAB. When our function has a power series with an infinite radius of convergence we choose = 1 and take Γ to be a circle centered at the mean of the eigenvalues of A (γ = 1 n λ i ) with radius equal to the -pseudospectral radius of A − γI. Finally, to find max |f (z)| on these contours, we again use the nonlinear optimization routine fminbnd in MATLAB. We use psapsr by Guglielmi and Overton [7] to compute the -pseudospectral radii throughout.
We will compare our pseudospectral method described above (hereafter referred to as condpseudo) in the Frobenius norm against two alternative methods for computing the condition number: funm_condest_fro from the Matrix Function Toolbox [10] and an "exact" method detailed by Higham [11,Alg. 3.17], which we refer to as condold and condexact, respectively.
The method condold uses finite difference approximations to the derivatives of the matrix function and has O(n 3 ) cost. Meanwhile condexact expresses the condition number as the 2-norm of a matrix K f (A) ∈ C n 2 ×n 2 , called the Kronecker form of the matrix function, which must be computed explicitly with cost O(n 5 ). Therefore condexact is impractical for all but the smallest problems.
Our first experiment compares condpseudo and condold to condexact in terms of accuracy and reliability on a range of matrix functions. The purpose of this experiment is to confirm that condpseudo does indeed return an upper bound on the condition number and that this upper bound is not much larger than the exact value. We compare the three different algorithms on four matrix functions corresponding to the scalar functions log(1 +x), (1 +x) 1/15 , exp(x), and cos(x). The first two of these have a power series representation which is convergent for |x| < 1, whilst the latter have globally convergent power series. The matrix functions are computed using logm and expm in MATLAB, along with cosm from [3], and powerm_fre_new by Higham and Lin [12].
For each function we use 29 test matrices (of size n = 10) from the Matrix Computation Toolbox [9] and plot both the computed condition numbers and the ratio of condpseudo and condold to condexact. For the first two functions, where we need all eigenvalues to lie within the region of convergence, we transform each matrix to have eigenvalues centered at 0 with A 2 = 1 so that all eigenvalues lie within the unit disk.
In Fig. 1 we see the condition number as computed by the three methods, for each of the 29 test matrices, using the function f (x) = log(1 + x). The results are ordered by decreasing condition number as computed by condexact. We can immediately see that condpseudo is indeed an upper bound and is usually 2-4 orders of magnitude larger than the exact condition number. Meanwhile condold is generally a very good estimate of the condition number.
Next in Fig. 2 we compare the condition numbers when using the function f (x) = (1 + x) 1/15 . In this case we see very similar behaviour to the previous function: condold and condexact are almost identical whilst condpseudo provides an upper bound that is generally 2-4 orders of magnitude larger than the true condition number. Fig. 3 shows the results when using f (x) = exp(x). In this case condpseudo performs slightly better than previously being only 1-3 orders of magnitude above condexact on most test problems.
Finally, Fig. 4 displays the results for f (x) = cos(x). In this case, as for the exponential we see that condpseudo is a reliable upper bound, generally being 1-3 orders of magnitude above condexact except for one case on the left-hand side in which it is more than 10 orders of magnitude larger. This is due to the matrix in question having eigen-   values extending far into the complex plane: as the cosine function grows exponentially in the direction of the imaginary axis | cos(z)| is extremely large on the chosen contour.
Each of these four cases shows that condpseudo provides a reliable upper bound on the condition number and is generally just a few orders of magnitude above the true value. We also note that, since condpseudo needs only the scalar function f (x) and does not need to compute the derivatives of a matrix function, via finite differences or otherwise, it can easily be applied to very complicated matrix functions such as cos( √ A) (for which no specially designed algorithms exist) with no modification. In this particular case we would apply our algorithm to the function which is analytic and equivalent to cos( √ A) away from the origin, regardless of which branch of the square root function is selected. Matrix functions such as this can arise in finite element semidiscretization of the wave equation. For example, the second order differential equation has the solution where √ A denotes any square root of A [6, p. 124], [18]; see also [11,Prob. 4.1] for the case g(t) = 0.
Our next experiment compares the speed of estimating the condition number as the size of the matrix grows. Here we focus on the function f (x) = (1 + x) t for t = 1/5, 1/15, 1/52 and for n between 10 and 1000. For each value of n we take A to be a matrix with elements normally distributed with zero mean and unit variance, scaled to have unit norm. Since condexact is an O(n 5 ) algorithm it becomes increasing impractical as n grows: instead we will compare condpseudo against condold and a different algorithm, condhili. This latter algorithm, designed by Higham and Lin [12], estimates the condition number of matrix powers in a similar manner to condold but actually computes the derivatives of the matrix function, as opposed to using finite difference approximations. The algorithm is designed to estimate the condition number in the 1-norm but has similar computational complexity to condold which works in the Frobenius norm. This experiment was run on a laptop with an Intel dual-core i7 processor using MATLAB R2014b. Fig. 5 shows the results of this experiment. On the left are the runtimes using each of the 3 algorithms to compute the condition number of (I + A) 1/t for the various values of t whilst the right-hand plot shows the speedup when using condpseudo relative to the other methods. The x-axis shows n, the size of the matrices, whilst the y-axis shows the runtime in seconds (left-hand plot) and the speedup obtained (right-hand plot). We see that condpseudo is much cheaper than the alternatives for fairly small matrices and appears to settle at around 1.5 times faster than condold and 2 times faster than condhili, respectively, on this machine. This would suggest that using condpseudo is beneficial for applications where low-accuracy solutions are required and is particularly good in situations where lots of small matrix functions need to be computed.

Conclusions
The main results in this paper are as follows. We have obtained an explicit expression for the remainder term of a matrix function Taylor polynomial (Theorem 2.2). Combining this with use of the -pseudospectrum of A leads to upper bounds on the condition numbers of f (A). Our numerical experiments demonstrated that our bounds can be used for practical computations: they provide a cheap upper bound on the condition number which is often only a few orders of magnitude too large. This means that our bounds could be used as a quick estimate of the condition number and if this estimate is too large, for instance if the estimate suggests that an insufficient number of correct significant figures might be obtained in computing f (A), then existing methods can be used to obtain the condition number more accurately.
Another benefit of our approach is that it can easily be applied to bound the condition number of complicated matrix functions such as cos( √ A) without modification, as there are currently no specialized methods for computing such quantities.
Our results may also have further useful applications in the development of matrix function algorithms by allowing us to estimate the size of the remainder terms for Padé approximants, for example. We may also be able to glean further insight into the behaviour of existing algorithms to compute matrix functions (see the discussion of Corollary 3.4). This will be the subject of future work.