Sharper Probabilistic Backward Error Analysis for Basic Linear Algebra Kernels with Random Data

Abstract. Standard backward error analyses for numerical linear algebra algorithms provide worst-case bounds that can significantly overestimate the backward error. Our recent probabilistic error analysis, which assumes rounding errors to be independent random variables [SIAM J. Sci. Comput., 41 (2019), pp. A2815–A2835], contains smaller constants but its bounds can still be pessimistic. We perform a new probabilistic error analysis that assumes both the data and the rounding errors to be random variables and assumes only mean independence. We prove that for data with zero or small mean we can relax the existing probabilistic bounds of order √n u to much sharper bounds of order u, which are independent of n. Our fundamental result is for summation and we use it to derive results for inner products, matrix–vector products, and matrix–matrix products. The analysis answers the open question of why random data distributed on [−1, 1] leads to smaller error growth for these kernels than random data distributed on [0, 1]. We also propose a new algorithm for multiplying two matrices that transforms the rows of the first matrix to have zero mean and we show that it can achieve significantly more accurate results than standard matrix multiplication.

explain. We found that the backward errors of some key computational kernels depend on the data in a way not reflected in the backward error bounds, which depend only on n and norms of the data. We show an example of this behavior in Figure 1.1. The figure reports the backward error for summing n numbers randomly sampled from a uniform distribution. Even though the backward error bound for summation (which is (n − 1)u in the worst case and √(n − 1) u for the probabilistic analysis from [8, Thm. 3.1]) does not depend on the summands, the figure shows that the actual backward error strongly depends on the interval from which the data are sampled. For the [0, 1] interval, the backward error is of order √n u, as predicted by the probabilistic bound, but for the [−1, 1] interval the error is much smaller, seemingly independent of n. Similar experiments showing strong variability in the error for different data distributions can be found in the literature [1], [3], [4], [5], [14], [20]. It is thus clear that the probabilistic bounds from [8] are not sharp for all data. At the same time, since most of the bounds in [8] are sharp for the [0, 1] data, they cannot be improved without additional assumptions.
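The phenomenon of Figure 1.1 is easy to reproduce. The following sketch (ours, not the paper's MATLAB code) sums n single-precision numbers by recursive summation and measures the backward error |ŝ − s| / ∑|x_j| against a double-precision reference; np.cumsum accumulates sequentially in the array's dtype, so its last entry is the recursively summed value.

```python
import numpy as np

# Backward error of recursive summation in binary32 (u ~ 6e-8), with the
# "exact" sum computed in binary64.
def backward_error(x):
    shat = np.cumsum(x.astype(np.float32))[-1]   # recursive sum in fp32
    s = np.sum(x.astype(np.float64))             # reference sum in fp64
    return abs(float(shat) - s) / np.sum(np.abs(x))

rng = np.random.default_rng(0)
n = 100_000
err_01 = backward_error(rng.uniform(0.0, 1.0, n))    # mean 1/2
err_sym = backward_error(rng.uniform(-1.0, 1.0, n))  # mean 0

# [0, 1] data: error of order sqrt(n)*u; [-1, 1] data: error of order u.
print(err_01, err_sym)
```

For [0, 1] data the measured error is several orders of magnitude larger than for [−1, 1] data, matching the figure.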
In this work we perform a new probabilistic backward error analysis that uses probabilistic models of both the data and the rounding errors. It uses a martingale in order to require only mean independence of the random variables. We prove that the difference observed in Figure 1.1 is related to the mean of the data. Our analysis shows that when the data has very small mean, such as for [−1, 1] uniformly sampled numbers, the probabilistic backward error bound for summation is reduced from √n u to cu, where c is a moderate constant independent of n. We first obtain this result for recursive summation and then apply it to inner products, matrix–vector products, and matrix–matrix products.
The error analysis motivates the following idea: given an inner product-based computation and arbitrary data, transform the data to have entries of zero mean, perform the computations, and transform back to obtain the result. We implement this idea for matrix multiplication, for which the overhead cost of the transformation is asymptotically negligible. Our numerical experiments demonstrate that this new algorithm can reduce the error by several orders of magnitude and may therefore be especially attractive for low precisions.
We begin, in section 2, by performing a probabilistic analysis of recursive summation based on the assumption that the rounding errors are mean independent random variables. In section 2.1 we derive a stronger version of a result from our previous work [8], for general data. In section 2.2 we introduce separate probabilistic models of the summands and the rounding errors, obtaining backward error bounds that depend on both the mean and the maximum magnitude of the data. We then apply our bounds to inner products, matrix-vector products, and matrix-matrix products in section 3. In section 4 we propose a new algorithm for multiplying two matrices that achieves a smaller backward error by transforming the rows of the first matrix or the columns of the second matrix to have zero mean. Finally, we provide our conclusions in section 5.
Throughout the article, we illustrate our analysis with numerical experiments carried out with MATLAB R2018b. We have made our codes available online.¹
2. Probabilistic backward error analysis for summation. In this section, we focus on recursive summation, which is the standard method of summation that forms s = ∑_{j=1}^n x_j by setting s = x_1 and then executing s ← s + x_j, j = 2: n. We apply our analysis to inner products and matrix–vector products in section 3.
We begin in section 2.1 with some preliminary results for general data x j . In particular, we derive in Theorem 2.4 a stronger version of a result from our previous paper [8,Thm. 3.1], which does not require the assumption that the rounding errors are independent.
Then, in section 2.2, we turn to the main goal of this paper: to obtain a sharper probabilistic backward error bound by exploiting more information about the summands x j . In particular, we obtain in Theorem 2.8 an improved error bound for random independent data.
2.1. Probabilistic analysis for general data. We denote the expectation (mean) of a random variable x by E(x). We use the standard model of floating-point arithmetic [7, sec. 2.2]: fl(a op b) = (a op b)(1 + δ), |δ| ≤ u, op ∈ {+, −, ×, /}, (2.1) where u is the unit roundoff. This model holds for IEEE arithmetic [12]. Indeed, the IEEE standard requires more: that fl(a op b) is the correctly rounded (to nearest) value of a op b. We will refer to δ as the rounding error in the operation, though this term might more naturally be applied to a op b − fl(a op b).
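The model (2.1) can be checked numerically (our illustration): for round-to-nearest binary32 arithmetic, fl(a op b) = (a op b)(1 + δ) with |δ| ≤ u = 2⁻²⁴, where the exact result is emulated in binary64 (exact for +, −, ×; correct to a relative 2⁻⁵³ for ÷, hence the small slack in the check).

```python
import numpy as np

u = 2.0**-24  # unit roundoff of binary32
rng = np.random.default_rng(1)
a = rng.uniform(0.1, 10.0, 1000).astype(np.float32)
b = rng.uniform(0.1, 10.0, 1000).astype(np.float32)

max_delta = 0.0
for op in (np.add, np.subtract, np.multiply, np.divide):
    fl = op(a, b)                                    # computed in binary32
    exact = op(a.astype(np.float64), b.astype(np.float64))
    max_delta = max(max_delta, np.max(np.abs((fl - exact) / exact)))
print(max_delta <= u * (1 + 1e-6))   # |delta| <= u, up to fp64 slack
```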
To derive our probabilistic error bounds we will use the following model of rounding errors in a given computation.
Note that, unlike the model used in our previous analysis [8], Model 1 does not assume independence of the rounding errors. We only require their mean independence, which is the weaker assumption that the conditional mean E(δ_k | δ_2, …, δ_{k−1}) is equal to the unconditional mean E(δ_k) = 0. Indeed, independent random variables are also mean independent, but the converse does not hold in general. Mean independence is also a stronger property than uncorrelatedness, because E(X | Y) = E(X) implies E(XY) = E(X)E(Y). This latter fact is a consequence of the law of total expectation (or tower property) E(X) = E(E(X | Y)) [19, p. 401]. We will also need a more general form of the law of total expectation, E(X) = E(E(X | Y_1, …, Y_k)). Note that for general random variables X_i, the expression E(X_k | X_1, …, X_{k−1}) is not a real value, but a random variable itself, which takes the value E(X_k | X_1 = x_1, …, X_{k−1} = x_{k−1}) when X_1 = x_1, …, X_{k−1} = x_{k−1}, as defined by [17, Def. 2.7].
The backward error for an approximate sum ŝ is ε_bwd(ŝ) := min{ε ≥ 0 : ŝ = ∑_{j=1}^n x_j(1 + θ_j), |θ_j| ≤ ε} = |ŝ − s| / ∑_{j=1}^n |x_j|, (2.2) where the last equality follows from the Oettli–Prager theorem [7, Thm. 7.3], [18]. We begin with a lemma on the rounding error analysis of recursive summation.
Lemma 2.1. Recursive summation produces a computed ŝ satisfying ŝ_i = (ŝ_{i−1} + x_i)(1 + δ_i), i = 2: n, (2.4) with ŝ_1 = x_1, and hence ŝ − s = ∑_{i=2}^n T_i δ_i + O(u²), where T_i = ∑_{j=1}^i x_j is the ith partial sum.

Proof. Writing (2.4) as ŝ_i − ŝ_{i−1} − x_i = (ŝ_{i−1} + x_i)δ_i and summing this equality for i = 2: n yields ŝ − s = ∑_{i=2}^n (ŝ_{i−1} + x_i)δ_i, using (2.4). We conclude the proof by using ŝ_{i−1} + x_i = T_i + O(u).

We need the concept of a martingale.
Definition 2.2 (martingale). A sequence of random variables E_1, …, E_n is a martingale with respect to the sequence X_1, …, X_n if, for all k, E(|E_k|) < ∞, E_k is a function of X_1, …, X_k, and E(E_k | X_1, …, X_{k−1}) = E_{k−1}. We will use the following inequality [17, Thm. 13.4].
Lemma 2.3 (Azuma–Hoeffding inequality). Let E_1, …, E_n be a martingale such that |E_k − E_{k−1}| ≤ c_k for k = 2: n. Then, for any λ > 0, the inequality |E_n − E_1| ≤ λ(∑_{k=2}^n c_k²)^{1/2} holds with probability at least 1 − 2 exp(−λ²/2).

Before deriving our new results based on a probabilistic model of the data, we obtain a stronger version of a result from our previous paper [8, Thm. 3.1], using the less restrictive Model 1.

Theorem 2.4. Let s = ∑_{j=1}^n x_j be computed by recursive summation. Under Model 1, the computed ŝ satisfies |ŝ − s| ≤ λ√(n − 1) u ∑_{j=1}^n |x_j| + O(u²) (2.6) with probability at least P(λ) = 1 − 2 exp(−λ²/2).
Proof. Define E_1 = 0 and E_k = ŝ_k − s_k for k = 2: n, where s_k = ∑_{j=1}^k x_j. The recurrence (2.4) shows that E_k is a finite sum of bounded terms, and so E(|E_k|) < ∞.
Using the recurrence (2.4) and the mean independence of the δ_k, we obtain E(E_k | δ_1, …, δ_{k−1}) = E_{k−1}, and so we have proved that E_1, …, E_n is a martingale with respect to δ_1, …, δ_n. Moreover, |E_k − E_{k−1}| = |(ŝ_{k−1} + x_k)δ_k| ≤ u ∑_{j=1}^n |x_j| + O(u²). Hence by the Azuma–Hoeffding inequality (Lemma 2.3) we obtain the result.

We make two comments on Theorem 2.4. First, if we use a different martingale with E_k = ∑_{i=2}^k T_i δ_i then the proof becomes simpler, but at the cost of obtaining only the first order bound |ŝ − s| ≤ λ√(n − 1) u ∑_{j=1}^n |x_j| + O(u²). Second, we could obtain a bound with smaller higher order terms by bounding the product ∏_{i≠j}(1 + δ_i) in (2.7) by a probabilistic bound [6, Thm. 4.8] instead of a worst-case bound.
Combining Theorem 2.4 with the formula (2.2) for the backward error we obtain the backward error bound ε_bwd(ŝ) ≤ λ√(n − 1) u + O(u²), which is almost the same to first order as the backward error bound for inner products from [8, Thm. 3.1], differing mainly in the probability P(λ). In fact, the probability of failure 1 − P(λ) = 2 exp(−λ²/2) in Theorem 2.4 does not depend on n, and it is n times smaller than the probability of failure in [8, Thm. 3.1], which is 1 − Q(λ, n) = 2n exp(−λ²(1 − u)²/2). The reason that we are able to obtain a smaller probability of failure is that the analysis in [8] directly bounds the backward error perturbations θ_j in the expression ŝ = ∑_{j=1}^n x_j(1 + θ_j), |θ_j| ≤ ε_bwd, and the condition that the bound on |θ_j| must hold for all j multiplies the probability of failure by n.
We also note that Theorem 2.4 yields an almost identical bound and probability of failure as the probabilistic forward error analysis of inner products by Ipsen and Zhou [14,Cor. 4.8], which also employs a martingale.
2.2. Probabilistic analysis for random data. We now turn to the main goal of this new analysis: to obtain a sharper probabilistic backward error bound by exploiting more information about the summands x j . In order to do so, we must make some assumptions on the distribution of the x j , which we summarize in the following model.
Model 2 (probabilistic model of the data). The x_j, j = 1: n, are independent random variables sampled from a given distribution of mean µ_x and satisfy |x_j| ≤ C_x, j = 1: n, where C_x is a constant.
We also need to modify Model 1 to generalize the assumptions made on the rounding errors. Specifically, we require the rounding errors δ k to be mean independent of both the previous δ j and the x j , as stated in the following model.
Model 3 (modified probabilistic model of rounding errors for recursive summation). Consider the computation of s = ∑_{j=1}^n x_j by recursive summation for random data x_j satisfying Model 2. The rounding errors δ_2, …, δ_n produced by the computation are random variables of mean zero and, for all k, the δ_k are mean independent of the previous rounding errors and of the data, in the sense that E(δ_k | δ_2, …, δ_{k−1}, x_1, …, x_k) = E(δ_k) = 0. (2.11)

Model 3 generalizes Model 1 by taking the rounding errors to be mean independent of the data in addition to being mean independent of the previous rounding errors. Models 2 and 3 are not necessarily realistic. Model 2 is clearly not always applicable, since in real world applications the x_j may be neither random nor independent and they may not always be bounded. Furthermore, the rounding error in a floating-point operation depends on the operands, that is, δ in (2.1) depends on a and b, but in Model 3 we assume that the rounding errors are at least mean independent of the data.
The question of interest is whether our assumptions allow us to model usefully the actual rounding errors obtained in the computations we consider (cf. similar comments of Hull and Swenson [11] and Kahan [16], as discussed in [8]). We will now show that using Models 2 and 3 we can obtain insight into the conditions required for the general bound (2.6) (which is applicable to any data) to be sharp. In particular we will show that (2.6) is sharp for random x_j with a nonzero mean µ_x ≠ 0 but can be improved when µ_x = 0, which explains the difference between [0, 1] and [−1, 1] data previously observed in [8] and illustrated in Figure 1.1.
By Lemma 2.1, the size of |ŝ − s| is mainly determined by the size of the partial sums |T_i|. We therefore begin by deriving probabilistic bounds on |T_i|, for which we need the following concentration inequality (a concentration inequality bounds the deviation of a random variable from its mean) [10, Thm. 2].

Lemma 2.5 (Hoeffding's inequality). Let X_1, …, X_n be independent random variables satisfying |X_i| ≤ c_i, i = 1: n. Then, for any λ > 0, the inequality |∑_{i=1}^n X_i − E(∑_{i=1}^n X_i)| ≤ λ(∑_{i=1}^n c_i²)^{1/2} holds with probability at least 1 − 2 exp(−λ²/2).
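The behavior of the partial sums is easy to visualize (our illustration): for zero-mean data, concentration keeps T_i = x_1 + ⋯ + x_i within a modest multiple of C_x√i, whereas data with mean µ_x ≠ 0 drifts, with |T_i| ≈ i|µ_x|.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
i = np.arange(1, n + 1)

T_zero = np.cumsum(rng.uniform(-1.0, 1.0, n))   # mu_x = 0,   C_x = 1
T_half = np.cumsum(rng.uniform(0.0, 1.0, n))    # mu_x = 1/2, C_x = 1

print(np.max(np.abs(T_zero) / np.sqrt(i)))      # stays O(1)
print(T_half[-1] / n)                            # close to mu_x = 1/2
```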
Lemma 2.6. Let x_1, …, x_n satisfy Model 2. For any i and any λ > 0, the inequality |T_i − iµ_x| ≤ λC_x√i, where T_i = ∑_{j=1}^i x_j, holds with probability at least 1 − 2 exp(−λ²/2), and hence |T_i| ≤ i|µ_x| + λC_x√i with at least the same probability.

Proof. The result is obtained by applying Lemma 2.5 with n ← i, X_j = x_j, and c_j = C_x.

We now prove that the sequence of errors E_k = ∑_{i=2}^k T_i δ_i, k = 1: n, is a martingale with respect to T_1δ_1, …, T_nδ_n, and obtain a bound on |E_n|.
Lemma 2.7. Let the x_j satisfy Model 2. Then, under Model 3, the inequality |E_n| ≤ λ(|µ_x| n^{3/2} + λC_x n) u holds with probability at least 1 − 2n exp(−λ²/2).

Proof. We first show that E_1, …, E_n is a martingale with respect to T_1δ_1, …, T_nδ_n. We have E(E_k | T_1δ_1, …, T_{k−1}δ_{k−1}) = E_{k−1} + E(T_kδ_k | T_1δ_1, …, T_{k−1}δ_{k−1}). By the law of total expectation, the second term is equal to E(E(T_kδ_k | δ_1, …, δ_{k−1}, x_1, …, x_k)) = E(T_k E(δ_k | δ_1, …, δ_{k−1}, x_1, …, x_k)), where the second equality is obtained by noticing that fixing δ_1, …, δ_{k−1}, x_1, …, x_k leads to fixed T_k and T_1δ_1, …, T_{k−1}δ_{k−1}. By the mean independence of δ_k with respect to δ_2, …, δ_{k−1} and x_1, …, x_k ((2.11) in Model 3), this term is zero, and we have thus proved that E_1, …, E_n is a martingale. By Lemma 2.6, each of the n − 1 inequalities |T_i| ≤ i|µ_x| + λC_x√i, i = 2: n, fails to hold with probability at most 2 exp(−λ²/2). Therefore, by the inclusion–exclusion principle [19, p. 39], the probability that at least one of these n − 1 inequalities fails to hold is at most 2(n − 1) exp(−λ²/2). If all these inequalities hold simultaneously, which happens with probability at least 1 − 2(n − 1) exp(−λ²/2), then by applying Lemma 2.3 with c_i = u|T_i| and noting that E_1 = 0, for the same λ as above we have |E_n| ≤ λu(∑_{i=2}^n T_i²)^{1/2} ≤ λu√n(n|µ_x| + λC_x√n) = λ(|µ_x| n^{3/2} + λC_x n)u with probability at least 1 − 2n exp(−λ²/2). We slightly weaken the bound by replacing √(n − 1) by √n for readability.
We are finally ready for our main result.
Theorem 2.8. Let x ∈ R^n satisfy Model 2 with mean µ_x and constant C_x. Let s = ∑_{j=1}^n x_j and let ŝ be computed by recursive summation. Under Model 3, the inequality |ŝ − s| ≤ λ(|µ_x| n^{3/2} + λC_x n) u + O(u²) (2.14) holds with probability at least P(λ) = 1 − 2n exp(−λ²/2).
Proof. The result is a direct consequence of Lemma 2.7, since |ŝ − s| = |E_n| + O(u²) by Lemma 2.1.
We point out that unlike Theorem 2.4, which could be proved without using martingales by adding the assumption that the rounding errors are independent in Model 1, Theorem 2.8 necessarily requires martingales. This is because successive T_i variables depend on each other through the recursion T_{i+1} = T_i + x_{i+1}, and so the variables X_i = T_i δ_i are clearly not independent, even with the assumption that the rounding errors δ_i are.
The forward error bound (2.14) is clearly better than the deterministic bound |ŝ − s| ≤ n²C_x u + O(u²) [7, sect. 4.2]. More importantly, it is also better than the bound |ŝ − s| ≤ λn^{3/2}C_x u + O(u²) obtained by bounding |x_j| by C_x in the bound (2.6) of Theorem 2.4. Indeed, unlike (2.6), the bound (2.14) reveals an interesting dependence between the growth rate of the forward error and the mean µ_x. If µ_x ≠ 0 then the first term in (2.14) dominates and |ŝ − s| grows as n^{3/2}u, just like the bound (2.6). However, if µ_x = 0 (or if |µ_x| is very small) the second term in (2.14) dominates and |ŝ − s| grows only at most as nu, which is a factor √n smaller.

Theorem 2.8 can be intuitively explained as follows. By Lemma 2.1, the size of |ŝ − s| is determined by the size of the partial sums |T_i|, which depend in turn on the mean µ_x, as shown in Lemma 2.6. If µ_x ≠ 0 then |T_i| ≈ i|µ_x| for large i, so |T_i| can grow linearly with i; however, if µ_x = 0 then statistical effects associated with the data prevent |T_i| from exceeding a multiple of √i with high probability.

We now analyze what bound on the backward error (2.2) can be obtained from Theorem 2.8. Since we have already obtained an upper bound for the numerator |ŝ − s|, all that is left is to derive a lower bound for the denominator ∑_{j=1}^n |x_j|. We need the following lemma.
Lemma 2.9. Let w_1, …, w_n be independent random variables of mean µ_w satisfying |w_i| ≤ C_w. For any λ > 0 and any α ∈ [0, 1] such that (1 − α)|µ_w|√n ≥ λC_w, the inequality ∑_{j=1}^n w_j ≥ α|µ_w| n holds with probability at least P(λ) = 1 − 2 exp(−λ²/2).

Proof. Applying Lemma 2.5 with X_i = w_i and c_i = C_w we find that |∑_{j=1}^n w_j − nµ_w| ≤ λC_w√n holds with probability at least P(λ), which implies that the inequality ∑_{j=1}^n w_j ≥ |µ_w|n − λC_w√n holds with probability at least P(λ). We conclude by noting that |µ_w|n − λC_w√n ≥ α|µ_w|n is equivalent to (1 − α)|µ_w|√n ≥ λC_w.

We wish to apply the lemma with w = |x|. Note that the condition (1 − α)µ_{|x|}√n ≥ λC_x in the lemma holds for large enough n as long as µ_{|x|} ≠ 0. Even though one can construct special distributions of x_j satisfying Model 2 for which µ_{|x|} is very small (for instance, this is the case if the x_j have zero mean and very small variance), for classical distributions such as uniform or normal distributions, µ_{|x|} is a strictly positive constant independent of n. Then ∑_{j=1}^n |x_j| grows proportionally to n.
Corollary 2.10. Let x ∈ R^n satisfy Model 2 and let |x| also satisfy Model 2 with mean µ_{|x|}, and let ŝ be computed by recursive summation. Under Model 3, for any n such that there exists α ∈ [0, 1] for which (1 − α)µ_{|x|}√n ≥ λC_x, the backward error satisfies ε_bwd(ŝ) ≤ (λ|µ_x|√n + λ²C_x) u / (α µ_{|x|}) + O(u²) with probability at least P(λ) = 1 − 2(n + 1) exp(−λ²/2).

The backward error therefore grows at most as √n u when µ_x ≠ 0, whereas if µ_x = 0 it does not grow with n and remains close to the unit roundoff u. Corollary 2.10 therefore explains the experimental results reported in Figure 1.1 and answers an open question posed in [8].
It is worth emphasizing that the differing behavior of the backward error for different distributions is therefore related to the mean of the summands rather than their signs. This can be easily verified experimentally by sampling the summands uniformly in [−1, 3] (say), which leads to a very similar backward error to the [0, 1] case.
We comment on the role of the parameter λ in the bounds. Compared with the bound of Theorem 2.4, the new bound of Theorem 2.8 and all the later bounds in this paper have an extra term proportional to λ 2 . However, the impact of λ on the bounds is harmless, because for practical values of n (say, less than 10 10 ), small values of λ (less than 10) suffice to make P (λ) very close to 1, as already observed in [8]. Moreover, in practice, we observe the probability P (λ) to be very pessimistic; the bounds consistently hold with λ ≈ 1.
We now discuss the relative forward error |ŝ − s|/|s|. By (2.2), this quantity is bounded by κ times the backward error, where κ = ∑_{j=1}^n |x_j| / |s| is the condition number for summation [4]. While it is tempting to derive a probabilistic bound on κ, there is not much we can actually say, as we show below. (The bounds that follow are probabilistic ones, but we do not keep track of the specific probabilities for simplicity of the discussion.) The numerator grows linearly with n, since by Lemma 2.9 applied to |x| it is bounded below by α_1 µ_{|x|} n for any α_1 ∈ [0, 1] such that (1 − α_1)µ_{|x|}√n ≥ λC_x, and bounded above by C_x n. When µ_x ≠ 0, by Lemma 2.9 applied to x the denominator similarly satisfies α_2 |µ_x| n ≤ |s| ≤ C_x n for any α_2 ∈ [0, 1] such that (1 − α_2)|µ_x|√n ≥ λC_x. We therefore have (with certain probabilities) the bounds α_1 µ_{|x|} / C_x ≤ κ ≤ C_x / (α_2 |µ_x|), (2.15) and thus κ is of order 1 when all the involved constants are of order 1.

The picture is entirely different if µ_x = 0, because |s| may then be arbitrarily small with significant probability, and so we cannot obtain a nonzero lower bound on |s|. Together with the upper bound |s| ≤ λC_x√n from Lemma 2.6, the best we can conclude is that κ ≥ α_1 µ_{|x|} √n / (λC_x). (2.16) The condition number therefore grows at least as √n, but may be much larger than that. This is illustrated in Figure 2.1, which plots the forward error for the same data as in Figure 1.1. The occasional but not rare spikes of the forward error for the [−1, 1] data show that the forward error may sometimes be much larger than √n u.

We finally mention an important consequence of this probabilistic condition number analysis in mixed-precision settings. Suppose we compute s = ∑_{i=1}^n x_i in precision u² and store the result in precision u (this is similar to what GPU tensor core units implement for matrix–matrix products [3]). Then the computed ŝ satisfies |ŝ − s| ≤ u|s| + O(u²) and we obtain a backward error bound that is inversely proportional to the condition number: ε_bwd(ŝ) ≤ u / κ + O(u²). (2.17)
Together with (2.16), which shows that κ must increase at least as √ n for random data of zero mean, (2.17) explains why the backward error may decrease as n increases in mixed-precision settings (see, e.g., [3, Fig 3.2], which plots maxima of columnwise backward errors for matrix multiplication, which satisfy a bound of the form (2.17)).
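This mixed-precision effect can be sketched as follows (our illustration): we accumulate the sum in binary64 (playing the role of precision u²) and round the result once to binary32. The backward error is then about u|s|/∑|x_j| = u/κ, so for zero-mean data it decreases as n grows.

```python
import numpy as np

u = 2.0**-24  # unit roundoff of binary32
rng = np.random.default_rng(3)
for n in (10**3, 10**5):
    x = rng.uniform(-1.0, 1.0, n)
    s = np.sum(x)                      # "exact" sum, binary64
    shat = np.float32(s)               # one rounding to binary32
    bwd = abs(float(shat) - s) / np.sum(np.abs(x))
    kappa = np.sum(np.abs(x)) / abs(s)
    print(n, bwd, u / kappa)           # bwd is at most about u/kappa
```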
3. Application to basic linear algebra kernels. We now apply our analysis of recursive summation to inner products, matrix–vector products, and matrix–matrix products.
Throughout the section, a vector or a matrix W is said to "satisfy Model 2 with mean µ and bound C" if its entries w_ij satisfy the model with E(w_ij) = µ and |w_ij| ≤ C, for all i and j. We also write µ_{|W|} for the mean of the absolute values of the entries, E(|w_ij|), which is the same for all i and j by assumption. When we write "let A, B, and C satisfy Model 2", this is understood to mean that all the random variables comprising the elements of A, B, and C are mutually independent. This implies that we cannot take A = B, for example.
The extension of our analysis to the inner product of two vectors x, y ∈ R^n is relatively straightforward since inner products are sums of the form xᵀy = ∑_{j=1}^n x_j y_j. We first state the following trivial extension of Lemma 2.9.
Corollary 3.1. Let x, y ∈ R^n be such that |x| and |y| satisfy Model 2 with means µ_{|x|}, µ_{|y|} and bounds C_x, C_y. For any λ > 0 and any α ∈ [0, 1] such that (1 − α)µ_{|x|}µ_{|y|}√n ≥ λC_xC_y, the inequality ∑_{j=1}^n |x_j y_j| ≥ α µ_{|x|}µ_{|y|} n holds with probability at least P(λ) = 1 − 2 exp(−λ²/2).

Proof. The result is obtained by applying Lemma 2.9 to the vector w with w_i = |x_i y_i| and using the fact that E(w_i) = E(|x_i|)E(|y_i|) = µ_{|x|}µ_{|y|}, since |x| and |y| are independent.
The backward error for an approximate inner product ŝ is, using (2.2), ε_bwd(ŝ) = |ŝ − xᵀy| / (|x|ᵀ|y|).

Theorem 3.2. Let x, y ∈ R^n and let s = xᵀy be computed by recursive summation. If x and y satisfy Model 2 with means µ_x and µ_y and bounds C_x and C_y then under Model 3 the computed ŝ satisfies |ŝ − s| ≤ (λ|µ_x µ_y| n^{3/2} + (λ² + 1)C_x C_y n) u + O(u²) (3.2) with probability at least P(λ) = 1 − 2n exp(−λ²/2). If |x| and |y| also satisfy Model 2 with means µ_{|x|} and µ_{|y|} then for any n such that there exists α ∈ [0, 1] for which (1 − α)µ_{|x|}µ_{|y|}√n ≥ λC_xC_y, the backward error is bounded by ε_bwd(ŝ) ≤ (λ|µ_x µ_y|√n + (λ² + 1)C_x C_y) u / (α µ_{|x|}µ_{|y|}) + O(u²) (3.3) with probability at least P(λ) = 1 − 2(n + 1) exp(−λ²/2).
Proof. Let z_j = x_j y_j. The computed ẑ_j satisfies ẑ_j = x_j y_j(1 + δ_j), |δ_j| ≤ u. Let t = ∑_{j=1}^n ẑ_j. By Lemma 2.1, we have t̂ − t = ∑_{i=2}^n T̂_i δ_i + O(u²), where T̂_i = ∑_{j=1}^i ẑ_j. We cannot directly apply Hoeffding's inequality to T̂_i since the ẑ_j may be dependent (via possibly dependent δ_j). However, since ẑ_j = z_j + O(u), we have T̂_i = W_i + O(u), where W_i = ∑_{j=1}^i z_j and where the z_j satisfy Model 2 with E(z_j) = µ_x µ_y and |z_j| ≤ C_x C_y. Therefore, by Lemma 2.7, we obtain |t̂ − t| ≤ λ(|µ_x µ_y| n^{3/2} + λC_x C_y n) u + O(u²) (3.5) with probability at least P(λ) = 1 − 2n exp(−λ²/2). Since ŝ = t̂, we use the triangle inequality and combine (3.5) with the bound |t − s| ≤ C_x C_y n u to obtain (3.2). To obtain (3.3), we bound |x|ᵀ|y| below by Corollary 3.1.
Compared with the bound (2.14) for summation, the bound (3.2) on the forward error |ŝ − s| for inner products only has an extra "+1" term, which corresponds to the initial multiplications z_j = x_j y_j. We obtain a backward error bound (3.3) of order O(u), instead of O(√n u), when the entries of either x or y have small mean. We illustrate this result with a numerical experiment in Figure 3.1a, which confirms that as long as at least one of the two vectors has small mean then the backward error does not grow with n.
We now turn to matrix–vector products, which are straightforward to analyze since they simply consist of multiple inner products. The backward error for an approximation ŷ to a matrix–vector product Ax is (3.6) ε_bwd(ŷ) := min{ε ≥ 0 : ŷ = (A + ΔA)x, |ΔA| ≤ ε|A|} = max_i |ŷ − Ax|_i / (|A||x|)_i, where the last equality follows from the Oettli–Prager theorem [7, Thm. 7.3], [18].
Theorem 3.3. Let A ∈ R^{m×n}, x ∈ R^n, and y = Ax. Assume A and x satisfy Model 2 with means µ_A and µ_x and bounds C_A and C_x. Also assume |A| and |x| satisfy Model 2 with means µ_{|A|} and µ_{|x|}. Under Model 3, for any n such that there exists α ∈ [0, 1] such that (1 − α)µ_{|A|}µ_{|x|}√n ≥ λC_AC_x, the backward error of the computed ŷ satisfies ε_bwd(ŷ) ≤ (λ|µ_A µ_x|√n + (λ² + 1)C_A C_x) u / (α µ_{|A|}µ_{|x|}) + O(u²) with probability at least P(λ) = 1 − 2m(n + 1) exp(−λ²/2).
We reach the same conclusions regarding the backward error for computing matrix–vector products as for inner products, the only difference being that the probability of failure of the bound is larger by a factor m. We obtain a backward error bound of order u rather than order √n u if all the entries of either the vector x or the matrix A have small mean. This result is confirmed by the numerical experiment in Figure 3.1b.
Finally, we give a result for matrix multiplication. The proof is a direct application of Theorem 3.2, as in Theorem 3.3.
Theorem 3.4. Let A ∈ R^{m×n} and B ∈ R^{n×p} satisfy Model 2 with means µ_A, µ_B and bounds C_A, C_B, and let C = AB. Under Model 3, the computed Ĉ satisfies max_{i,j} |ĉ_{ij} − c_{ij}| ≤ (λ|µ_A µ_B| n^{3/2} + (λ² + 1)C_A C_B n) u + O(u²) with probability at least P(λ) = 1 − 2mnp exp(−λ²/2).
4. Reducing the backward error by reducing the data mean. Corollary 2.10 shows that under suitable assumptions the backward error for summing n numbers x_j of mean µ_x is, with high probability, of order √n u when µ_x ≠ 0 but only of order u when µ_x = 0. It is therefore natural to ask whether this property can be exploited to make computations more accurate: by reducing the mean of the data, can we reduce the backward error? Consider, for instance, the following simple idea: given n summands x_j of mean µ_x ≠ 0, define y_j = x_j − µ_x and compute (4.1) s = ∑_{j=1}^n y_j + nµ_x.
Since the y_j have zero mean by construction, we may hope to reduce the backward error by computing s with (4.1) rather than with classical recursive summation. Proving that the backward error is indeed reduced from O(√n u) to O(u) when the x_j satisfy Model 2 and Model 3 holds is the object of section 4.1.
The cost of computing (4.1) is, however, significant, since it requires n + 2 additional flops: the n subtractions x_j − µ_x, the multiplication nµ_x, and the final addition, to which must be added another n flops for computing µ_x if it is not known. It is therefore roughly two to three times more expensive to compute s by (4.1) than by standard recursive summation. This makes the algorithm unattractive for low precisions, since simply using a higher precision would typically be cheaper. However, as we explain in section 4.2, the same idea can be generalized to matrix multiplication, and in this case the overhead of transforming the sums arising in the computation into zero-mean sums becomes asymptotically negligible.
4.1. Analysis for recursive summation. Algorithm 4.1 computes the sum of n numbers x_j of nonzero mean µ_x. As mentioned above, we may expect this algorithm to yield a smaller backward error than recursive summation under Models 2 and 3. We will now prove that this is indeed the case. For the analysis, we assume that the x_j satisfy Model 2 and, for simplicity, we also assume that µ_x is known exactly, although if we have only an approximate µ̂_x = µ_x + O(u) then the analysis below is essentially unaffected.

Algorithm 4.1 (shifted summation). Given x_1, …, x_n and µ_x, this algorithm computes s = ∑_{j=1}^n x_j.
1: for j = 1: n do
2:   y_j = x_j − µ_x
3: end for
4: t = ∑_{j=1}^n y_j % By recursive summation.
5: s = t + nµ_x

Each computed ŷ_j on line 2 satisfies ŷ_j = (x_j − µ_x)(1 + δ_j), |δ_j| ≤ u. (4.2) Let t = ∑_{j=1}^n ŷ_j. In the same vein as the proof of Theorem 3.2, we have by Lemma 2.1 t̂ − t = ∑_{i=2}^n T̂_i δ_i + O(u²), where T̂_i = ∑_{j=1}^i ŷ_j, and therefore T̂_i = W_i + O(u), where W_i = ∑_{j=1}^i y_j and where the y_j satisfy Model 2 with E(y_j) = µ_y = 0 and |y_j| ≤ C_y = C_x + |µ_x|. By Lemma 2.7 we obtain the bound |t̂ − t| ≤ λ(|µ_y| n^{3/2} + λC_y n) u + O(u²) with probability at least P(λ) = 1 − 2n exp(−λ²/2). Since µ_y = 0 by construction, we therefore obtain |t̂ − t| ≤ λ²C_y n u + O(u²). (4.4) Finally, the computed ŝ on line 5 satisfies, by (2.1), ŝ = (t̂ + nµ_x)(1 + δ), |δ| ≤ u. (4.5) Combining (4.2), (4.4), and (4.5) gives |ŝ − s| ≤ ((λ² + 1)C_y + C_x) n u + O(u²). From (2.2) and Lemma 2.9 we obtain the following backward error result.

Theorem 4.1. Let s = ∑_{j=1}^n x_j and let ŝ be the computed sum from Algorithm 4.1. If x satisfies Model 2 and |x| also satisfies Model 2 then, under Model 3, for any n such that there exists α ∈ [0, 1] such that (1 − α)µ_{|x|}√n ≥ λC_x, the backward error bound ε_bwd(ŝ) ≤ ((λ² + 1)(C_x + |µ_x|) + C_x) u / (α µ_{|x|}) + O(u²) holds with probability at least P(λ) = 1 − 2(n + 1) exp(−λ²/2).
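The shifted summation of Algorithm 4.1 can be sketched as follows (a minimal NumPy sketch, ours, with binary32 as the working precision; np.cumsum accumulates sequentially in the array's dtype, so it plays the role of recursive summation).

```python
import numpy as np

def shifted_sum(x32, mu):
    # Algorithm 4.1: shift to zero mean, sum, add back n*mu_x.
    y = (x32 - np.float32(mu)).astype(np.float32)       # line 2
    t = np.cumsum(y)[-1]                                # line 4
    return np.float32(t + np.float32(len(x32) * mu))    # line 5

rng = np.random.default_rng(4)
n = 100_000
x = rng.uniform(0.0, 1.0, n).astype(np.float32)  # mean 1/2, far from 0
s_exact = np.sum(x.astype(np.float64))
denom = np.sum(np.abs(x.astype(np.float64)))

err_classical = abs(float(np.cumsum(x)[-1]) - s_exact) / denom
err_shifted = abs(float(shifted_sum(x, 0.5)) - s_exact) / denom
print(err_classical, err_shifted)   # shifted is far more accurate
```

The shifted variant brings the backward error down from roughly √n u to the order of u, as Theorem 4.1 predicts.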
Theorem 4.1 confirms that the proposed Algorithm 4.1 achieves a backward error bound that is independent of n to first order, potentially reducing the error by several orders of magnitude. We now apply this promising idea to matrix multiplication.

4.2. Application to matrix multiplication. Let A ∈ R^{m×n} and B ∈ R^{n×p}. We denote by e_n ∈ R^n the vector of ones.

Algorithm 4.2. Let A ∈ R^{m×n} and B ∈ R^{n×p}. This algorithm computes C = AB by transforming A to a matrix Ã with rows of zero mean.
1: x = (1/n) A e_n % vector of row means of A
2: y = e_n
3: Ã = A − x yᵀ
4: z = yᵀ B
5: C̃ = Ã B
6: C = C̃ + x z
Algorithm 4.2 performs the matrix–matrix product C = AB in such a way that most of the inner products arising in the computation involve a vector with zero mean. The key idea is to perturb A by a rank-1 matrix xyᵀ. Several choices of x and y can be considered; a natural choice for x_i is the mean of the ith row of A and for y the vector of ones, so that all rows of Ã = A − xyᵀ have mean zero. Then computing the product C̃ ← ÃB amounts to computing mp inner products ã_iᵀ b_j, where ã_iᵀ is the ith row of Ã and has mean zero. The desired result is recovered by computing C = C̃ + x(yᵀB). Note that we could alternatively perturb the columns of B to have zero mean.
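A hedged sketch of Algorithm 4.2 (ours, not the paper's code), with binary32 as the working precision and binary64 standing in for the extended precision used for yᵀB; the small sizes m = p = 2 and the n-normalized error measure are our choices for illustration only.

```python
import numpy as np

def fp32_matmul(A, B):
    # Recursive-summation matmul in binary32 (np.cumsum accumulates
    # sequentially in the array's dtype).
    Z = A[:, None, :].astype(np.float32) * B.T[None, :, :].astype(np.float32)
    return np.cumsum(Z, axis=2)[:, :, -1]

def zero_mean_matmul(A, B):
    x = A.mean(axis=1).astype(np.float32)         # row means of A
    A_tilde = A.astype(np.float32) - x[:, None]   # rows now have mean ~0
    C_tilde = fp32_matmul(A_tilde, B)
    z = B.astype(np.float64).sum(axis=0)          # e^T B in binary64
    return (C_tilde.astype(np.float64) + np.outer(x, z)).astype(np.float32)

rng = np.random.default_rng(5)
m, n, p = 2, 100_000, 2
A = rng.uniform(0.0, 1.0, (m, n)).astype(np.float32)
B = rng.uniform(0.0, 1.0, (n, p)).astype(np.float32)
C_exact = A.astype(np.float64) @ B.astype(np.float64)

err = lambda C: np.max(np.abs(C.astype(np.float64) - C_exact)) / n
err_classical = err(fp32_matmul(A, B))
err_new = err(zero_mean_matmul(A, B))
print(err_classical, err_new)   # the zero-mean variant is more accurate
```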
One crucial point is that the computation of yᵀB leads to an accumulation of n rounding errors per entry of yᵀB. If y is a vector of nonzero mean (as is the case with the choice described above), and since in general the columns of B also have nonzero mean, computing yᵀB conventionally will negate any benefit obtained by perturbing A. The simplest strategy, which will be adopted throughout this section, is to compute yᵀB in extended precision. In the case where this strategy is not possible (because no higher precision is available) or not desirable (because the use of a higher precision would lead to a severe loss of performance), one should use an accurate algorithm to compute yᵀB, such as Kahan's compensated summation [7, Alg. 4.2], [15]. The flop overhead of such an accurate algorithm remains negligible as long as m ≫ 1.
The following theorem is a straightforward extension of Theorem 4.1 for Algorithm 4.2. Compared with Theorem 3.4, which includes an n^{3/2} term, Theorem 4.2 provides a bound linear in n.
Theorem 4.2. Let A ∈ R^{m×n} and B ∈ R^{n×p} satisfy Model 2 with bounds C_A, C_B. If C = AB is computed using Algorithm 4.2, where yᵀB (line 4) is computed using precision u², then under Model 3 the computed Ĉ satisfies max_{i,j} |ĉ_{ij} − c_{ij}| ≤ 2(λ² + 2)C_A C_B n u + O(u²) with probability at least P(λ) = 1 − 2mnp exp(−λ²/2).
Proof. By construction µ_Ã = O(u) and C_Ã ≤ 2C_A. Let W denote the computed C̃. By Theorem 3.4, max_{i,j} |w_{ij} − c̃_{ij}| ≤ (λ|µ_Ã µ_B| n^{3/2} + (λ² + 1)C_Ã C_B n) u + O(u²) ≤ 2(λ² + 1)C_A C_B n u + O(u²) with probability at least P(λ) = 1 − 2mnp exp(−λ²/2). The computed Ĉ then incurs three further error terms, corresponding to computing z = yᵀB in precision u², w = xz in precision u, and W + w in precision u, which are of order O(u²), C_AC_B n u, and C_AC_B n u + O(u²), respectively. The proof follows using the triangle inequality.

4.2.1. Numerical experiments on random dense matrices. In Figure 4.1, we report some numerical experiments for computing C = AB, where A ∈ R^{m×n} and B ∈ R^{n×p} are randomly generated with uniform [0, 1] entries. We set m = p = 32 and compare the error growth for Algorithm 4.2 and the classical matrix multiplication algorithm for increasing n. We measure the error by (4.8) ε(Ĉ) = max_{i,j} |ĉ_{ij} − c_{ij}| / (|A||B|)_{ij}. The matrix product C = AB is performed in the working precision, which is single precision or half precision in Figures 4.1a and 4.1b, respectively. All other computations, which require a negligible amount of flops, are performed in double precision (importantly, this includes the computation of yᵀB, as previously explained). For half precision we use the IEEE fp16 format, simulating it using the chop function of Higham and Pranesh [9]. The "exact" C, used to evaluate the error (4.8), is computed using double precision.
The results show that the new algorithm does not suffer from the error growth of the classical algorithm and can deliver an error of order u, regardless of the size of the matrices. In contrast, the classical algorithm leads to an error rapidly growing as n increases. For moderate values of n, this growth is proportional to √n u as predicted by Theorem 3.4. For larger values of n, the error starts growing more rapidly, proportionally to the worst-case bound nu. This is due to a phenomenon called stagnation, which we observed and explained in [8, sec. 4.2.1]. Stagnation occurs when the value of a partial sum becomes so large that subsequent summands do not increase its floating-point value. This leads to rounding errors that are necessarily of negative sign, which violates our probabilistic model (Model 3). In our experiments, Algorithm 4.2 is able to avoid stagnation by reducing the mean of the computed sums, hence preventing them from reaching such large values.
Interestingly, the error of the classical algorithm in single precision (Figure 4.1a) eventually (at about n ≥ 10^6) becomes larger than that of the new algorithm in half precision (Figure 4.1b). Thus, for large n, we expect the new algorithm to be both more accurate and faster than the classical algorithm.

4.2.2. Comparison with other error-reducing algorithms. We now compare our new algorithm, in terms of both cost and accuracy, with other existing approaches that target improved accuracy.
A well-known algorithm to improve the accuracy of summation is compensated summation, which can be applied to matrix multiplication and yields an error ε(Ĉ) in (4.8) bounded by 2u + O(u²). Figure 4.2 shows that, for matrices with entries sampled uniformly from [0, 1], Algorithm 4.2 achieves an error comparable to that of the compensated algorithm, despite being much less expensive: compensation requires at least 2.5 times as many flops as the classical algorithm.
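Compensated summation can be sketched as follows (Kahan's version, shown here for a single inner product; the matrix-multiplication variant applies the same recurrence to every entry of C). The function name is illustrative.

```python
import numpy as np

def compensated_dot(x, y):
    """Inner product in single precision with Kahan compensated
    summation: the rounding error of each addition is recovered and
    subtracted from the next summand, giving an error of order u
    instead of order nu, at the cost of extra flops per term."""
    s = np.float32(0.0)   # running sum
    c = np.float32(0.0)   # running compensation (lost low-order bits)
    for xi, yi in zip(x.astype(np.float32), y.astype(np.float32)):
        t = xi * yi - c
        tmp = s + t
        c = (tmp - s) - t   # rounding error committed by s + t
        s = tmp
    return s
```

The four extra operations per addition are the source of the roughly 2.5x flop overhead mentioned above.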
We recently proposed a class of blocked summation algorithms called FABsum that aims to achieve a compromise between accuracy and performance [4]. The idea at the heart of FABsum is to first compute local sums of b numbers in a fast way, where b is a block size that should not grow with n; the total sum is then obtained by summing the blockwise sums in an accurate way, such as with compensation. This method leads to an error bounded by bu and has an overhead of only O(mnp/b) flops, which can be made negligible by increasing b. In terms of accuracy, Figure 4.2 shows that FABsum and Algorithm 4.2 are comparable for large n on these random matrices; for smaller n, one must choose b small enough for FABsum to rival Algorithm 4.2. Note, however, that Algorithm 4.2 specifically targets random matrices. For general matrices FABsum should be preferred, because its worst-case error bound bu holds for any data without any assumptions.
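The FABsum idea can be sketched as follows, assuming recursive summation in the working precision as the fast method and compensated summation as the accurate one (the choice of methods and the function name are illustrative):

```python
import numpy as np

def fabsum(x, b=128):
    """Sketch of FABsum in single precision: fast sums of blocks of b
    numbers, then an accurate (compensated) combination of the n/b
    block sums. Worst-case error bu instead of nu, with only O(n/b)
    extra flops for the accurate stage."""
    x = x.astype(np.float32)
    # fast local sums of b consecutive numbers each
    blocks = [x[i:i + b].sum(dtype=np.float32) for i in range(0, x.size, b)]
    # accurate combination of the block sums (Kahan compensation)
    s = np.float32(0.0)
    c = np.float32(0.0)
    for t in blocks:
        t = t - c
        tmp = s + t
        c = (tmp - s) - t
        s = tmp
    return s
```

Since only the n/b block sums go through the expensive compensated stage, the overhead shrinks as b grows, at the price of a larger bu error bound.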

Conclusion. We have performed a new backward error analysis for basic numerical linear algebra kernels that combines a probabilistic model of the rounding errors with a second probabilistic model of the data. Our analysis gives a theoretical explanation of the strong dependence of the backward error on the values of the data that was previously observed in [8] and can be seen in [1], [3], [4], [5], [14], [20]. We showed that for data with zero or small mean, the probabilistic backward error bound √n u from [8] can be relaxed to cu, where c is a constant independent of n. Our analysis covers summation, inner products, matrix-vector products, and matrix-matrix products. For all these kernels applied to random data, our analysis accurately predicts the growth of the backward error in terms of the means of the entries of the matrices and vectors arising in the computation.
Motivated by these findings, we proposed transforming the data to have zero mean, so as to benefit from the more favorable probabilistic error bounds. We implemented this idea for matrix multiplication, for which the transformation overhead is asymptotically negligible. We found that our new algorithm can indeed reduce the error bound by a factor √ n for random matrices, rivalling some state-of-the-art error-reducing algorithms (such as FABsum [4]) in terms of accuracy, at a potentially lower cost.
In future work, we will investigate the extension of our analysis to LU factorization and the solution of linear systems. This is not straightforward, because for these kernels the data cannot be assumed to be independent as in Model 2. Consider, for example, the solution of a lower triangular system Lx = b by substitution. The ith component of the solution x is given by x_i = (b_i − Σ_{j=1}^{i−1} ℓ_{ij} x_j)/ℓ_{ii}, and the components x_j are dependent. Theorem 2.8 therefore cannot be applied directly to the summation term.
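The dependence is visible directly in the substitution recurrence, sketched below: each computed component x_j re-enters the summation for all later components, so the summands of each inner sum are not independent data.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Lx = b for lower triangular L by substitution. The inner
    sum involves the previously computed components x[j], so the
    summands at each step are statistically dependent on each other,
    which is why the independent-data Model 2 does not apply here."""
    n = b.size
    x = np.zeros(n)
    for i in range(n):
        # x_i = (b_i - sum_{j<i} l_ij * x_j) / l_ii
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x
```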