Stupid is as Stupid Does: Taking the Square Root of the Square of a Floating-Point Number

Floating-point experts know that mathematical formulas may fail or give imprecise results when implemented in ﬂoating-point arithmetic. This article describes an example where, surprisingly, it is absolutely not the case. Indeed, using radix 2 and an unbounded exponent range, the computation of the square root of the square of a ﬂoating-point number a is exactly | a | . A consequence is the fact that the ﬂoating-point computation of a/ (cid:112) ( a 2 + b 2 ) is always in the interval [ − 1 , 1]. This removes the need for a test when calling an arccos or an arcsin on this value. For more guarantees, this property was formally checked using the Coq proof assistant and the Flocq library. The conclusion will give hints on what happens without assumptions and in other radices, where the behavior is very diﬀerent.


Introduction
Floating-point (FP) arithmetic is seen as intricate because too few people have sufficient knowledge to understand how it works.For people having been only trained with mathematics, facts such that (x+y)+z may be different from x+(y+z) for certain values or the fact that there exists x such that x = 0, but x2 = 0 is beyond comprehension.This is the reason why mathematical formulas are most of the time programmed as they stand in mathematical textbooks even if this may not be a good idea [1].The main shortcoming is the limited precision: as the exact mathematical result cannot be exactly represented, it has to be rounded to the nearest floating-point values (other roundings are available [6] but are seldom used, except in interval arithmetic).

Boldo
An example where it might happen is the following one.Given a right triangle with sides a, b and the hypotenuse h, we want to compute one of the angle θ, that is such that sin(θ) = a/h = a/ (a 2 + b 2 ).Therefore, to compute θ, one has to take the arcsin of a This may be a problem: the arcsin function assumes an input between −1 and 1, but this FP computation may be incorrect.How can we be sure that the arcsin computation will not fail?A simple solution is to compute in FP arithmetic where • is a FP rounding.If x > 1, we return 1.If x < −1, we return −1.Then, we are sure that the result will be between −1 and 1 and that the call to arcsin will not fail.Another more complicated solution (if possible) is to prove that x will always be between −1 and 1, hence getting rid of the tests.This is a substantial performance benefit: when the test is not correctly guessed, the pipeline has to be flushed, which is very costly.This is the goal of this article.It is easy to see that the worst case corresponds to b = 0 where x reduces to . It is sufficient (but not necessary!)to prove that • • (a 2 ) = |a| as it implies that x will be either 1 or −1.This is the reason we will study what happens when taking the square root of the square of a floating-point number.This is not a new problem: Cody and Waite use this property as a test for the square root function [4] (pages 12 and 28).It is also stated in one of Kahan's web papers [5] (page 29).There is no detailed proof in any of there references.In particular, the minimal precision is never explicit.
More than a pen-and-paper proof, this work gives a high guarantee of its correctness and gives a precise hypothesis on the needed precision.We will rely on the Coq proof assistant.From the formal methods point of view, we will base our proof on the Flocq library [3].Flocq is a formalization in Coq that offers a multiradix and multi-precision formalization for various floating-and fixed-point formats (including FP with or without gradual underflow) with a comprehensive library of theorems.Its usability and practicality have been established against test-cases [2].
We will here assume we are using rounding to nearest, ties to even in radix 2, denoted by •, with a precision p > 1.In particular, the square root is assumed to be computed correctly.Note that p > 1 will be the only requirement on the precision in the whole paper.We will also assume that there is neither underflow nor overflow: the exponent range is unbounded.This corresponds to the FLX format of the Flocq library [3].What happens using a bounded exponent range is described in the conclusion.The proof is available in the example sub-directory of the Flocq library from version 2.3.0 available at http://flocq.gforge.inria.fr/.

Boldo
This article is organized as follows: Section 2 describes some intermediate rather generic lemmas.Section 3 describes the main results.Section 4 concludes and gives hints on what happens in practice and with radices other than 2.

Intermediate Lemmas
The ulp (unit in the place) denotes the value of the last bit of the mantissa of a FP number or the corresponding value for a real number [6,3].
Lemma 1 (round le half an ulp) Let v be a real and u be a positive FP number.Assume that v < u + ulp(u)/2, then •(v) ≤ u.
Proof.It is already proved in Flocq that the generic rounding to nearest (with any tie-breaking rule) is monotone, that is to say, if x < y, then • 1 (x) ≤ • 2 (y) whatever the two tie-breaking-rules.As v < u + ulp(u)/2 and if we consider the even tie-breaking rule for v and the tie-breaking rule down (or towards zero) for u + ulp(u)/2, we get •(v) ≤ u, as u + ulp(u)/2 is the midpoint between u and its successor. 2 Lemma 2 (round ge half an ulp) Let v be a real and u be a positive FP number.First assume that u is not a power of 2. Assume also that u This inequality is more complicated than the previous one, with a proof about four times bigger.

Proof.
A key point is the fact that u is not a power of the radix.Then, its predecessor has the same exponent, and the same ulp.This fact takes about half of the proof.It relies on the fact that the FP format is floating-point with an unbounded exponent range.
The rest of the proof is similar to the previous one, relying on the monotony of rounding to nearest, whatever the ties.An additional difficulty is the need to prove that the predecessor of u is positive, which is also a consequence of the unbounded exponent range. 2 The last lemma of this section may seem strange, but it is easier to prove it before the main theorem.Moreover, this is the only place where we rely on the fact that the radix is 2.
Lemma 3 (ulp sqr ulp lt) Let u be a positive real number.Then Proof.There is no subtlety here.Let e be such that ulp(u) = 2 e−p and i such that ulp u 2 = 2 i−p .then we either have i = 2 e − 1 or i = 2 e.We then split into the two cases, handle substitutions and inequalities and prove the results holds in both cases. 2

Boldo
Proof.The proof is quite simple from the previous lemmas:

Conclusion and limits
We have proved that, using radix 2 and a precision greater than 1, the rounded computation of √ a 2 is indeed |a|.This leads to the fact that the computation of will always be between -1 and 1, so that further computations will not fail.
A problem ignored here is that of limited range for the exponent.Contrarily to ( √ a) 2 , the computation of √ a 2 may underflow and overflow.For very small positive a, the result will be 0 and for big finite a, it will be +∞.This means this can be detected afterwards.Moreover, the preceding result is valid only for medium-range a, that is to say |a| is about between 2 −511 and 2 511 in the binary64 format.Scaling may be necessary.
Another question is if this still holds when using rounding to nearest, ties away from zero [6].The answer is yes as no mid-point behavior is used.The bigger problem is that, if we want to express this property using any tie on any rounding, we must quantify on the 5 roundings that appear in the last theorem, which is cumbersome.If directed roundings are used, the results do not hold anymore.

Other radices
It seems that x = • • (x 2 ) still holds in any radix when the mantissa of x is small.But we have examples where x > • • (x 2 ) , even by several ulps: • With a radix β = 10 and a precision p = 4, if x = 31.66,then we have • • (x 2 ) = 31.65.
All these examples are just above a multiple of √ β, hinting where the worst case may lie.The number of ulps between x and • • (x 2 ) also seems to grow when the radix is increased.

Boldo
A more interesting fact is that, even if x = • • (x 2 ) in many cases, the division returns 1 nonetheless on those cases.We were not able to find values such that the computation of is above 1.Some initial work seem to show that this holds for radices, the difficult cases being radix 3 and 5. Formal developments in Coq are in progress, with a focus on the smallest minimal precision needed.