Computation of the unit in the first place (ufp) and the unit in the last place (ulp) in precision-p base β\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta $$\end{document}

There are simple algorithms to compute the predecessor, successor, unit in the first place, unit in the last place etc. in binary arithmetic. In this note equally simple algorithms for computing the unit in the first place and the unit in the last place in precision-p base-β\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta $$\end{document} arithmetic with p⩾1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p \geqslant 1$$\end{document} and with β⩾2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta \geqslant 2$$\end{document} are presented. The algorithms work in the underflow range, and numbers close to overflow are treated by scaling. The algorithms use only the basic operations with directed rounding. If the successor (or predecessor) of a floating-point number is available, an algorithm in rounding to nearest is presented as well.

the set of denormalized numbers. Then F := F N ∪ F D ∪ {0} is the set of all precisionp base-β floating-point numbers. 1 Set F * := F ∪ {−∞, ∞} and let an arithmetic on F * following the IEEE 754 standard [7,8] be given. That means in particular that in RoundToNearest all floating-point operations have minimal error, bounded by the relative rounding error unit u := 1 2 β 1− p . Moreover, different rounding modes are available, also with best possible result.
In [19] we introduced the "unit in the first place" (ufp) which is defined by 0 = r ∈ R ⇒ ufp(r ) := β log β |r | and ufp(0) := 0. For all real r ∈ R it is the value of the left-most nonzero digit 2 in the base β-representation.
In contrast, the often used "unit in the last place" (ulp) depends on the precision of the floating-point format in use. For a nonzero finite base-β string it is the magnitude of its least significant digit, or in other words, the distance between the floating point number and the next floating point number of greater magnitude [6]. 3 There are several other definitions of the unit in the last place, in particular for real r / ∈ F, cf. [2,12,13]. We use the definition above, namely ulp(r ) = β e for r ∈ F N according to (1.1), ulp(r ) = β E min for r ∈ F D , and ulp(0) = 0. All definitions have in common that they depend not only on the basis β but also on the precision of the floating-point format in use.
We invented the unit in the first place in [19] because it was very helpful if not mandatory to formulate complicated proofs of the validity of our new floating-point algorithms for accurate summation and dot products. We developed a small collection of rules using ufp, so that based on that no further understanding of the many properties of the IEEE 754 floating-point arithmetic was necessary to follow the proofs.
The main difference in the definition of ulp compared to ufp is to separate the use of the basis and of the precision. First, ufp is defined for a general real number, only depending on the basis β, and second precision-related results use ufp and the relative rounding error unit, i.e., the precision p. That separation was useful to formulate our proofs in [19] and following papers.
There are simple algorithms to compute the predecessor, successor, unit in the first place, unit in the last place etc. in binary arithmetic [3,5,10,11,13,14], but apparently no method is known to compute the unit in the first place in a base-β arithmetic. Jean-Michel Muller [15] proposed a method based on the results in [9], however, it needs up to log 2 (β) iterations. We are interested in a flat, loop-free algorithm with few operations.
Recently we wrote a toolbox for an IEEE 754 precisionp base-β arithmetic with specifiable exponent range [18] as part of INTLAB [16], the Matlab/Octave toolbox for reliable computing. As part of this we present in this note a simple algorithm to compute the unit in the first place for a precisionp base-β arithmetic with p 1 and β 2. The algorithm works correctly in the underflow range, where numbers close to overflow are treated by scaling. That algorithm requires a directed rounding, i.e., RoundToZero, RoundUp or RoundDown; we could not construct a simple algorithm in RoundToNearest.
In addition, as a reply to suggestions by the referees, we present some additional algorithms to compute ufp and ulp. Those require a specific directed rounding mode and/or access to the predecessor/successor of a floating-point number. Since these algorithms are pretty obvious and the proofs of correctness are trivial, we banned them into the appendix.
We formulate our algorithm to compute the unit in the first place in the rounding mode RoundToZero and call the corresponding mapping fl : R → F. It follows that the result of a floating-point operation with positive real result x is max{ f ∈ F : f x}, and that operations cannot cause overflow.
The predecessor and successor of x ∈ R in F * is defined by respectively. In precisionp base-β arithmetic we have Note that (1.5) includes the case p = 1, β = 2 for which F is the set of powers of 2, i.e., f = ufp( f ) for all f ∈ F, and succ( f ) = f + β 1− p ufp( f ) = 2 f . Among the properties of ufp [19] is Next we present in Fig. 1 our algorithm to compute ufp( f ) for f ∈ F in precisionp base-β arithmetic and RoundToZero or RoundDown. It is obvious how to adapt the algorithm for RoundUp. We assume that subrealmin, the smallest positive denormalized floating-point number equal to β E min , is available. Overflow is easily avoided by proper scaling, but we omit that technical detail. Note that in a practical implementation, the constants p1 and phi in lines 2 and 3 of Algorithm ufp would be stored rather than calculated, and the extra input parameters p and beta would be omitted.  The usual problems in the denormalized range are avoided because q ∈ F N , so that the multiplication in line 5 are in the normalized range. The result of the final subtraction may be in the denormalized range but is error-free because of Sterbenz' lemma [21].
Proof The result is correct for f = 0, so henceforth we assume f = 0. We first verify that the used constants p1 and phi are in F. The rounding RoundToZero or RoundDown implies that p 1 in line 2 is the predecessor of 1, and (1.3) and E min −1 yield p 1 = 1 − β − p . Moreover, ϕ ∈ F follows by β p−1 + 1 β p β E max . Note that this includes the case ϕ = 2 for p = 1.
The input f is used only in line 4, and since ufp( f ) = ufp(| f |) we may henceforth assume without loss of generality that f > 0. The monotonicity of the rounding, (1.6) and (1.5) imply so that the rounding mode implies q = fl (ϕ f ) β p ufp( f ). Therefore, Hence q is always in the normalized range F N and f < β E max − p+1 yields ufp( f ) β E max − p and q β p ufp( f ) β E max . We distinguish two cases. First, assume ufp(q) = β p ufp( f ), which implies that q = β p ufp( f ) is a power of β. Then q β p β E min > β E min and (1.3) yield The theorem is proved.
Algorithm ufp will part of the flbeta toolbox in INTLAB. Executable INTLAB code, which is almost identical to the one given in Fig. 1, is shown in Fig. 2, Here flbeta is a user-defined data type, where the precision p 1, the base β 2 as well as the exponent range (E min , E max ) can be specified through initialization by flbetainit. As in every operator concept, an operation is executed in flbetaarithmetic if at least one of the operands is of type flbeta. The flbeta toolbox respects the rounding mode; in line 2 it is switched to RoundToZero using the internal Matlab command feature.
The result of p = flbetainit as in line 3 without input and with one output argument is the precision p in use. The constructor flbeta(m,e) generates the flbeta constant mβ e . Otherwise the code is self-explaining.
Finally we want to mention that the flbeta toolbox was very useful for testing in different precisions p, different bases β and exponent ranges E min , E max . Frankly speaking, we found Algorithm ufp experimentally when playing around with different possibilities. However, we did not find a simple algorithm in the nearest rounding RoundTiesToEven.
We close the main part of this note with some open problems. As has been mentioned, we did not succeed to find a simple algorithm to compute ufp solely in rounding to nearest. Here "simple" means few operations without loop. Problem 1.1 Given a precisionp base-β arithmetic following IEEE 754, find a simple algorithm to compute the unit in the first place (ufp) in rounding to nearest.
The problem is solved [17] in binary for p 1.

Problem 1.2
Given a precisionp base-β arithmetic following IEEE 754, find a simple algorithm to compute the unit in the last place (ulp) in rounding to nearest.
Concerning units of a floating-point number, there is a third quantity of interest, namely, the magnitude of the least nonzero digit in a finite base-β representation. Historically [15], Shewchuk [20] uses this quantity implicitly for defining his "nonoverlapping expansion", with the notation ω( f ) it appears in [4], and in [1] the notation uls( f ) (unit in the least significant place) is used. For example, in a precision-3 decimal arithmetic and f = 42 we have ufp( f ) = 10, ulp( f ) = 0.1 and uls( f ) = 1.

Problem 1.3
Given a precisionp base-β arithmetic following IEEE 754, find a simple algorithm to compute the unit in the least significant place (uls) in any rounding mode.

Appendix
We add some more algorithms to compute ufp and ulp requiring specific rounding modes, namely RoundDown, RoundUp and/or RoundToNearest.
To compute ulp( f ) in RoundUp [or RoundDown] we can just follow the definition The result is correct in precisionp base-β floating-point arithmetic for any nonzero floating-point number f with | f | < (β p − 1)β E max , i.e., with absolute value not equal to the largest representable floating-point number realmax. If there is a possibility to obtain the successor of a floating-point number, then replacing line 3 by s = succ(f); produces correct code in any rounding mode because the computation in line 4 is error-free (Fig. 3).
In RoundDown or RoundToZero, a little more effort is necessary to compute ulp( f ). The algorithm in Fig. 4 works for vector or matrix input as well. That is, by the way, also true for the previous algorithms.  The computed S in line 4 is correct for positive f ∈ F except powers of β in the normalized range. Otherwise, S is corrected in lines 5-8. The result is correct for Sometimes if-statements may cause quite some computational overhead. The algorithm in Fig. 5, working for nonzero f ∈ F with | f | < realmax, closes that gap. If f is not a power of β, then f + S is the successor of f, so that d = 0 in line 5. Hence S is not changed in line 6. Otherwise, if f is a power of β, then S = ulp(f)/beta and f + S is equal to f in floating-point in the chosen rounding modes. Hence d = -S = -ulp(f)/beta, and the computed S is corrected into ulp( f ) because d is a power of β and the computation in line 6 is error-free. However, the algorithm in Fig. 5 is about twice as slow as the previous one in Fig. 4.
Finally, if there is a possibility to obtain the successor of a floating-point number, then ufp can be calculated in any rounding mode by the algorithm in Fig 6. The algorithm works, as for Theorem 1.1, correctly for nonzero f ∈ F satisfying | f | < (β p − 1)β E max − p+1 except that f must be nonzero.
That means the posed Problem 1.1 to find a simple algorithm to compute the unit in the first place in RoundToNearest is solved if such an algorithm for the successor is available. In [17] we presented a simple algorithm for binary arithemtic but only estimates for precisionp base-β.