Estimation of numerical reproducibility on CPU and GPU

—Differences in simulation results may be observed from one architecture to another or even inside the same architecture. Such reproducibility failures are often due to different rounding errors generated by different orders in the sequence of arithmetic operations. Reproducibility problems are particularly noticeable on new computing architectures such as multicore processors or GPUs (Graphics Processing Units). DSA (Discrete Stochastic Arithmetic) enables one to estimate rounding error propagation in simulation programs. In this paper, it is shown that DSA can be used to estimate which digits in simulation results may be different from one environment to another because of rounding errors. A particular implementation of DSA, which enables numerical validation in hybrid CPU-GPU environments, is described. The estimation of numerical reproducibility using DSA is illustrated by a wave propagation code which can be affected by reproducibility problems when executed on different architectures.


I. INTRODUCTION
R ESULTS of numerical simulations may be different from one architecture to another, or even inside the same architecture if they are computed using different compilers for instance.In sequential or parallel environments, different orders in the sequence of floating-point operations may lead to differences in rounding error propagation and therefore to reproducibility failures.It must be pointed out that the cause of differences in results may be difficult to identify: rounding errors or bug?Such differences are particularly noticeable with new computing architectures such as multicore processors, GPUs (Graphics Processing Units) and APUs (Accelerated Processing Units).In high performance numerical simulations, reproducibility problems have been identified in various domains: energy science [1], climate science [2], atomic or molecular dynamics [3], [4], fluid dynamics [5].Various studies have been carried out on numerical reproducibility on different architectures.On the one hand, strategies have been proposed [2], [3], [4], [5] to improve numerical accuracy, using for instance accurate summations.Other works aim at forcing the reproducibility of results, either affected by the same rounding errors [6], [7] or correctly rounded [8], [9], [10], [11].
DSA (Discrete Stochastic Arithmetic) [12], [13] enables one to estimate rounding error propagation in simulation programs.This paper shows that DSA can be used, not to force a code to be reproducible, but to estimate the number of digits in the results which may be different from one execution to another because of rounding errors.The CADNA1 library [14], [15], [16], which implements DSA, enables the numerical quality estimation of sequential programs in C or Fortran and of parallel programs using MPI for communication [17].This paper describes a version of CADNA, briefly introduced in [18], that can be used in hybrid CPU-GPU environments to estimate rounding errors in CUDA programs.This paper is organized as follows.In Section 2, differences in results provided by a wave propagation code executed on several architectures -CPU, GPU and APU -are pointed out.Section 3 presents the principles of DSA.Section 3 also describes the CADNA library and presents the particularities of a CADNA version for CPU-GPU codes.Section 4 shows that the reproducibility problems observed in wave propagation results can be explained by rounding error propagation thanks to the CADNA library.Finally, concluding remarks are presented in Section 5.

CODE
We consider the three-dimensional acoustic wave equation where u is the particle velocity, c is the wave velocity, and t is the time.This equation, used for instance in oil exploration [19], is solved with an explicit finite difference scheme of order 2 in time and p in space (in our case p = 8).We denote by u n i,j,k (respectively f n i,j,k ) the wave (respectively source) field in (i, j, k) coordinates and nth time step, a l (l = −p/2, . . ., p/2) the finite difference coefficients, ∆t the time step and ∆h the spatial step size.Two mathematically equivalent implementations of the finite difference scheme are proposed: where or a l u n i,j,k+l .
(2) In order to satisfy the CFL (Courant-Friedrichs-Lewy) necessary stability condition [20], the time step is computed by taking into account the wave velocity c, the spatial step size ∆h and the spatial order p.Because these two implementations require the same number of arithmetic operations, they should lead to similar performance.However it would be interesting to determine whether they differ in the numerical quality of their results.
The code is executed for 64 × 64 × 64 space steps and 1000 time iterations in IEEE-754 binary32 arithmetic with rounding to the nearest [21] and the following environments: OpenCL.Different kinds of reproducibility problems are observed.The results numerically vary 1) from one execution to another inside a GPU or an APU; these repeatability problems are due to differences in the execution order of the threads; 2) from one implementation of the finite difference scheme to another; the maximal relative difference between results is of the order of 10 −1 to 1 depending on the architecture, and the mean value of the relative difference between results is of the order of 10 −5 whatever the architecture; 3) from one architecture to another; again, the maximal relative difference between results is of the order of 10 −1 to 1 and its mean value is 10 −5 .Indeed if two sets of results computed in binary32 are compared, the results at the same space coordinates can have from 0 to 7 significant digits in common, and the average number of common significant digits is about 4. We recall that results computed using binary32 arithmetic precision can have at most 7 correct significant digits.To illustrate these reproducibility problems, Table I presents at three space coordinates (i, j, k), 0 ≤ i, j, k ≤ 63, the results obtained after 1000 times iterations using the processing units and languages previously mentioned.These results have different orders of magnitude.Both implementations of the finite difference scheme are considered.Considering the example points presented in Table I, any two results computed at the same point in the space domain have 3 to 6 common significant digits.

A. Principles of DSA
Based on a probabilistic approach, the CESTAC method [12] allows the estimation of rounding error propagation which occurs with floating-point arithmetic.When no overflow occurs, the exact result of any non exact floating-point arithmetic operation is bounded by two consecutive floating-point values R − and R + .The basic idea of the method is to perform each arithmetic operation N times, randomly rounding each time, with a probability of 0.5, to R − or R + .The computer's deterministic arithmetic, therefore, is replaced by a stochastic arithmetic where each arithmetic operation is performed N times before the next one is executed, thereby propagating the rounding error differently each time.The CESTAC method furnishes us with N samples R 1 , . . ., R N of the computed result R. The value of the computed result, R, is the mean value of {R i } 1 i N and the number of exact significant digits in R, C R , is estimated as τ β is the value of Student's distribution for N − 1 degrees of freedom and a probability level 1 − β.In practice N = 3 and β = 0.05.Indeed, it has been shown [22], [23] that N = 3 is in some reasonable sense the optimal value.The estimation with N = 3 is more reliable than with N = 2 and increasing the size of the sample does not improve the quality of the estimation.The probability of overestimating the number of exact significant digits of at least 1 is 0.054% and the probability of underestimating the number of exact significant digits of at least 1 is 29%.By choosing β = 0.05, we prefer to guarantee a minimal number of exact significant digits with a high probability (99.946%), even if we are often pessimistic by 1 digit.The complete theory can be found in [12], [23].
The validity of C R is compromised if the two operands in a multiplication or the divisor in a division are not significant [23].It is essential, therefore, that these results with no significance are detected and reported, since their subsequent use may invalidate the method.The need for this dynamic control of multiplications and divisions has led to the concept of the computational zero.A computed result is a computational zero, denoted by @.0, if ∀i, R i = 0 or C R ≤ 0. This means that a computational zero is either a mathematical zero or a number without any significance, i.e. numerical noise.
To establish consistency between the arithmetic operators and the relational operators, discrete stochastic relations are defined as follows.Let X = (X 1 , ..., X N ) and Y = (Y 1 , ..., Y N ) be two results computed using the CESTAC method, we have from [24] Discrete Stochastic Arithmetic (DSA) [12], [13] is the combination of the CESTAC method, the concept of the computational zero, and the discrete stochastic relationships.

B. Numerical validation of sequential codes using DSA
The CADNA software [14], [15], [16] is a library which implements DSA in any code written in C++ or in Fortran and allows to use new numerical types: the stochastic types.In practice, classic floating-point variables are replaced by the corresponding stochastic variables, which are composed of three floating-point values and an integer to store the accuracy.The library contains the definition of all arithmetic operations and order relations for the stochastic types.For instance, let us consider an arithmetic operation • ∈ {+, −, * , /} between two stochastic variables A and B. This arithmetic operation is performed three times on the associated floatingpoint values A i • B i , the rounding mode being randomly set to rounding towards +∞ or −∞.The control of accuracy is performed only on variables of stochastic type.Only exact significant digits of a stochastic variable are printed or "@.0" for a computational zero.Because all operators are redefined for stochastic variables, the use of CADNA in a program requires only a few modifications: essentially changes in the declarations of variables and in input/output statements.The CADNA software has been successfully used for the numerical validation of real-life applications [17], [25], [26], [27], [28].
Attention has been paid to rounding mode setting in terms of performance, because the rounding mode must be frequently changed.On CPU the rounding mode is determined by two bits in the Control Word, a 16-bit register in the FPU (Floating-Point Unit).At the beginning of a program using CADNA, the rounding mode is arbitrarily set to −∞.Then the rounding mode is randomly changed using the CADNA rnd_switch function that switches the rounding mode from +∞ to −∞, or from −∞ to +∞.To reduce the cost of rounding mode changes, the rnd_switch function is written in assembly language and is specific to the processor and the compiler chosen.The rnd_switch function changes in the FPU Control Word the two bits associated with the rounding mode.For performance reasons, a random number generator is not called at each arithmetic operation.A long random sequence is generated at the beginning of the program and stored in an array.Then successive array elements are used cyclically when random numbers are required.CADNA can detect numerical instabilities which occur during the execution of the code.When a numerical instability is detected, dedicated CADNA counters are incremented.At the end of the run, the value of these counters together with appropriate warning messages are printed on standard output.These warnings are of two types.1) Warnings related to the self-validation of CADNA.
These include: unstable multiplication where the two operands are computational zeroes and unstable division where the divisor is a computational zero.These warnings indicate that the validity of C R has been compromised and the CADNA results cannot be relied on.2) Warnings concerning other numerical instabilities.These instabilities can occur in overloaded mathematical functions or in branching statements involving a computational zero.A numerical instability is also reported in the case of a cancellation, i.e. the subtraction of two very close values which generates a sudden loss of accuracy.
At the end of the run, each type of anomaly together with their occurrences are printed.If no anomaly has been detected the computed results are reliable and the accuracy of each has been correctly estimated up to a certain probability.Otherwise the messages need to be analysed, the source of the anomaly identified and, if necessary, the code changed.The user can specify the instabilities to be detected.One may choose, for instance, to activate only self-validation, to detect all types of instabilities or to deactivate the detection of instabilities.

C. Numerical validation of hybrid CPU-GPU codes using DSA
An asynchronous implementation of the CESTAC method for the estimation of rounding errors in GPU codes written in CUDA is proposed in [29].Unfortunately with such an implementation of the CESTAC method, the whole code is executed several times with the random rounding mode and no instability in arithmetic operations can be detected.We present here a version of CADNA which implements DSA for the numerical validation of hybrid CPU-GPU codes written in On GPU an arithmetic operation can be performed with a specified rounding mode.For instance a multiplication with rounding towards +∞ can be executed using the fmul_ru function and a multiplication with rounding towards −∞ using the fmul_rd function.Therefore a stochastic operation on GPU implies three floating-point operations randomly rounded towards +∞ or −∞ using the appropriate arithmetic function.Unlike on CPU, random numbers are not stored in a global array on GPU, because it is incompatible with GPU programming paradigms.Each GPU thread being independent, it generates random numbers using its own seed and taking into account its own thread index.On GPU, random numbers are generated when they are required, i.e. during the stochastic operations.As a remark, because of the robustness of accuracy estimation by DSA [22], the quality of the random number generator is not a critical issue: only boolean values are required.
On CPU, numerical instabilities that occur during the execution are counted.Such a count is not performed on GPU because it would consume shared memory and require many atomic operations.On GPU an unsigned char is associated with each result to store the numerical instabilities that have affected it.Each bit of this char is associated with a type of instability.For instance, its last bit is set to 1 if the result has been affected by at least one unstable multiplication.Therefore in the CPU-GPU version of CADNA a stochastic variable is composed of three floating-point variables and four unsigned char: one for the accuracy, one for the instabilities and two for padding to respect memory alignment.Instability detection increases the execution time with CADNA.Cancellation detection is particularly costful because it requires to compare for all additions and subtractions the operands accuracy with the result accuracy.For performance reasons, the main instabilities can be detected with the GPU version of CADNA: the instabilities related to the self-validation of CADNA (unstable multiplications and unstable divisions) and the unstable branching statements.

IV. ESTIMATION OF REPRODUCIBILITY IN WAVE
PROPAGATION RESULTS BY MEANS OF DSA The acoustic wave propagation code has been executed with the CADNA library on CPU and on GPU.Results presented in this section have been computed in the following environments: • an AMD Opteron 6168 CPU with gcc 4.7.2 compiler; • an NVIDIA C2050 GPU with CUDA 5.0 platform.With implementations (1) and (2) of the finite difference scheme, the number of exact significant digits in the results computed with CADNA varies from 0 to 7. On CPU its mean value is 4.06 with both schemes; on GPU it is 3.43 with scheme (1) and 3.49 with scheme (2).These remarks are consistent with the observations described in Section II.Numerical instabilities occur during the execution: 272,394 losses of accuracy due to cancellations with scheme (1) and 285,186 with scheme (2).This kind of instability is detected by the CPU version of CADNA if the subtraction of two close floating-point numbers leads to a loss of accuracy of at least 4 digits.
Table II presents results obtained on CPU and on GPU at the same points in the space domain as in Table I.These results have been computed, on the one hand, using CADNA and, on the other hand, using IEEE floating-point arithmetic with rounding to the nearest.With CADNA, only the exact significant digits, i.e. the digits not affected by rounding errors, are printed.Results in the first four rows in Table II have been computed using binary32 arithmetic precision.Results in the last row have been obtained in binary64 on CPU with CADNA.Although CADNA prints 11 to 14 exact significant digits in these three results, only their first 10 digits are reported in Table II.The number of exact significant digits estimated by CADNA depends on the point considered in the space domain.As already mentioned in Section III, accuracy estimation by DSA is rather pessimistic than optimistic.Because of the probabilistic aspect of DSA, the number of exact significant digits estimated by CADNA may slightly differ on CPU and on GPU.In Table II   Figures 1 and 2 present, respectively on CPU and on GPU, the number of exact significant digits estimated by CADNA in the results computed with scheme (1) with respect to their absolute values.Similar results are observed with the other scheme.The highest results (in absolute value) are affected by low rounding errors and the highest rounding errors impact negligible results.Although the same trend can be observed on CPU and on GPU, there are differences between the two distributions due to the probabilistic aspect of DSA.Depending on the point in the space domain, the number of exact significant digits may be higher on CPU or on GPU.The average difference between results accuracy on CPU and on GPU is 0.6 digit.Furthermore because of differences in the execution order of the threads, the accuracy distribution may be slightly different from one execution to another on GPU.Table III presents execution times of the acoustic wave propagation code in the environments mentioned at the beginning of this section.Because the execution times measured with implementations (1) and ( 2) of the finite difference scheme are similar, only the performance of implementation ( 1) is reported in Table III.The code has been run in binary32 both on CPU and on GPU.CADNA has been used on CPU with several kinds of instability detection: • the detection of all kinds of instabilities; • no detection of instabilities.With this mode, which is not recommended, the execution time can be considered the minimum that can be obtained whatever instability detection chosen; • the detection of unstable multiplications, unstable divisions and unstable branching statements.This mode, which enables the self-validation of CADNA, is also available on GPU.One can notice that the cost of CADNA with instability detection in multiplications, divisions and branching statements is very close to its cost with no instability detection.Actually this code cannot generate such instabilities: in all multiplications at least one operand is a constant, all divisors are constants and it has no branching statement.The cost of CADNA with the detection of any kind of instability is 2.6 times higher.This is essentially due to the cancellation detection which is particularly expensive in terms of execution time.The cost of CADNA on GPU is about twice lower than on CPU with the same level of instability detection.This may be explained by the pipeline flush at each change of rounding mode on CPU which affects instruction level parallelism.V. CONCLUSION In this paper, we have shown that DSA can provide an estimation of the reproducibility of numerical programs.By estimating which digits are affected by rounding errors, DSA may explain why differences are observed in the results of a program executed in different environments.Therefore when the deployment of a code on a parallel architecture generates differences in the computed results, the presence of a bug can possibly be discarded.Based on a probabilistic approach, DSA can provide a trend of the distribution or the evolution of results accuracy.If results differences are due to different orders in the sequence of floating-point operations, a similar trend should be provided by DSA whatever the environment chosen.But a sequential implementation of DSA is not sufficient and efficient methods must be proposed for the numerical validation of large scale simulation programs.This paper has shown the feasibility of numerical validation with DSA for CPU-GPU programs.Furthermore DSA can be used for accuracy estimation in distributed memory environments [17].It has recently been shown how to take advantage of SIMD units such as AVX (Advanced Vector eXtensions) in programs using DSA [30].However, work must still be carried out to extend efficiently DSA to emerging computing architectures that are prone to numerical reproducibility failures.

FABIENNE
JEZEQUEL, JEAN-LUC LAMOTTE, ISSAM SAID: ESTIMATION OF NUMERICAL REPRODUCIBILITY ON CPU AND GPU CUDA.This version differs from the sequential version in two main respects: rounding mode change and instability detection.

Fig. 1 .
Fig. 1.Number of exact significant digits in the results computed on CPU with respect to their absolute values.

Fig. 2 .
Fig. 2. Number of exact significant digits in the results computed on GPU with respect to their absolute values.
one can notice that the digits provided by CADNA in binary32 are in common with those computed in binary64.Results reported in TableIIhave been computed using implementation (1) of the finite difference scheme.The same digits are provided by CADNA with the other implementation, except one less digit is given on GPU for point p 1 .

TABLE III EXECUTION
TIMES WITH AND WITHOUT CADNA ON CPU AND GPU