Performance data of multiple-precision scalar and vector BLAS operations on CPU and GPU

Many optimized linear algebra packages support the single- and double-precision floating-point data types. However, there are a number of important applications that require a higher level of precision, up to hundreds or even thousands of digits. This article presents performance data of four dense basic linear algebra subprograms – ASUM, DOT, SCAL, and AXPY – implemented using existing extended-/multiple-precision software for conventional central processing units and CUDA-compatible graphics processing units. The following open source packages are considered: MPFR, MPDECIMAL, ARPREC, MPACK, XBLAS, GARPREC, CAMPARY, CUMP, and MPRES-BLAS. The execution time of CPU and GPU implementations is measured at a fixed problem size and various levels of numeric precision. The data in this article are related to the research article entitled “Design and implementation of multiple-precision BLAS Level 1 functions for graphics processing units” [1].


Value of the data
• The data obtained allow comparing the efficiency (in terms of execution time) of various multiple-precision packages when performing BLAS Level 1 operations, which are the building blocks of many linear algebra algorithms.
• The data could be useful for developing GPU-accelerated applications that require more precision than the standard double precision available in most existing BLAS libraries.
• They can benefit researchers dealing with scientific and engineering calculations that are sensitive to rounding errors (e.g., ill-conditioned linear systems and eigenvalue problems).
• These data can also be used to understand the performance impact of computations with higher levels of precision performed on multicore processors and massively parallel graphics processing units.

Data description
The data presented in this paper are performance measurements of the ASUM, DOT, SCAL, and AXPY functions from Level 1 BLAS [2], implemented using multiple-precision software for central processing units (CPUs) and CUDA-enabled graphics processing units (GPUs). The ASUM operation computes the sum of magnitudes of the vector elements. The DOT operation computes a vector-vector dot product. The SCAL operation computes the product of a vector and a scalar. The AXPY operation computes a vector-scalar product and adds the result to another vector. The complete dataset consists of 60 CSV files (raw data) and two tables (processed data).
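For reference, the semantics of the four operations can be sketched in plain double precision as follows; this is not code from any of the evaluated packages, which implement the same operations in multiple-precision arithmetic.

```cpp
#include <cmath>
#include <cstddef>

// Reference double-precision forms of the four BLAS Level 1 operations
// (unit strides, no error handling); the evaluated packages implement the
// same semantics with multiple-precision arithmetic.

// ASUM: sum of magnitudes of the vector elements.
double asum(std::size_t n, const double* x) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += std::fabs(x[i]);
    return s;
}

// DOT: vector-vector dot product.
double dot(std::size_t n, const double* x, const double* y) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i] * y[i];
    return s;
}

// SCAL: product of a vector and a scalar, x = alpha * x.
void scal(std::size_t n, double alpha, double* x) {
    for (std::size_t i = 0; i < n; ++i) x[i] *= alpha;
}

// AXPY: vector-scalar product added to a vector, y = alpha * x + y.
void axpy(std::size_t n, double alpha, const double* x, double* y) {
    for (std::size_t i = 0; i < n; ++i) y[i] += alpha * x[i];
}
```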
The raw data of the experiments are available at the Mendeley Data repository [3]. The raw data are organized in two folders named "1. Precision from 120 to 2400 bits" and "2. Precision from 106 to 424 bits". The first folder contains the performance data of implementations using the MPFR, ARPREC, MPDECIMAL, MPACK, GARPREC, CUMP, and MPRES-BLAS packages for precisions of 120, 240, 480, 720, 960, 1200, 1440, 1680, 1920, 2160, and 2400 bits. The second folder contains the performance data of implementations using the XBLAS, CAMPARY, and MPRES-BLAS packages for precisions of 106, 212, 318, and 424 bits. Each raw file contains the results of three test runs at a fixed operation size of 1,000,000. For each test run, the BLAS function was repeated ten times, and the raw file presents the total execution time of the ten iterations (in milliseconds).
The processed data are reported in Tables 1 and 2. Table 1 presents the average execution time of the MPFR, ARPREC, MPDECIMAL, MPACK, GARPREC, CUMP, and MPRES-BLAS packages with precisions from 120 to 2400 bits. Table 2 reports the average execution time of the XBLAS, CAMPARY, and MPRES-BLAS packages with precisions of 106, 212, 318, and 424 bits. The tables make it possible to evaluate the benefits of using GPUs for extended-/multiple-precision computations.
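For illustration, the sketch below shows how an average per-call time can be derived from a raw file; it assumes a simplified layout of one total-time value (in milliseconds) per run, which is an assumption rather than the documented format of the repository files.

```cpp
#include <fstream>
#include <iostream>
#include <vector>

// Hypothetical post-processing sketch: read the per-run totals from a raw
// file (assumed here to hold one total time per line, in milliseconds),
// average them over the three runs, and divide by the ten repetitions of
// the BLAS call to obtain the mean time of a single call.
int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: avg <raw.csv>\n"; return 1; }
    std::ifstream in(argv[1]);
    std::vector<double> totals;
    double t;
    while (in >> t) totals.push_back(t);   // total time of 10 iterations per run
    if (totals.empty()) return 1;
    double sum = 0.0;
    for (double v : totals) sum += v;
    const double iterations = 10.0;
    std::cout << "mean time per call, ms: "
              << sum / totals.size() / iterations << "\n";
    return 0;
}
```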

Experimental design, materials, and methods
All the experiments were carried out at a fixed operation size of 1,000,000. The input vectors were composed of randomly generated floating-point numbers in the range [−1, 1]. In order to generate uniformly distributed random significands, we used the mpz_urandomb function from the GNU MP Bignum Library (https://gmplib.org/). Measurements do not include the time spent transferring data between the CPU and the GPU. We also excluded the time spent converting data into internal multiple-precision representations.
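The following sketch shows one possible way to produce such inputs with GMP and MPFR; the experiments are only described as using mpz_urandomb for the significands, so the exact scaling to [−1, 1] shown here is an assumption.

```cpp
#include <gmp.h>
#include <mpfr.h>

// A minimal sketch of generating a multiple-precision value in [-1, 1] with a
// uniformly distributed random significand; the experiments are described as
// using mpz_urandomb, but this particular scaling scheme is an assumption.
void random_value(mpfr_t x, gmp_randstate_t state, mpfr_prec_t prec) {
    mpz_t sig;
    mpz_init(sig);
    mpz_urandomb(sig, state, prec);        // random integer in [0, 2^prec)
    mpfr_set_z(x, sig, MPFR_RNDN);         // exact: x holds prec bits
    mpfr_div_2ui(x, x, prec, MPFR_RNDN);   // scale to [0, 1)
    mpfr_mul_ui(x, x, 2, MPFR_RNDN);
    mpfr_sub_ui(x, x, 1, MPFR_RNDN);       // shift to [-1, 1)
    mpz_clear(sig);
}

// Usage: initialize the state with gmp_randinit_default / gmp_randseed_ui and
// each vector element with mpfr_init2(x, prec) before calling random_value.
```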
The function clock_gettime was used to measure the execution times of the CPU implementations. For GPU implementations, the execution times were measured using CUDA events. In order to reduce the impact of noise, no other applications were running during the tests, and the GUI was not used. Three runs were performed for each test case. In each test run, the BLAS function under evaluation was repeated ten times, and the total execution time of all iterations was measured.
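A minimal timing sketch along these lines is shown below; run_blas_cpu and run_blas_gpu are placeholders for the routine under test, not functions from the evaluated packages.

```cpp
#include <ctime>            // POSIX clock_gettime / CLOCK_MONOTONIC
#include <cuda_runtime.h>

// Timing sketch following the described methodology: the BLAS call is
// repeated ten times and the total elapsed time is recorded in milliseconds.

double time_cpu_ms() {
    timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 10; ++i) {
        // run_blas_cpu();                      // placeholder CPU routine
    }
    clock_gettime(CLOCK_MONOTONIC, &stop);
    return (stop.tv_sec - start.tv_sec) * 1e3 +
           (stop.tv_nsec - start.tv_nsec) * 1e-6;
}

float time_gpu_ms() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 10; ++i) {
        // run_blas_gpu();                      // placeholder kernel launches
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```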
A summary of the experimental setup is given in Table 3 . Table 4 contains a brief description of the considered multiple-precision software.
Using arithmetic operations from MPFR, ARPREC, MPDECIMAL, GARPREC, CUMP, and CAMPARY, we implemented multiple-precision ASUM, DOT, SCAL, and AXPY for CPU and GPU. The CPU-based codes were developed using OpenMP and executed in parallel with 4 threads on 4 physical cores.
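As an example of this approach, the sketch below implements a multiple-precision ASUM on top of MPFR arithmetic with 4 OpenMP threads; it is a simplified illustration, not the benchmarked code.

```cpp
#include <mpfr.h>
#include <omp.h>

// A minimal sketch of a multiple-precision ASUM built from MPFR operations
// and parallelized with OpenMP (the experiments used 4 threads). Strides and
// error handling are omitted; each thread accumulates a private partial sum.
void mpfr_asum(mpfr_t result, mpfr_t* x, long n, mpfr_prec_t prec) {
    mpfr_set_zero(result, 1);
    #pragma omp parallel num_threads(4)
    {
        mpfr_t local, tmp;
        mpfr_init2(local, prec);
        mpfr_init2(tmp, prec);
        mpfr_set_zero(local, 1);
        #pragma omp for
        for (long i = 0; i < n; ++i) {
            mpfr_abs(tmp, x[i], MPFR_RNDN);      // |x_i|
            mpfr_add(local, local, tmp, MPFR_RNDN);
        }
        #pragma omp critical
        mpfr_add(result, result, local, MPFR_RNDN); // merge partial sums
        mpfr_clear(local);
        mpfr_clear(tmp);
    }
}
```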
For MPACK, we used the mpreal data type and the Rasum, Rdot, Rscal, and Raxpy routines, which are based on MPFR C++ (http://www.holoborodko.com/pavel/mpfr/). Note that only the Rdot and Raxpy routines support multi-threaded calculations; these routines were executed with 4 OpenMP threads, whereas Rasum and Rscal were executed with a single thread. For XBLAS, we evaluated the double-double precision routines BLAS_dsum_x, BLAS_ddot_x, and BLAS_dwaxpby_x, which provide 106 bits of internal precision. Since XBLAS does not support parallel computation, these routines were executed with a single thread. Note that the BLAS_dsum_x routine computes the sum of the vector elements, not the sum of their absolute values. Furthermore, XBLAS does not implement the SCAL operation.
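A hedged sketch of calling the XBLAS dot routine is shown below; the header name and exact prototype of BLAS_ddot_x are assumptions that should be checked against the installed XBLAS distribution.

```cpp
#include <blas_extended.h>   // header name may differ across XBLAS installations

// Sketch of an extended-precision dot product through XBLAS: the routine is
// documented to compute r = alpha * x^T y + beta * r, carrying the
// accumulation in double-double (106-bit) arithmetic when blas_prec_extra is
// requested. The argument order shown here follows the reference C interface
// but should be verified against the local blas_extended headers.
void ddot_extra(int n, const double* x, const double* y, double* r) {
    double alpha = 1.0, beta = 0.0;
    *r = 0.0;
    BLAS_ddot_x(blas_no_conj, n, alpha, x, 1, beta, y, 1, r, blas_prec_extra);
}
```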
In the case of MPRES-BLAS, we used the routines mpasum, mpdot, mpscal, and mpaxpy. These routines are implemented as host functions that invoke GPU kernels. Each routine has a set of template parameters that specify the kernel execution configurations. These parameters are described in Table 5. Table 6 shows the kernel execution configurations used in the experiments.
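The role of these template parameters can be illustrated with the hypothetical sketch below: a host routine whose compile-time parameters fix the launch configuration of the underlying kernel. It mimics the mechanism only and is not MPRES-BLAS source code.

```cpp
#include <cuda_runtime.h>

// Hypothetical illustration of how compile-time template parameters can fix
// the kernel execution configuration of a host-side BLAS routine, in the
// spirit of the MPRES-BLAS interface described above (not its actual code).

__global__ void sum_blocks(const double* x, double* partial, int n) {
    extern __shared__ double buf[];
    double s = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        s += x[i];
    buf[threadIdx.x] = s;
    __syncthreads();
    for (int step = blockDim.x / 2; step > 0; step /= 2) {
        if (threadIdx.x < step) buf[threadIdx.x] += buf[threadIdx.x + step];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = buf[0];
}

// gridDim1 / blockDim1 play the same role as the parameters in Table 5:
// they are chosen by the caller and fixed at compile time.
template <int gridDim1, int blockDim1>
void my_sum(const double* d_x, double* d_partial, int n) {
    sum_blocks<<<gridDim1, blockDim1, blockDim1 * sizeof(double)>>>(d_x, d_partial, n);
    // A second, small launch (or a host-side loop) would reduce d_partial.
}

// Example configuration, analogous to an entry in Table 6:
// my_sum<1024, 64>(d_x, d_partial, 1000000);
```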

Table 5
Template parameters of the MPRES-BLAS routines; for details, see [1].

mpasum
• gridDim1: the number of blocks for parallel summation
• blockDim1: the number of threads per block for parallel summation

mpdot
• gridDim1: the number of blocks for computing the signs, exponents, RNS interval evaluations, and for rounding the result in vector-vector multiplication
• blockDim1: the number of threads per block for computing the signs, exponents, RNS interval evaluations, and for rounding the result in vector-vector multiplication
• gridDim2: the number of blocks for computing the digits (residues) of multiple-precision significands in vector-vector multiplication
• gridDim3: the number of blocks for reducing the vector of products
• blockDim3: the number of threads per block for reducing the vector of products

mpscal, mpaxpy
• gridDim1: the number of blocks for computing the signs, exponents, RNS interval evaluations, and for rounding the result
• blockDim1: the number of threads per block for computing the signs, exponents, RNS interval evaluations, and also for rounding the result
• gridDim2: the number of blocks for computing the digits (residues) of multiple-precision significands

Table 6
MPRES-BLAS execution configurations used in the experiments.