Performance output data and configurations of stencil compiler experiments run through PROVA!

The data in this article are related to the research article titled “Reproducible Stencil Compiler Benchmarks Using PROVA!”. Stencil kernels have been implemented using a naïve OpenMP (OpenMP Architecture Review Board, 2016) [1] parallelization and then using the stencil compilers PATUS (Christen et al., 2011) [2] and PLUTO (Bondhugula et al., 2008) [3]. Performance experiments have been run on different architectures using PROVA! (Guerrera et al., 2017) [4], a distributed workflow and system management tool for conducting reproducible research in computational sciences. Information such as the compiler version, compilation flags, configurations, experiment parameters, and raw results is fundamental contextual information for the reproducibility of an experiment. All this information is automatically stored by PROVA! and, for the experiments presented in this paper, is available at https://github.com/sguera/FGCS17.


Data
Information such as the compiler version, compilation flags, configurations, experiment parameters, and raw results is fundamental contextual information for the reproducibility of an experiment. All this information is automatically stored by PROVA! when creating a project, a method, or an experiment. Each of these holds a descriptor that stores the relevant information and makes it available to the tool and to users when needed.
For each software package used, PROVA! stores the build and installation recipes, in the form of easyconfig files (used by EasyBuild), together with the compilation and execution commands and their environment (captured automatically via environment modules, transparently to the user), self-documenting the whole research workflow from the creation of a project to the run of an experiment.
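For illustration, an easyconfig recipe of the kind PROVA! stores is a short declarative file. The sketch below is hypothetical — the package version, homepage, and source file name are placeholders, and the actual recipes are in the FGCS17 repository; only the toolchain (GCC/4.9.3-2.25, mentioned later for Experiment 4) is taken from this article:

```python
# Hypothetical easyconfig sketch in the EasyBuild format.
# Field values below are illustrative placeholders, not the real recipe.
easyblock = 'ConfigureMake'

name = 'PLUTO'
version = '0.11.4'            # placeholder version

homepage = 'http://pluto-compiler.sourceforge.net'
description = "Automatic parallelizer and locality optimizer for affine loop nests"

# Toolchain pins the compiler and its environment module
toolchain = {'name': 'GCC', 'version': '4.9.3-2.25'}

sources = ['pluto-%(version)s.tar.gz']

moduleclass = 'compiler'
```

Because the recipe pins the toolchain and sources, rebuilding the module on another system reproduces the same software environment.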
Thus the general structure of the data in the repository follows the project/method/experiment hierarchy described above.

Experimental design, materials, and methods

Experiment 1
In this experiment, a classical wave equation has been solved with a fourth-order-in-space, second-order-in-time finite difference method. After the discretization, three implementations have been produced: a naïve C + OpenMP version; a C source with PLUTO directives; and a source in a domain-specific language used as input for PATUS. The easyconfigs used to install these modules are available in the repository.
The experiment has been conducted using three-dimensional grids of size 200³ and IEEE single precision arithmetic (float), over 100 timesteps.
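The naïve C + OpenMP variant can be sketched as below. This is an illustrative kernel, not the code from the repository: it assumes a cubic grid with the grid spacing absorbed into a single coefficient, and uses the standard fourth-order central-difference Laplacian with the classical leapfrog (second-order) time update:

```c
#define N 16      /* tiny grid for illustration; the article uses 200^3 */
#define HALO 2    /* the 4th-order stencil reaches two points in each direction */

/* One time step of the wave stencil: up1 = 2*u - um1 + (c*dt/h)^2 * Laplacian(u).
   c2dt2_h2 is the precomputed coefficient c^2 * dt^2 / h^2. */
void wave_step(const float *u, const float *um1, float *up1, float c2dt2_h2)
{
    #pragma omp parallel for collapse(2)
    for (int z = HALO; z < N - HALO; z++)
        for (int y = HALO; y < N - HALO; y++)
            for (int x = HALO; x < N - HALO; x++) {
                int i = (z * N + y) * N + x;
                int s[3] = {1, N, N * N};   /* strides along x, y, z */
                /* 4th-order central second derivative summed over the axes */
                float lap = 0.0f;
                for (int d = 0; d < 3; d++)
                    lap += (-u[i - 2*s[d]] + 16.0f*u[i - s[d]] - 30.0f*u[i]
                            + 16.0f*u[i + s[d]] - u[i + 2*s[d]]) / 12.0f;
                up1[i] = 2.0f*u[i] - um1[i] + c2dt2_h2 * lap;
            }
}
```

The same update is what the PLUTO and PATUS variants express through directives and the DSL, respectively; only the code generation differs.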

Experiment 2
The second problem we chose to solve is blur filtering. The filter matrix of the smoothing used corresponds to a discrete two-dimensional Gaussian function: G(x, y) = 1/√(2πσ²) · exp(−(x² + y²)/(2σ²)), where σ denotes the width (i.e., standard deviation) of the bell-shaped function. Gaussian filters are isotropic if the filter matrix is large enough (at least 5 × 5, as in our case) to provide a sufficient approximation.
The size of our grids is 1024² points and we calculate 50 timesteps in IEEE single precision arithmetic.
The same systems and methods of Experiment 1 have been used.

Experiment 3
This experiment solves a classical heat equation, describing the temperature change over time given an initial temperature distribution and boundary conditions. A finite differencing scheme is employed to solve the heat equation numerically on a square region. The size of the grids is 512² and we calculate 100 timesteps in IEEE single precision arithmetic.
The implementation uses MPI for the parallelization.
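The explicit update at the core of such a solver can be sketched as below. This is an illustrative single-process kernel under the usual five-point scheme, not the repository code: the actual implementation distributes the grid over MPI ranks, and the halo exchange between ranks is omitted here:

```c
#define N 16   /* tiny grid for illustration; the article uses 512^2 */

/* One explicit finite-difference step of the 2D heat equation
   u_t = alpha * (u_xx + u_yy) on the interior of an N x N grid.
   r = alpha * dt / h^2; the explicit scheme is stable for r <= 0.25. */
void heat_step(const float *u, float *unew, float r)
{
    for (int y = 1; y < N - 1; y++)
        for (int x = 1; x < N - 1; x++) {
            int i = y * N + x;
            unew[i] = u[i] + r * (u[i - 1] + u[i + 1]
                                  + u[i - N] + u[i + N] - 4.0f * u[i]);
        }
}
```

In the MPI version each rank applies this update to its subdomain and exchanges one row/column of boundary values with its neighbors before every step.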
The compute system used is Mint (details presented above). To complement that description, it is equipped with a Mellanox InfiniBand MT26428 adapter [ConnectX VPI PCIe 2.0 5 GT/s - IB QDR / 10GigE].

Experiment 4
The problem solved is the same as presented in Experiment 1, using a naïve implementation parallelized with OpenMP, with NUMA-aware initialization. The compute system where it was run is the KNL partition. The environment module used is GCC/4.9.3-2.25.

Experiment 5
In this experiment a 2D Jacobi calculation is carried out, computing a new value for each element of the grid, such a value being an average of the element's current value and the values of its neighbors. Several grid sizes were used (512², 1024², 2048² and 4096²), calculating 10,000 timesteps. The implementation uses CUDA to parallelize the kernel. The experiments ran on a machine hosting an NVIDIA GeForce GTX 1060: 10 multiprocessors, 128 CUDA cores/MP, warp size of 32, 1.85 GHz max clock rate, 1.5 MiB L2 cache and 6 GiB of 4 GHz RAM. The device-to-device memory bandwidth is circa 14 GB/s, while the host-to-device and device-to-host bandwidth peaks at 12 GB/s (measured via the NVIDIA bandwidth test, available in CUDA).
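The Jacobi update rule can be sketched in plain C as below. The actual implementation is a CUDA kernel, typically launched with one thread per grid point; this serial sketch is hypothetical and only shows the arithmetic of one sweep (average of the point's current value and its four neighbors):

```c
#define N 16   /* tiny grid for illustration; the article uses 512^2 .. 4096^2 */

/* One 2D Jacobi sweep over the interior of an N x N grid:
   each point becomes the average of itself and its four neighbors.
   In the CUDA version this loop body is the per-thread kernel. */
void jacobi_step(const float *in, float *out)
{
    for (int y = 1; y < N - 1; y++)
        for (int x = 1; x < N - 1; x++) {
            int i = y * N + x;
            out[i] = 0.2f * (in[i] + in[i - 1] + in[i + 1]
                             + in[i - N] + in[i + N]);
        }
}
```

The two grids are swapped between sweeps, so each timestep reads one buffer and writes the other.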

Transparency document. Supporting information
Transparency data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.dib.2018.08.092.