Performance and profiling data of plane-wave calculations in the Quantum ESPRESSO simulation on three supercomputing centres



Abstract

This dataset reflects the parallel execution profiles of five Quantum ESPRESSO simulation (QE) versions in finding the total energy of the Cerium Oxide lattice using the self-consistent field (SCF) method. The data analysis used a strong scaling setting to identify the optimal parameters and computing resources needed to complete a single SCF loop for one specific material efficiently. This analysis notably contributed to achieving the Best Performance Award at the 5th APAC HPC-AI Competition. The data comprises three sets. The first set features the parallel execution traces captured via the Extrae performance profiling tool, offering a broad view of the QE model's execution behaviour and how it used computational resources. The second set records how long the QE model ran on a single node at three HPC centres: ThaiSC TARA in Thailand, NSCC ASPIRE-1 in Singapore, and NCI Gadi in Australia. This set focuses on the impact of adjusting three parameters for K-point parallelisation. The final set presents benchmarking data generated by scaling out the QE model across 32 nodes (1,536 CPU cores) on the NCI Gadi supercomputer. Despite its focus on a single material, the dataset serves as a roadmap for researchers to estimate required computational resources and understand scalability bottlenecks, offering general guidelines adaptable across different HPC systems.
© 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Specifications Table

Subject: Computer Science
Specific subject area: This paper concerns the execution performance analysis, profiling and benchmarking of scientific workloads on high-performance computers.
Type of data: Tables, Paraver trace files, and CSV files
How the data were acquired: The profiling and performance data were acquired by conducting performance evaluation experiments on three supercomputers, all running the CentOS Linux operating system. The experiments used the strong scaling approach to optimise the QE software's plane-wave function (pw.x) based on the same input lattice of Cerium Oxide (CeO2, with 20 k-points, N3 = 72 and NR3 = 96) under the same pseudo energy condition. All associated software was set up as follows.
• The pre-existing QE software package on various HPC systems was activated using the "module load" command. This included QE 6.5 on ThaiSC TARA, QE 6.6 and 6.7 on NSCC ASPIRE-1, and QE 7.0 on NCI Gadi. Additionally, a newer version, QE 7.1, was built into our project directory on the NCI Gadi system for explorative purposes regarding scalability.
• Performance analysis tools pre-built from the Barcelona Supercomputing Center, namely Paraver, Dimemas, Clustering, and Spectral, were installed in the project space on NCI Gadi.
• The Extrae trace collection tool was configured and built from its source, linking with dependent libraries available in the NCI Gadi environment. The installation was in our project directory.
A dedicated folder was established to measure the performance data. The CeO2 input lattice was organised within the folder, accompanied by two pseudo-potential files. Each HPC centre's specific job scheduling system dictated the type of batch job scripts (i.e., PBS for both Gadi and ASPIRE-1, and Slurm for TARA). These scripts informed the scheduler about job execution specifics. The job submission script detailed the following:
• The requested number of CPU cores, memory allocation, and estimated runtime.
• Commands to configure the execution environment, establish parameter values (such as OMP_NUM_THREADS, NPOOL, NDIAG), load the QE software, and input the CeO2 data set with designated parameters.
• The use of either "mpirun" or "srun" to initiate the parallel QE job on HPC compute nodes. These commands facilitate the Message Passing Interface (MPI) runtime, ensuring seamless communication across all CPU cores during the job.
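A minimal sketch of the kind of batch job script the bullets above describe, in Gadi-style PBS syntax. The resource values, module name and output file name are illustrative assumptions, not the exact ones used in the experiments:

```shell
# Write an illustrative PBS job script (values are assumptions, see above).
cat > run_qe.pbs <<'EOF'
#!/bin/bash
#PBS -q normal
#PBS -l ncpus=48
#PBS -l mem=190GB
#PBS -l walltime=00:30:00
#PBS -l wd

module load quantumespresso/7.0   # module name is system-specific

export OMP_NUM_THREADS=1          # one of the three tunable parameters

# -npool and -ndiag set the K-point pool and diagonalisation parallelism
mpirun -np "$PBS_NCPUS" pw.x -npool 4 -ndiag 9 -input CeO2.in > CeO2.out
EOF
```

The script would then be submitted with "qsub run_qe.pbs" under PBS, or with the equivalent "sbatch" command under Slurm on TARA.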
In the first part of the dataset, the execution trace collection was carried out on NCI Gadi to capture the comprehensive execution behaviour and pinpoint potential performance bottlenecks. The procedure for this collection is outlined below:
• The LD_PRELOAD method was used to collect the trace files in this dataset. The method enabled Extrae to intercept MPI function calls, gathering execution data specifically for the pw.x function within QE. To facilitate this, the Extrae trace collection script, trace.sh, was inserted after the 'mpirun' command and before initiating the 'pw.x' command.
• The information collected in trace files during the QE software's runtime adheres to a profiling specification written in the XML format (extrae.xml).
• The traces from all CPU cores were saved temporarily in a directory named SET-0. Subsequently, these were merged into a raw trace file, TRACE.mpit. The raw trace file records time-stamped events generated by the QE execution.
• Concluding the trace collection procedure, the mpi2prv command was invoked to transform the TRACE.mpit files, created by Extrae, into a .prv format. This format is compatible with and can be interpreted by the Paraver performance analysis tool.
• Due to file size constraints, this dataset includes only the essential files required for visualisation using the Paraver performance analysis tool. These encompass the .prv (Paraver trace file), a .row, and a .pcf (Paraver Configuration File) file.
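The LD_PRELOAD step above can be captured in a small wrapper of the kind Extrae documents. The install prefix path is a placeholder and the library name is an assumption (QE is a Fortran+MPI code, so the Fortran tracing library is preloaded here); only the wrapper itself is created below, while the launch and merge commands are shown as comments:

```shell
# Create an illustrative trace.sh wrapper (paths are placeholders, see above).
cat > trace.sh <<'EOF'
#!/bin/bash
export EXTRAE_HOME=/path/to/extrae        # install prefix (ours was in the project directory)
export EXTRAE_CONFIG_FILE=extrae.xml      # profiling specification in XML format
export LD_PRELOAD=$EXTRAE_HOME/lib/libmpitracef.so   # intercepts MPI calls of Fortran codes
exec "$@"
EOF
chmod +x trace.sh

# Usage inside the job script, then merging the per-core traces:
#   mpirun -np 48 ./trace.sh pw.x -input CeO2.in > CeO2.out
#   mpi2prv -f TRACE.mpits -o qe.prv    # yields the .prv/.row/.pcf files for Paraver
```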
In the second and third parts of the dataset, the execution time of multiple QE runs was collected. Based on a full-factorial experimental design, each run used a distinct parameter-level combination, leading to varying input lattice distribution patterns. Every execution was dispatched to the compute nodes using batch job submission systems. Upon completing each job, the execution profile was individually documented in separate Excel worksheets.
Data format: Raw, analysed and filtered
Description of data collection: Performance data was gathered to analyse and identify the optimal parameter settings in line with the requirements of the 5th APAC HPC-AI Student Competition, held online between June and October 2022 in the Asia Pacific region, including Australia. Each team received identical input files for the CeO2 lattice for the competition, accompanied by its pseudo-potential specifications. The central challenge was to conduct a strong scaling analysis on the NCI Gadi supercomputer, aiming for the best execution time for the plane-wave function, which calculates the SCF loop for the provided input. Both the NCI Gadi and NSCC ASPIRE-1 supercomputers were made accessible to all competing teams. Each team was allocated the same amount of Service Units (SUs) and could utilise up to 32 48-core nodes in a single run. Though teams had the freedom to choose any QE version, version 7.0 was readily available on NCI Gadi. Teams could adjust three parameters: diagonalisation (NDIAG), number of CPU pools (NPOOL), and the number of OpenMP threads (OMP_NUM_THREADS). The use of a custom-built QE software stack was permissible. However, modifying any inputs or leveraging GPUs for acceleration was prohibited. The competition task required each team to conduct a single-node experiment to determine the optimal parameter settings before scaling out to multiple nodes. Given the permission to use a custom-built stack, we first gained insight into the QE execution dynamics before starting the single-node experiment. This was done by recording execution traces to visualise task distribution and pinpoint potential bottlenecks. Consequently, the first dataset includes these trace files, enabling a detailed execution trace visualisation through Paraver. Once these trace files are visualised within Paraver, one can thoroughly inspect the overall execution behaviour of QE. This provides a clear view of load imbalance issues arising from communication and computation, offering guidance on priority areas to enhance performance. A full-factorial experimental design was employed to determine the optimal parameter settings for executing the pw.x function on a single node. Since we also had access to the ThaiSC TARA supercomputer, which hosts two QE versions, we conducted the single-node experiment across each QE version at the three HPC centres. The second part of the dataset, "Screening performance study using a single computing node", was gathered during this phase. Analysing this data revealed the dominant parameters consistently observed across all QE versions tested. Additionally, variation in execution times across diverse environments in different HPC centres was observed. From this dataset, the best parameter setting for optimal performance was identified. The best single-node parameter setting (OMP_NUM_THREADS = 1, NPOOL = 4 and NDIAG = 9) was scaled out on the NCI Gadi multi-node environment. A critical new factor was finding the right number of nodes for k-point distribution and varying the "Process Per Resource" or PPR mapping available in the Open MPI implementation on the Gadi system. Again, a full-factorial experimental design was employed to observe each factor's impact systematically. The third part of this dataset, "Scalable performance data using multiple computing nodes", was collected during this process. The dataset includes the derived metrics against a single-core execution time as a baseline. The obtained speedup and Karp-Flatt metrics depicted the performance trend and underlined the overhead.
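The full-factorial design described above amounts to a nested sweep over the three tunable parameters, with one batch job per combination. The level values in this sketch are illustrative assumptions, not the exact levels recorded in the dataset:

```shell
# Enumerate every parameter-level combination (levels are illustrative).
for omp in 1 2 4; do
  for npool in 1 2 4 5; do
    for ndiag in 1 4 9 16; do
      echo "OMP_NUM_THREADS=$omp NPOOL=$npool NDIAG=$ndiag"
      # each combination would be submitted as its own batch job, e.g.:
      # qsub -v OMP=$omp,NPOOL=$npool,NDIAG=$ndiag run_qe.pbs
    done
  done
done
```

With 3 x 4 x 4 levels, this sketch yields 48 runs per QE version.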

Value of the Data
• Despite focusing on one material with a specific set of input parameters, the scaling analysis data depicts general trends and identifies performance bottlenecks likely to be indicative of challenges faced in other computational setups. Thus, the data can assist material science researchers in estimating the required computational resources and the bottlenecks that impact their simulations.
• The execution data and profiling traces can complement larger datasets, contributing to developing models for HPC performance optimisation and anomaly detection.
• The dataset provides a template for conducting and presenting performance and scalability analysis. It offers general guidelines on how profiling data can be used to gain further insights into parallel execution and communication behaviour that could be adapted to different computational settings and systems.
• Experimental data and the data collection methodology could be used for practical training on scalability analysis of parallel and distributed systems.
• The job submission scripts, input and output data, and build scripts could be used as templates for domain scientists to adapt to different computational settings and systems.

Objective
This dataset offers an exploration into the distributed processing behaviour of an SCF loop calculation in the Quantum ESPRESSO software across various computational environments. The dataset was generated in response to the 5th APAC HPC-AI competition, aiming for optimal HPC performance. The K-point distribution was refined through detailed execution profile analysis; using a strong scaling setup, the goal was to identify the parameter setting giving the fastest execution time when expanding the workload with a consistent input set (CeO2 lattice) on the specific pw.x function. The dataset was realised through a three-step process shown in Fig. 1. Initially, the focus was visualising how the workload was distributed across multiple cores at run time to pinpoint inefficiencies. With that foundational knowledge, various versions of QE were tested on a single computational node at three HPC centres: ThaiSC TARA (Thailand), NSCC ASPIRE-1 (Singapore), and NCI Gadi (Australia), to identify optimal settings for maximal performance at that scale. Finally, having determined the best settings for a single node, those settings were leveraged as a starting point for scaling the workload across a larger cluster of nodes. This scaling aimed to fine-tune parameters further to achieve the best performance when using up to 32 48-core nodes of the NCI Gadi supercomputer. This data article provides the performance and profiling dataset used in the analysis, which other researchers can use to reproduce our results, perform case studies, or test different parameter tuning strategies.

Data Description
The dataset reflects the execution profiles and performance of calculating a plane-wave self-consistent field using the Quantum ESPRESSO simulation software [1, 2] on HPC systems. Data was collected by launching batch jobs to the compute partitions of the HPC centres using 1 to 1,536 CPU cores; each job took the same input file and aimed to calculate the total energy of the Cerium Oxide lattice. The exchange-correlation potential was treated within the generalized gradient approximation (GGA) using the Perdew-Burke-Ernzerhof (PBE) exchange-correlation functional [9, 10]. The interaction between electrons and ions was described by the Projector Augmented Wave pseudopotential. The valence electronic-structure wave function was expanded in the plane-wave basis set. The kinetic energy cutoff for the wavefunctions was set to 50 Ry, and that for the charge density was set to 400 Ry. The input lattice represents a finite number of points in the Brillouin zone, called the k-point mesh, which can be manually partitioned for parallel calculation by multiple CPU cores using the pw.x function. The k-point mesh was initially set to 20 points in the calculation. The function distributes the partitioned mesh to CPU pools for iterative calculation using the Message Passing Interface (MPI) and OpenMP-enabled libraries.
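As a rough illustration of the k-point distribution just described: with NPOOL pools, each pool receives an almost equal share of the mesh. The helper below is ours, not QE's (QE's internal distribution logic is more involved); it only shows the resulting per-pool counts:

```shell
# Show how NK k-points split across NPOOL pools (simplified balance rule).
kpoints_per_pool() {  # usage: kpoints_per_pool NK NPOOL
  awk -v nk="$1" -v np="$2" 'BEGIN {
    base = int(nk / np); extra = nk % np
    for (i = 0; i < np; i++)
      printf "pool %d: %d k-points\n", i, base + (i < extra ? 1 : 0)
  }'
}
kpoints_per_pool 20 4   # the 20 k-points of the CeO2 input over 4 pools
```

With NPOOL = 4, each pool works on 5 k-points, keeping the pools perfectly balanced for this input.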
The dataset comprises three parts, provided as Paraver traces and as tabular data stored in spreadsheet files (.xlsx).

Detailed Execution Trace Visualised by Using Paraver
The first set is the single-node execution trace (.prv files) collected on a single node (48 CPU cores) of the NCI Gadi supercomputer using the Extrae and Paraver tools [3]. These trace files must be opened with the Paraver tool to interactively view different aspects of QE's runtime activity. Table 1 and Fig. 2 below give an example of the performance information one can obtain from the trace files.

Screening Performance Study Using a Single Computing Node
The second part of the dataset contains the execution time and parallel speedup for a single node and up to 240 cores, collected from the ThaiSC TARA, NCI Gadi and NSCC ASPIRE-1 supercomputers. The data is presented in three spreadsheet files (.xlsx format) and can be used to create the plots shown in Figs. 3-8 to analyse single-node performance. A broad range of execution times was observed when solving the same problem using the same number of CPU cores in different environments (shown in Fig. 9).

Scalable Performance Data Using Multiple Computing Nodes
The third part of the dataset contains the execution time and performance data collected when scaling the execution from 1 to 32 NCI Gadi supercomputer nodes, varying two parameters, i.e., NPOOL and NDIAG. The data is presented in three spreadsheet files (.xlsx format) and can be used to create the plots shown in Figs. 10-13 to analyse the scalability of the workload.

Parallel execution time of QE 7.0 and 7.1 on NCI Gadi supercomputer
Fig. 10. Execution time of QE 7.0 vs QE 7.1 running pw.x (CeO2) on the NCI Gadi supercomputer, measured on 1-32 48-core Gadi nodes using three parameter settings. First, dividing the k-points into four pools (#pool = 4) with 9 diagonalisation worker threads (#diag = 9) was the best setting found in the single-node experiment. The latter two settings were based on partitioning the k-points to fit the number of nodes (#pool = #nodes). Two sizes of diagonalisation worker groups (#diag = 9 and 16) were explored. The best execution time observed is shown in the plot (16-thread diagonalisation).
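The multi-node launches behind Fig. 10 can be sketched as follows. The helper only assembles the mpirun command line; --map-by ppr:N:node is Open MPI's "Process Per Resource" mapping, and the exact flag combination used in the runs is an assumption on our part:

```shell
# Assemble (not execute) an illustrative multi-node launch command in which
# #pool tracks #nodes, as in the latter two settings of Fig. 10.
qe_cmd() {  # usage: qe_cmd NODES NDIAG
  local nodes="$1" ndiag="$2"
  echo "mpirun -np $((nodes * 48)) --map-by ppr:48:node pw.x -npool $nodes -ndiag $ndiag -input CeO2.in"
}
qe_cmd 8 16   # 8 Gadi nodes, 16-thread diagonalisation
```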

Aims
This dataset was collected specifically for the HPC benchmarking challenge at the 2022 APAC HPC-AI student competition [4]. The central goal of this challenge was to optimise the tunable parameters to reach the best performance, i.e., the shortest CPU time for calculating the system energy on the NCI Gadi supercomputer [5]. Additionally, benchmarks were carried out on the NSCC ASPIRE-1 [6] and ThaiSC TARA [7] systems. This was to gain insight into the software structure and evaluate the performance impact of the tunable parameters.

Materials
• Access to QE. The QE software was accessed by loading module files at the HPC centres and by custom-building the software from the source code available on the QE GitHub. This dataset has been collected using five versions of the QE software (summarised in Table 3).
• Job submission
• Input files: CeO2.in - includes all the parameters to set up an SCF calculation using QE; pseudopotentials - under the folder named /pseudo. The choice of pseudopotentials for the Ce and O atoms is based on the Standard Solid-State Pseudopotentials (SSSP).
• Submission script: Jobs must be tested on a single node first, then scaled up to a maximum of 32 48-core nodes in Gadi's normal queue. An example job script follows.
#!/bin/bash
#PBS -q normal
#PBS -l walltime=00:30:00

• Tunable and non-tunable parameters. The parameters of the CeO2 energy calculation that were adjusted are the number of pools (-npool), diagonalisation (-ndiag) and the number of OpenMP threads. Parameters that are directly associated with the model accuracy were NOT allowed to be modified, i.e., the pseudopotentials and the input file (CeO2.in).

• Platforms
The execution platforms and the computing resources provided by the three HPC centres are summarised in Table 2.
• Performance Analysis Software
The Extrae/Paraver performance profiling tool was installed on NCI Gadi and used for collecting execution traces. Fig. 13 shows the method used to build the Extrae/Paraver tools and the steps to collect the single-node execution profiles of the QE workloads. The profiles were then visualised using Paraver to observe the change in runtime behaviour under different configurations.

• Output
- Execution traces
- Some measured and derived performance metrics (summarised in Table 4) have been collected.

Experimental Design and Methods
A complete full-factorial experimental design was used in studying single-node performance and multi-node scalability. Tables 3-5 summarise the factor-level configurations, metrics and commands used for collecting the execution times reported in this data set. Five versions of the QE software, available or custom-built on the three HPC centres (shown in Table 3), were run on each centre's compute nodes. The execution time of running the QE workload using a single core, one pool and one diagonalisation thread served as the baseline performance. Derived performance metrics, i.e., the speedup factor, efficiency and the experimentally determined serial fraction (e) from the Karp-Flatt metric [8], were calculated using the baseline execution time (Fig. 14).
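The derived metrics above follow standard definitions: speedup S = T1/Tp, efficiency E = S/p, and the Karp-Flatt experimentally determined serial fraction e = (1/S - 1/p)/(1 - 1/p). A small awk helper illustrates the arithmetic; the timing values fed to it are made up for the example:

```shell
# Compute speedup, efficiency and the Karp-Flatt serial fraction from a
# baseline time T1 and a p-core time Tp (input numbers below are made up).
metrics() {  # usage: metrics T1 Tp p
  awk -v t1="$1" -v tp="$2" -v p="$3" 'BEGIN {
    s = t1 / tp
    printf "speedup=%.2f efficiency=%.2f karp_flatt=%.4f\n",
           s, s / p, (1 / s - 1 / p) / (1 - 1 / p)
  }'
}
metrics 1000 40 48   # e.g. a 1000 s baseline run taking 40 s on 48 cores
```

A serial fraction e that grows with p points to overheads (communication, load imbalance) rather than inherent serial work dominating the runtime.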

Fig. 1.The systematic approach for dataset generation to meet the objective.

Fig. 11. Parallel speedup of QE version 7.0 measured on 1-32 48-core nodes of the NCI Gadi supercomputer. Performance is benchmarked against a single-core, single-pool, single-diagonalisation-thread setup.

Table 1
Example of execution trace visualization and performance information generated by Paraver.

Table 2
Configuration of HPC Centres.

Table 5
Experimental design for multi-node performance study.