University of Warsaw Lagrangian Cloud Model (UWLCM) 2.0: Adaptation of a mixed Eulerian-Lagrangian numerical model for heterogeneous computing clusters

A numerical cloud model with Lagrangian particles coupled to an Eulerian flow is adapted for distributed memory systems. Eulerian and Lagrangian calculations can be done in parallel on CPUs and GPUs, respectively. The fraction of time when CPUs and GPUs work simultaneously is maximized at around 80% for an optimal ratio of CPU and GPU workloads. The optimal ratio of workloads is different for different systems, because it depends on the relation between the computing performance of CPUs and GPUs. The GPU workload can be adjusted by changing the number of Lagrangian particles, which is limited by device memory. Lagrangian computations scale with the number of nodes better than Eulerian computations, because the former do not require collective communications. This means that the ratio of CPU and GPU computation times also depends on the number of nodes. Therefore, for a fixed number of Lagrangian particles, there is an optimal number of nodes, for which the time CPUs and GPUs work simultaneously is maximized. Scaling efficiency up to this optimal number of nodes is close to 100%. Simulations that use both CPUs and GPUs take between 10 and 120 times less time and use between 10 and 60 times less energy than simulations run on CPUs only. Simulations with Lagrangian microphysics take up to eight times longer to finish than simulations with Eulerian bulk microphysics, but the difference decreases as more nodes are used. The presented method of adaptation for computing clusters can be used in any numerical model with Lagrangian particles coupled to an Eulerian fluid flow.

gradient, which are applied implicitly. The pressure perturbation is solved using the generalized conjugate residual solver (Smolarkiewicz and Margolin, 2000). Diffusion of Eulerian fields caused by subgrid-scale (SGS) turbulence can be modeled with a Smagorinsky-type model (Smagorinsky, 1963) or with the implicit LES approach (Grinstein et al., 2007). Cloud microphysics can be modeled with a single- or double-moment bulk scheme, or with a Lagrangian particle-based model. Depending on the microphysics model, simulations are named UWLCM-B1M (single-moment bulk scheme), UWLCM-B2M (double-moment bulk scheme) or UWLCM-SDM (super-droplet method). Details of the microphysics models can be found in Arabas et al. (2015). In both bulk schemes, cloud water and rain water mixing ratios are prognostic Eulerian variables. In the double-moment scheme, cloud droplet and rain drop concentrations are also prognostic Eulerian variables. In the Lagrangian particle-based scheme, hydrometeors are represented by computational particles (super-droplets).

We first discuss simulations with Lagrangian microphysics computed by GPUs, as this is the most complicated case. Simpler cases with microphysics calculated by CPUs will be discussed afterwards.
The difficulty in designing a distributed memory implementation of code in which CPUs and GPUs simultaneously conduct different tasks lies in obtaining a balanced workload distribution between the different processing units. This is because GPUs have higher throughput than CPUs, but GPU device memory is limited, which puts an upper limit on the GPU workload. Taking this into account, we chose a domain decomposition approach that is visualized in fig. 2. The modeled domain is divided into equal slices along the horizontal axis x. Computations in each slice are done by a single MPI process, which can control multiple GPUs and CPU threads. Cloud microphysics within the slice is calculated on GPUs, with super-droplets residing in GPU device memory. Eulerian fields in the slice reside in host memory and their evolution is calculated by CPU threads. Since the CPU and GPU data assigned to a process are colocated in the modeled space, all CPU-to-GPU and GPU-to-CPU communications happen via PCI-Express and do not require inter-node data transfer. The only inter-node communications are CPU-to-CPU and GPU-to-GPU. If an MPI process controls more than one GPU, computations within the subdomain of that process are divided among the GPUs, also using domain decomposition along the x axis. An MPI process will typically control more than one CPU thread, because cluster nodes usually have more CPU cores than GPUs. Intra-node communication between GPUs controlled by a single process makes use of the NVIDIA GPUDirect Peer to Peer technology, which allows direct transfers between memories of different devices. Intra-node and inter-node transfers between GPUs controlled by different processes are handled by the MPI implementation. If the MPI implementation uses the NVIDIA GPUDirect Remote Direct Memory Access (RDMA) technology, inter-node GPU-to-GPU transfers go directly from device memory to the interconnect, without host memory buffers.
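For illustration, the slab decomposition described above can be sketched as follows. This is a minimal example of the general pattern, not code taken from UWLCM; the grid size, the variable names and the one-GPU-per-process simplification are our assumptions.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv)
    {
      MPI_Init(&argc, &argv);

      int rank, n_procs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &n_procs);

      // hypothetical global number of cells along the x axis
      const int nx_global = 128;

      // equal slices along x, one slice per MPI process
      // (assuming nx_global is divisible by the number of processes)
      const int nx_local = nx_global / n_procs;
      const int x_begin  = rank * nx_local;     // first cell of this slice
      const int x_end    = x_begin + nx_local;  // one past the last cell

      // simple round-robin assignment of GPUs to processes,
      // so that processes sharing a node use different devices
      int n_gpus = 0;
      cudaGetDeviceCount(&n_gpus);
      if (n_gpus > 0) cudaSetDevice(rank % n_gpus);

      // Eulerian fields of cells [x_begin, x_end) would then be allocated in
      // host memory and advanced by CPU threads, while super-droplets of the
      // same cells would be allocated in the memory of the selected GPU

      MPI_Finalize();
      return 0;
    }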
Computations are divided between CPU threads of a process using domain decomposition of the process' subdomain, but along the y axis. Thanks to that, the maximum number of GPUs that can be used in a simulation is equal to the number of cells in the x direction, even though there are multiple CPU threads per GPU. MPI communications are done using two communicators, one for the Eulerian data and one for the Lagrangian data. Transfers of the Eulerian data are handled simultaneously by two threads, one for each boundary that is perpendicular to the x axis. This requires that the MPI implementation supports the MPI_THREAD_MULTIPLE thread level. Transfers of the Lagrangian data are handled by the thread that controls the GPU that is on the edge of the process' subdomain. Collective MPI communication is done only on the Eulerian variables and most of it is associated with solving the pressure problem.
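The communicator setup described above can be illustrated with the following sketch. It shows only the generic MPI calls involved (thread-level request and communicator duplication); the names are ours and the actual UWLCM code is organized differently.

    #include <mpi.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv)
    {
      // two CPU threads post Eulerian halo exchanges at the same time,
      // so full multithreaded MPI support is required
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_MULTIPLE)
      {
        std::fprintf(stderr, "MPI_THREAD_MULTIPLE not supported by this MPI\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
      }

      // separate communicators for the Eulerian (CPU) and Lagrangian (GPU)
      // data, so that the two kinds of messages cannot be mixed up
      MPI_Comm comm_euler, comm_lagr;
      MPI_Comm_dup(MPI_COMM_WORLD, &comm_euler);
      MPI_Comm_dup(MPI_COMM_WORLD, &comm_lagr);

      // ... Eulerian halo exchanges use comm_euler (one thread per x boundary),
      //     transfers of super-droplets leaving the slice use comm_lagr ...

      MPI_Comm_free(&comm_euler);
      MPI_Comm_free(&comm_lagr);
      MPI_Finalize();
      return 0;
    }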
It is possible to run simulations with microphysics, either Lagrangian particle-based or bulk, computed by CPUs. In the case of bulk microphysics, microphysical properties are represented by Eulerian fields that are divided between processes and threads in the same manner as described in the previous paragraph, i.e. like the Eulerian fields in UWLCM-SDM. In UWLCM-SDM with microphysics computed by CPUs, all microphysical calculations in the subdomain belonging to a given MPI process are divided amongst the process' threads by the NVIDIA Thrust library (Bell and Hoberock, 2012).
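The CPU/GPU portability mentioned above comes from the fact that Thrust algorithms can be compiled against different backends. The snippet below is a stripped-down illustration of this pattern, not an excerpt from libcloudph++; the toy kernel and the names are ours.

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // a toy stand-in for a microphysics kernel: grow every droplet radius by 1%
    struct grow
    {
      __host__ __device__
      float operator()(float r) const { return 1.01f * r; }
    };

    int main()
    {
      // with the CUDA backend this vector lives in GPU device memory and the
      // transform runs on the GPU; recompiling with
      // -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP keeps the data in host
      // memory and runs the same code on OpenMP CPU threads
      thrust::device_vector<float> radius(1000000, 1e-6f);

      thrust::transform(radius.begin(), radius.end(), radius.begin(), grow());
      return 0;
    }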
File output is done in parallel by all MPI processes using the parallel HDF5 C++ library (The HDF Group).

Simulation setup
Model performance is tested in simulations of a rising moist thermal (Grabowski et al., 2018). In this setup, an initial spherical perturbation is introduced into a neutrally stable atmosphere. Within the perturbation, water vapour content is increased to obtain RH = 100%. With time, the perturbation is lifted by buoyancy and water vapour condenses within it. We chose this setup because it features significant differences in buoyancy and cloud formation already at the start of a simulation. This puts the pressure solver and the microphysics model to the test without the need for a spin-up period.

Subgrid-scale diffusion of Eulerian fields is modeled with the Smagorinsky scheme. SGS motion of hydrometeors is modeled with the scheme described in Grabowski and Abade (2017). The model time step is 0.5 s. Substepping is done to achieve a time step of 0.1 s for condensation and coalescence. These are values typically used when modeling clouds with UWLCM. No output of model data is done.

Performance tests were run on three systems: Rysy, a02 and Prometheus. Hardware and software of these systems are given in table 1 and table 2, respectively. Rysy and a02 were used only in the single-node tests, while Prometheus was used both in single- and multi-node tests. Prometheus has 72 GPU nodes connected with Infiniband. We chose the MVAPICH2 2.3.1 MPI implementation on Prometheus, because it supports the MPI_THREAD_MULTIPLE thread level, is CUDA-aware and is free to use. Another implementation that meets these criteria is OpenMPI, but it was found to give greater simulation wall time in scaling tests of libmpdata++ (appendix B). NVIDIA GPUDirect RDMA was not used by the MPI implementation, because it is not supported by MVAPICH2 for the type of interconnect used on Prometheus. MVAPICH2 does not allow more than one GPU per process. Therefore multi-node tests were done with 2 processes per node, each process controlling 1 GPU and 12 CPU threads.

Performance metrics
The wall time taken to complete one model time step, t_tot, is divided into three parts, t_tot = t_CPU + t_GPU + t_CPU&GPU, where: t_CPU is the time when the CPU is performing work and the GPU is idle, t_GPU is the time when the GPU is performing work and the CPU is idle, and t_CPU&GPU is the time when the CPU and the GPU are performing work simultaneously.
The total time of CPU (GPU) computations is t_tot_CPU = t_CPU + t_CPU&GPU (t_tot_GPU = t_GPU + t_CPU&GPU). The degree to which CPU and GPU computations are parallelized is measured with t_CPU&GPU / t_tot. Timings t_CPU, t_GPU and t_CPU&GPU are obtained using a built-in timing functionality of UWLCM that is enabled at compile time by setting the UWLCM_TIMING CMake variable. The timing functionality does not have any noticeable effect on simulation wall time. The timer for GPU computations is started by a CPU thread just before a task is submitted to the GPU, and is stopped by a CPU thread when the GPU task returns. Therefore GPU timing in t_CPU&GPU and in t_GPU includes the time it takes to dispatch (and to return from) the GPU task.
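A minimal sketch of how such timers can be realized is given below. It is our own illustration of the measurement logic (a host-side clock around an asynchronously dispatched GPU task, with the overlap obtained from the recorded time stamps), not the actual UWLCM timing code; the GPU and CPU routines are replaced by sleeps.

    #include <chrono>
    #include <future>
    #include <thread>
    #include <algorithm>
    #include <cstdio>

    using clk = std::chrono::steady_clock;

    // stand-ins for the actual model routines (here they only sleep)
    void gpu_microphysics() { std::this_thread::sleep_for(std::chrono::milliseconds(120)); }
    void cpu_dynamics()     { std::this_thread::sleep_for(std::chrono::milliseconds(100)); }

    int main()
    {
      // the GPU task is dispatched asynchronously and timed from the host side,
      // so the time to dispatch and to return from the task is included
      const auto gpu_begin = clk::now();
      auto gpu_task = std::async(std::launch::async, []{
        gpu_microphysics();
        return clk::now();               // time stamp when the GPU task returns
      });

      // Eulerian (CPU) computations proceed while the GPU task is running
      const auto cpu_begin = clk::now();
      cpu_dynamics();
      const auto cpu_end = clk::now();
      const auto gpu_end = gpu_task.get();

      // t_tot_CPU = t_CPU + t_CPU&GPU, t_tot_GPU = t_GPU + t_CPU&GPU,
      // t_CPU&GPU is the overlap of the two intervals
      const std::chrono::duration<double>
        t_tot_cpu = cpu_end - cpu_begin,
        t_tot_gpu = gpu_end - gpu_begin,
        t_overlap = std::min(cpu_end, gpu_end) - std::max(cpu_begin, gpu_begin);

      std::printf("t_tot_CPU = %g s, t_tot_GPU = %g s, t_CPU&GPU = %g s\n",
                  t_tot_cpu.count(), t_tot_gpu.count(), t_overlap.count());
      return 0;
    }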

Single-node performance
In this section we present tests of the computational performance of UWLCM-SDM run on a single-node system.
The goal is to determine how parallelization of CPU and GPU computations can be maximized. We also estimate the speedup achieved thanks to the use of GPUs. No MPI communications are done in these tests.

The size of the Eulerian computational grid is 128x128x128 cells. In the super-droplet method, the quality of the microphysics solution depends on the number of super-droplets used. We denote the initial number of super-droplets per cell by N_SD. We perform tests for different values of N_SD. The maximum possible value of N_SD depends on the amount of available GPU device memory. When run on a single node, all Eulerian fields are stored in shared memory. The measured GPU time t_tot_GPU increases linearly with N_SD. For N_SD = 3, CPU computations take longer than GPU computations (t_tot_CPU > t_tot_GPU) and almost all GPU computations are done in parallel with CPU computations (t_GPU ≈ 0). As N_SD is increased, we observe that both t_tot and t_CPU&GPU increase, with t_CPU&GPU increasing faster than t_tot, and that t_CPU decreases. This trend continues up to some value of N_SD for which t_tot_CPU ≈ t_tot_GPU. Parallelization of CPU and GPU computations (t_CPU&GPU / t_tot) is highest for this value of N_SD. If N_SD is increased above this value, GPU computations take longer than CPU computations, t_tot increases linearly and the parallelization of CPU and GPU computations decreases. The threshold value of N_SD depends on the system; it is 10 on Prometheus, 32 on a02 and 64 on Rysy. This difference comes from differences in the relative CPU to GPU computational power between these systems. In LES, N_SD is usually between 30 and 100 (Shima et al., 2009). The test shows that high parallelization of CPU and GPU computations, with t_CPU&GPU / t_tot up to 80%, can be obtained in typical cloud simulations.
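To give a feel for the memory constraint mentioned above, a back-of-the-envelope estimate of the maximum N_SD can be written as

    N_SD_max ≈ M_GPU / (n_cells * m_SD) = 32 GB / (128^3 * 100 B) ≈ 150,

where M_GPU is the device memory available to a GPU, n_cells is the number of grid cells handled by that GPU and m_SD is the memory needed per super-droplet. The values M_GPU = 32 GB, n_cells = 128^3 and m_SD = 100 B are hypothetical round numbers used only for illustration, not measured values for libcloudph++ or for the systems tested here.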
In UWLCM-SDM, microphysical computations can also be done by the CPU. From the user perspective, all that needs to be done is to specify --backend=OpenMP at runtime. To investigate how much speedup is achieved thanks to the use of GPUs, we compare wall times of CPU-only simulations (with microphysics computed by the CPU) and of CPU+GPU simulations (with microphysics computed by the GPU). The estimated energy cost per time step when using only CPUs (CPUs plus GPUs) for computations is also compared in fig. 4. We find that simulations that use both CPUs and GPUs take between 10 and 130 times less time and use between 10 and 60 times less energy than simulations that use only CPUs. Speedup and energy savings increase with N_SD and depend on the number and type of CPUs and GPUs. It is important to note that microphysics computations in UWLCM-SDM are dispatched to the CPU or the GPU by the NVIDIA Thrust library. It is reasonable to expect that the library is better optimized for GPUs, because it is developed by the producer of the GPUs. We conclude that GPUs provide substantial benefits in equipment cost and power usage.

Multi-node performance is tested in four scaling scenarios:

- strong scaling - more nodes are used in order to decrease the time it takes to complete the simulation.
- SD scaling - more nodes are used to increase the total GPU device memory, allowing for more SD to be modeled, while the grid size remains the same. This results in weak scaling of the GPU workload and strong scaling of the CPU workload. This test is applicable only to UWLCM-SDM.

- 2D grid scaling - as more nodes are used, the number of grid cells in the horizontal directions is increased, while the number of cells in the vertical is constant. In UWLCM-SDM, the number of SD per cell is constant. Therefore, as more cells are added, the total number of SD in the domain increases. This results in weak scaling of both CPU and GPU workloads. This test represents two use cases: domain size increase and horizontal resolution refinement. Typically in cloud modeling, the domain size is increased only in the horizontal, because clouds form only up to a certain altitude.

- 3D grid scaling - similar to 2D grid scaling, but more cells are used in each dimension. This would typically be used to increase resolution of a simulation.
Details of the simulation setup for each case are given in table 3. In each case, the maximum number of super-droplets that fits in the GPU device memory is used in UWLCM-SDM. The only exception is the strong scaling test, in which, as more nodes are added, the number of SD per GPU decreases. Note how SD scaling is similar to strong scaling, but with more SDs added as more GPUs are added.

Also note that the 2D grid scaling and 3D grid scaling tests are similar, but differ in the sizes of the distributed memory data transfers.
UWLCM-SDM simulation time versus the number of nodes used is plotted in fig. 5. First, we discuss the strong scaling scenario. We find that most of the time is spent on CPU-only computations. This is because in this scenario N_SD = 3, below the threshold value of N_SD = 10 determined by the single-node tests (section 4.4). As more nodes are added, t_CPU&GPU and t_tot decrease. The ratio of these two values, which describes the amount of parallelization of CPU and GPU computations, is low (30%) in a single-node run and decreases further, albeit slowly, as more nodes are used.
Better parallelization of CPU and GPU computations is seen in the SD scaling scenario. In this scenario, the CPU workload scales the same as in the strong scaling scenario, but the workload per GPU remains constant. The largest value of t_CPU&GPU / t_tot, approximately 80%, is found for N_SD = 10. The same value of N_SD was found to give the highest t_CPU&GPU / t_tot in the single-node tests (section 4.4). We observe that t_tot_GPU is approximately constant. Given the weak scaling of the GPU workload in this scenario, we conclude that the cost of GPU-GPU MPI communications is small. The small cost of GPU-GPU MPI communications, together with the fact that for N_SD > 10 the total time of computations is dominated by GPU computations, gives very high scaling efficiencies, around 100%.

The 2D and 3D grid scaling scenarios are weak scaling tests which differ in the way the size of CPU MPI communications scales. We find that the wall time per time step t_tot scales very well (scaling efficiency exceeds 95%) and that t_tot is dominated by t_tot_GPU (t_tot_GPU > t_tot_CPU). The latter observation is consistent with the fact that the number of super-droplets (N_SD = 100) is larger than the threshold, N_SD = 10, determined in the single-node tests. As in the SD scaling test, the approximately constant t_tot_GPU indicates a low cost of MPI communications between GPUs. Contrary to t_tot_GPU, t_tot_CPU clearly increases with the number of nodes. This shows that the cost of CPU-CPU MPI communications is non-negligible. The majority of this cost is associated with the collective communications necessary to solve the pressure perturbation equation. The increase of t_tot_CPU does not cause an increase of t_tot, because the additional CPU computations are done simultaneously with GPU computations and t_tot_GPU > t_tot_CPU in the studied range of the number of nodes. It is reasonable to expect that t_tot_GPU scales better than t_tot_CPU also for more nodes than used in this study. In that case, there should be some optimal number of nodes for which t_tot_GPU ≈ t_tot_CPU. For this optimal number of nodes, both the scaling efficiency and the parallelization of CPU and GPU computations are expected to be high.

Single-node performance tests were done on three different systems, each equipped with multiple GPUs. The percentage of time during which CPUs and GPUs compute simultaneously depends on the ratio of CPU to GPU workloads. The GPU workload depends on the number of Lagrangian computational particles. For an optimal workload ratio, parallel CPU and GPU computations can take more than 80% of wall time. This optimal workload ratio depends on the relative computational power of CPUs and GPUs. On all systems tested, the workload ratio was optimal for between 10 and 64 Lagrangian particles per Eulerian cell. If only CPUs are used for computations, simulations take up to 120 times longer and consume up to 60 times more energy than simulations that use both CPUs and GPUs. We conclude that GPU accelerators enable running useful scientific simulations on single-node systems at a decreased energy cost.
Computational performance of the model on a distributed memory system was tested on the Prometheus cluster. We found that the cost of communication between nodes slows down computations related to the Eulerian part of the model by a much higher factor than computations related to the Lagrangian part of the model. For example, in a weak scaling scenario (3D grid scaling), t_tot_CPU is approximately three times larger on 27 nodes than on one node, while t_tot_GPU is increased by only around 7% (fig. 5). The reason why Eulerian computations scale worse than Lagrangian computations is that solving the pressure perturbation, which is done by the Eulerian component, requires collective communications, while the Lagrangian component requires peer-to-peer communications only. In single-node simulations on Prometheus, an optimal ratio of CPU to GPU workloads is seen for 10 Lagrangian particles per Eulerian cell. In the literature, the number of Lagrangian particles per Eulerian cell is typically higher, between 30 and 100 (Shima et al., 2009; Dziekan et al., 2019, 2021b). When such a higher number of Lagrangian particles is used in single-node simulations on Prometheus, most of the time is spent on Lagrangian computations. However, in multi-node runs, Eulerian computation time scales worse than Lagrangian computation time. Since Eulerian and Lagrangian computations are done simultaneously, there is an optimal number of nodes for which the time during which CPUs and GPUs work in parallel is maximized and the scaling efficiency is high. In scenarios in which GPU computations take most of the time, scaling efficiency exceeds 95% for up to 40 nodes, with 960 CPU threads and 80 GPUs. The fraction of time during which CPU and GPU work in parallel is between 20% and 50% for the largest number of nodes used. In weak scaling scenarios, the fraction of time during which both processing units work could be increased by using more nodes, but this was not possible due to the limited size of the cluster. Single-node simulations with Lagrangian microphysics computed by GPUs are around eight times slower than simulations with bulk microphysics computed by CPUs. However, the difference decreases with the number of nodes. For 36 nodes, simulations with Lagrangian microphysics are five times slower, and the difference should be further reduced if more nodes were used.
Our approach of using CPUs for Eulerian calculations and GPUs for Lagrangian calculations results in CPUs and GPUs computing simultaneously for the majority of the time step and gives good scaling on multi-node systems with several dozen nodes. The same approach can be used in other numerical models with Lagrangian particles embedded in an Eulerian flow.

Appendix A: Software
UWLCM is written in C++14. It makes extensive use of two C++ libraries that are also developed at the Faculty of Physics of the University of Warsaw: libmpdata++ (Waruszewski et al., 2018) and libcloudph++ (Jaruga and Pawlowska, 2018). Libmpdata++ is a collection of solvers for generalised transport equations that use the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA). Libcloudph++ is a collection of cloud microphysics schemes. In libcloudph++, the particle-based microphysics algorithm is implemented using the NVIDIA Thrust library. Thanks to that, the code can be run on GPUs as well as on CPUs. It is possible to use multiple GPUs on a single machine, without MPI. Then, each GPU is controlled by a separate thread and communications between GPUs are done with asynchronous cudaMemcpy. Libmpdata++ uses multidimensional array containers from the blitz++ library (Veldhuizen, 1995). Threading can be done either with OpenMP, Boost.Thread or std::thread. In UWLCM we use OpenMP threading, as it was found to be the most efficient. Output in UWLCM is done using the HDF5 output interface that is a part of libmpdata++. It is based on the thread-safe version of the C++ HDF5 library. UWLCM, libcloudph++ and libmpdata++ make use of various components of the Boost C++ library (Koranne, 2011). In order to have parallel CPU and GPU computations in UWLCM, functions from libcloudph++ that launch computations on GPUs are called asynchronously.

Appendix B: Scaling tests of libmpdata++

Here, we present strong scaling tests of standalone libmpdata++. The tests are done using a dry planetary boundary layer setup, which is a part of the libmpdata++ test suite. Grid size is 432x432x51. Tests were done on the Prometheus cluster. Note that all libmpdata++ calculations are done on CPUs. Two implementations of MPI are tested, OpenMPI v4.1.0 and MVAPICH2 v2.3.1. Note that Prometheus has 2 GPUs per node, but MVAPICH2 does not support more than 1 GPU per process, so 2 processes per node would need to be run in UWLCM-SDM. OpenMPI does not have this limitation.

For this reason, in the libmpdata++ scalability tests we consider two scenarios, one with two processes per node and the other with one process per node. In the case with two processes per node, each process controls half of the available threads. Test results are shown in fig. A1. In general, better performance is seen with MVAPICH2 than with OpenMPI. Running two processes per node improves performance with MVAPICH2, but decreases performance with OpenMPI. In the best case, scaling efficiency exceeds 80% for up to 500 threads.

Results of LES of a rising thermal done on the Prometheus cluster for different scaling scenarios. The time assuming perfect scaling is t_1/N_nodes for strong scaling, max(t_1/N_nodes, t_2GPU) for SD scaling (b), and t_1 for 2D and 3D grid scaling (a).
(a) Assuming that grid scaling is used to refine the resolution, as done in this paper. If it is done to increase the domain, the data transfer size per GPU scales as the one per CPU.
(b) GPU time from the two-node simulation is taken as reference, because it is ca. 15% lower than on a single node. A plausible explanation for this is that, although the number of SD per GPU does not depend on the number of nodes, GPUs also store information about conditions in grid cells, and the number of grid cells per GPU decreases as more nodes are used. For more than 2 nodes, GPU calculation time is approximately the same as for 2 nodes.