An MPI-CUDA implementation of an improved Roe method for two-layer shallow water systems
Highlights
- Acceleration of two-layer shallow water system simulations on GPU clusters.
- Improvement of a Roe-type finite volume scheme, suitable for CUDA-enabled GPUs.
- Overlapping communication and computation with MPI and CUDA.
- Management of unstructured meshes on a distributed multi-GPU platform.
- Weak and strong scaling close to perfect.
Introduction
The two-layer shallow water system has been used as a numerical model to simulate several phenomena related to stratified geophysical flows, such as ocean currents, oil spills or tsunamis generated by underwater landslides. Simulating these phenomena requires long-running computations over large domains, so implementing efficient numerical schemes on parallel platforms is a natural way to achieve the performance required by realistic applications.
A cost-effective way of substantially improving performance in these applications is the use of Graphics Processing Units (GPUs). These platforms make it possible to achieve speedups of an order of magnitude over a standard CPU in many applications and are growing in popularity [32], [29]. Moreover, programming toolkits such as CUDA [28] have been developed to facilitate the programming of GPUs for general-purpose applications.
There are previous works that port finite volume one-layer shallow water solvers to a GPU using a graphics-specific programming language [18], [24], [25], but currently most proposals to simulate shallow flows on a single GPU are based on the CUDA programming model. A CUDA solver for the one-layer system, based on the first order finite volume scheme presented in [5], is described in [9] for structured regular meshes. The extension of this CUDA solver to the two-layer shallow water system is presented in [10]. There are also proposals to implement, on CUDA-enabled GPUs, high order schemes to simulate one-layer systems [7], [3], [15] and first order schemes for one- and two-layer systems on triangular meshes [6].
Although the use of single GPU systems makes it possible to satisfy the performance requirements of several applications, many applications require huge meshes, large numbers of time steps and even real time accurate predictions (for instance, to approximate the effect of an unexpected oil spill). These characteristics suggest combining the power of multiple GPUs.
One approach to using several GPUs is based on programming shared memory multi-GPU desktop systems. These platforms have been used in fluid dynamics [35] and shallow water [16], [33] simulations by combining shared memory programming primitives to manage threads on the CPU with CUDA to program the GPU. However, this cost-effective approach only offers a limited number of GPUs (typically 2–8), and more flexible systems are desirable.
A more flexible approach involves the use of clusters of GPU-enhanced computers, where each node is equipped with a single GPU or with a multi-GPU system. Computation on GPU clusters makes it possible to scale the runtime reduction with the number of GPUs, which can easily be increased. This approach is therefore more flexible than a multi-GPU desktop system, and the memory limitations of a single GPU-enhanced node can be overcome by suitably distributing the data among the nodes, enabling the simulation of significantly larger realistic models with greater precision. The use of GPU clusters to accelerate intensive computations is gaining in popularity [23], [22], [11], [36], [1], [21]. In [2], a one-layer shallow water solver is implemented on a GPU cluster for tsunami simulation. Most proposals to exploit GPU clusters use MPI [26] to implement interprocess communication and CUDA [28] to program each GPU. A common way to reduce the remote communication overhead in these systems consists of using non-blocking MPI communication functions to overlap the data transfers between nodes with GPU computation and CPU–GPU data transfers.
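The overlap strategy just described can be sketched in plain Python, with a background thread standing in for the non-blocking MPI transfer of ghost (halo) data and simple list arithmetic standing in for the GPU kernels. All names and values here are illustrative, not the paper's actual implementation:

```python
import threading

def exchange_halo(local, neighbour, halo):
    # Stands in for an MPI_Isend/MPI_Irecv pair: copy the neighbour's
    # boundary value into our halo slot.
    halo['left'] = neighbour[-1]

def step_with_overlap(local, neighbour):
    halo = {}
    # 1) Start the "transfer" in the background (non-blocking send/receive).
    t = threading.Thread(target=exchange_halo, args=(local, neighbour, halo))
    t.start()
    # 2) Meanwhile, do interior work that needs no remote data
    #    (the kernel launched on interior edges).
    interior = [2 * x for x in local[1:]]
    # 3) Wait for the transfer (MPI_Waitall), then process the edge
    #    that needs neighbour data.
    t.join()
    boundary = 2 * (local[0] + halo['left'])
    return [boundary] + interior
```

The key property is that step 2 runs while the transfer of step 1 is in flight, so communication cost is hidden whenever the interior work is large enough.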
In this work, an implementation of an improved finite volume scheme is developed for a GPU cluster by using MPI and CUDA. This implementation incorporates an efficient management of the distributed unstructured mesh and mechanisms to overlap computation with communication.
The outline of the article is as follows: the next section describes the underlying mathematical model and presents an improvement of a first order Roe-type finite volume scheme, called the IR-Roe scheme. Section 3 describes a data parallel version of the IR-Roe scheme. Several implementations of the IR-Roe scheme and the classical Roe scheme [5], [4] are compared in Section 4. The next two sections describe a single-GPU and a multi-GPU distributed implementation, respectively, of the method for triangular meshes. Section 7 analyses the experimental results obtained when the implementations are applied to solve an internal dam break problem on a cluster of 4 GPUs. Finally, Section 8 summarizes the conclusions and outlines future work.
Section snippets
The two-layer shallow water system
Let us consider the system of equations governing the flow of two superposed immiscible layers of shallow fluids in a subdomain, where index 1 in the
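The section preview above is truncated and its equations were lost in extraction. For reference, the one-dimensional form of the two-layer shallow water system commonly used with Roe-type solvers (cf. [5]) reads, with $h_k$ the layer thicknesses, $q_k$ the discharges, $H$ the depth function and $r=\rho_1/\rho_2$ the density ratio (this is the standard formulation, not necessarily the exact two-dimensional form used in the paper):

```latex
\begin{aligned}
&\frac{\partial h_1}{\partial t} + \frac{\partial q_1}{\partial x} = 0,\\[2pt]
&\frac{\partial q_1}{\partial t}
  + \frac{\partial}{\partial x}\Bigl(\frac{q_1^2}{h_1} + \frac{g}{2}h_1^2\Bigr)
  = -g\,h_1 \frac{\partial h_2}{\partial x} + g\,h_1 \frac{\partial H}{\partial x},\\[2pt]
&\frac{\partial h_2}{\partial t} + \frac{\partial q_2}{\partial x} = 0,\\[2pt]
&\frac{\partial q_2}{\partial t}
  + \frac{\partial}{\partial x}\Bigl(\frac{q_2^2}{h_2} + \frac{g}{2}h_2^2\Bigr)
  = -r\,g\,h_2 \frac{\partial h_1}{\partial x} + g\,h_2 \frac{\partial H}{\partial x}.
\end{aligned}
```

The nonconservative products on the right-hand side couple the two layers and are the reason Roe-type path-conservative schemes are used for this system.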
Parallelization of the scheme
Fig. 1(a) shows a graphical description of the parallel algorithm, obtained from the description of the IR-Roe numerical scheme given in Section 2. The main calculation phases are identified with circled numbers, and the main sources of data parallelism are represented with overlapping rectangles.
Initially, the finite volume mesh is built. Then the time stepping process is repeated until the final simulation time is reached. At the (n+1)-th time step, Eq. (4) must be evaluated to update the
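The two data-parallel phases of each time step, an edge loop computing numerical fluxes and a volume loop applying the conservative update under a CFL-limited time step, can be illustrated with a minimal 1D finite volume stepper. The physics here (linear advection with an upwind flux) is only a stand-in for the IR-Roe flux; the loop structure is what matters:

```python
def fv_step(u, a, dx, cfl=0.9):
    """One finite volume step on a periodic 1D mesh (upwind flux, a > 0)."""
    n = len(u)
    # Edge phase: one numerical flux per edge, all independent
    # (this is the edge-parallel kernel in the paper's scheme).
    flux = [a * u[(i - 1) % n] for i in range(n + 1)]
    # CFL condition: global time step bound (a reduction in the paper).
    dt = cfl * dx / abs(a)
    # Volume phase: each cell is updated from its two adjacent edge
    # fluxes, again fully independent work.
    return [u[i] - dt / dx * (flux[i + 1] - flux[i]) for i in range(n)], dt
```

Because the update is in conservation form, total mass is preserved exactly, which is a useful sanity check for any implementation of this phase structure.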
Roe schemes comparison
In this section we will compare the efficiency of several implementations of the IR-Roe method and the classical Roe scheme introduced in [5].
We consider an internal circular dam break problem in a rectangular domain, with a prescribed depth function and initial condition. The numerical scheme is run on several triangular meshes (see Table 1). The simulation time interval is
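An internal circular dam break of this kind can be set up as follows: inside a circle the interface between the two layers is displaced, while outside it is flat, and both layers start at rest. The function below is a hypothetical sketch of such an initial condition; all radii, centres and thicknesses are illustrative values, not those used in the paper:

```python
import math

def initial_condition(x, y, R=0.5, cx=0.0, cy=0.0,
                      h_total=2.0, h1_in=1.5, h1_out=0.5):
    """Internal circular dam break: perturbed interface, fluid at rest.

    All parameter values are illustrative (hypothetical), not the
    paper's. Returns (h1, h2, q1, q2) at the point (x, y).
    """
    inside = math.hypot(x - cx, y - cy) < R
    h1 = h1_in if inside else h1_out   # upper-layer thickness
    h2 = h_total - h1                  # lower layer fills the rest
    return h1, h2, 0.0, 0.0            # discharges start at zero
```

Keeping `h1 + h2` constant means the free surface is initially level and only the internal interface is perturbed, which is what makes the test an *internal* dam break.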
CUDA implementation of the IR-Roe method
The CUDA implementation of the algorithm described in Section 3 is a variant of the implementation described in [6, Section 7.3]. The general steps of the implementation are depicted in Fig. 1(b). Each step executed on the GPU is assigned to a CUDA kernel and corresponds to a calculation phase described in Section 3. Next, we briefly describe each step:
- Build data structure: Volume data is stored in two arrays of float4 elements, accessed as 1D textures, where each element contains the data (state, depth
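The idea behind the float4 layout is that each volume contributes one contiguous 4-float record, so a single texture fetch (or coalesced load) returns a whole record. This can be emulated in Python with a flat 32-bit float array; the field names below (h1, discharge components, depth H) are illustrative, not the paper's exact packing:

```python
from array import array

def pack_volumes(volumes):
    """Pack per-volume 4-tuples into one flat 32-bit float array,
    mimicking an array of float4 elements bound to a 1D texture."""
    data = array('f')  # 'f' = 32-bit floats, as on the GPU
    for h1, q1x, q1y, H in volumes:
        data.extend((h1, q1x, q1y, H))
    return data

def fetch(data, i):
    """Emulate a 1D texture fetch of element i: one float4 record."""
    return tuple(data[4 * i:4 * i + 4])
```

On the GPU, this layout lets a kernel read all four fields of a volume in a single memory transaction instead of gathering them from four separate arrays.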
Implementation on a GPU cluster
In this section a multi-GPU extension of the CUDA implementation detailed in Section 5 is proposed. Basically, the triangular mesh is divided into several submeshes and each submesh is assigned to a CPU process, which, in turn, uses a GPU to perform the computations related to its submesh. We use MPI [26] for the communication between processes. Next we describe how the data of a particular submesh is created and stored in GPU memory.
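The submesh decomposition can be illustrated in miniature on a periodic 1D mesh: each submesh stores a one-cell ghost copy of its neighbour's boundary volume, and after the ghost exchange the submeshes can be updated independently while still reproducing the serial result exactly. The three-point averaging stencil below is only a stand-in for the flux update:

```python
def serial_step(u):
    """Reference update on the whole periodic mesh (toy 3-point stencil)."""
    n = len(u)
    return [(u[(i - 1) % n] + u[i] + u[(i + 1) % n]) / 3 for i in range(n)]

def partitioned_step(u, cut):
    """Split into two submeshes with ghost cells and update each one
    independently. Requires 0 < cut < len(u)."""
    left, right = u[:cut], u[cut:]
    # Ghost exchange: in the paper this is the MPI transfer of ghost
    # volumes between the processes owning adjacent submeshes.
    lg = [u[-1]] + left + [right[0]]
    rg = [left[-1]] + right + [u[0]]
    def step(g):
        # Update only the owned cells; ghosts are read-only.
        return [(g[i - 1] + g[i] + g[i + 1]) / 3 for i in range(1, len(g) - 1)]
    return step(lg) + step(rg)
```

With unstructured triangular meshes the ghost layer is a set of volumes adjacent to the partition boundary rather than a single cell, but the invariant is the same: after each exchange, the partitioned update must match the serial one.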
Experimental results
In this section we test the single-GPU and multi-GPU implementations described in Sections 5 and 6, respectively. The test problem and parameters are the same as those used in Section 4.
We have used the Chaco software [20] to divide a mesh into equally sized submeshes, the Open MPI implementation of MPI [14] and the GNU compiler. All programs were executed on a cluster of four Intel Xeon servers with 8 GB RAM each
Conclusions and future work
In this paper we have presented an improvement of a first order well-balanced Roe-type finite volume solver for two-layer shallow water systems. This numerical scheme has proved to be computationally more efficient than the classical Roe scheme and better suited to implementation on modern CUDA-enabled GPUs. A multi-GPU distributed implementation of this scheme, which works on triangular meshes and overlaps remote communications with GPU computation, has been
Acknowledgments
The authors acknowledge partial support from the DGI-MEC projects MTM2008-06349-C03-03, MTM2009-11923 and MTM2009-07719.
Marc de la Asunción received his B.S. degree in computer science in 2003, and two M.S. degrees in 2007 and 2009, respectively, from the University of Granada, Spain. He is currently working towards his Ph.D. degree in shallow water simulations on graphics processing units. His research interests include parallel computing and general-purpose computing on graphics processing units (GPGPU).
References (36)

- et al., A parallel 2D finite volume scheme for solving systems of balance laws with nonconservative products: application to shallow flows, Computer Methods in Applied Mechanics and Engineering (2006)
- et al., GPU computing for shallow water flow simulation based on finite volume schemes, High Performance Computing, Comptes Rendus Mécanique (2011)
- et al., A consistent intermediate wave speed for a well-balanced HLLC solver, Comptes Rendus Mathematique (2008)
- et al., A simulation suite for Lattice-Boltzmann based real-time CFD applications exploiting multi-level parallelism on modern multi- and many-core architectures, Journal of Computational Science (2011)
- et al., Visual simulation of shallow-water waves, Programmable Graphics Hardware, Simulation Modelling Practice and Theory (2005)
- Fluid–solid coupling on a cluster of GPU graphics cards for seismic wave propagation, High Performance Computing, Comptes Rendus Mécanique (2011)
- et al., High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster, Journal of Computational Physics (2010)
- et al., Simulation of shallow-water systems using graphics processing units, Mathematics and Computers in Simulation (2009)
- et al., Data-intensive document clustering on graphics processing unit (GPU) clusters, Journal of Parallel and Distributed Computing (2011)
- R. Abdelkhalek, H. Calendra, O. Coulaud, J. Roman, G. Latu, Fast seismic modeling and reverse time migration on a GPU...
- Real-time tsunami simulation on multi-node GPU cluster, The Journal of Supercomputing
- Simulation and visualization of the Saint-Venant system using GPUs, Computing and Visualization in Science
- High order extension of Roe schemes for two dimensional nonconservative hyperbolic systems, Journal of Scientific Computing
- GPU computing for shallow water flow simulation based on finite volume schemes, Boletín de la Sociedad Española de Matemática Aplicada
- Using OpenMP: Portable Shared Memory, International Journal of Parallel Programming
- Simulation of one-layer shallow water systems on multicore and CUDA architectures, The Journal of Supercomputing
- Programming CUDA-based GPUs to simulate two-layer shallow water flows
- GPU cluster for high performance computing
José M. Mantas is a Senior Lecturer in Computer Science at the Software Engineering Department of the University of Granada (Spain) where he teaches several subjects (undergraduate and postgraduate) related to parallel programming. He has participated in ten research projects about parallel processing, distributed systems and scientific computing. He has published numerous papers in international journals and conference proceedings in the areas of high performance computing and scientific computing.
Manuel J. Castro is a researcher in the field of Numerical Analysis of Partial Differential Equations. His contributions to this field and, in particular, to the numerical analysis of nonconservative hyperbolic systems (with special emphasis on shallow water systems) are showing an increasing impact within the community of applied mathematicians. He won the ECCOMAS J.L. Lions Award for Young Scientists in Computational Mathematics (2008). He has co-directed three Ph.D. theses as well as several Master's theses and student projects. His research interests are: Numerical Analysis of balance laws and nonconservative hyperbolic systems; Geophysical flows, ….
E.D. Fernández-Nieto is Assistant Professor at the University of Seville, Spain, in the Department of Applied Mathematics I. In 2003, the city of Seville recognized his Ph.D. thesis as the best of the University of Seville in its scientific area. He received the Young Researcher Award 2009 from the Spanish Society of Applied Mathematics (SeMA). He has co-directed two Ph.D. theses, with professors M.J. Castro Díaz (University of Málaga) and D. Bresch (U. Chambéry, France), respectively. He has participated in five research projects. To date he has 24 articles in international journals with an impact factor in the JCR.