An MPI-CUDA implementation of an improved Roe method for two-layer shallow water systems

https://doi.org/10.1016/j.jpdc.2011.07.012

Abstract

The numerical solution of two-layer shallow water systems is required to accurately simulate stratified fluids, which are ubiquitous in nature: they appear in atmospheric flows, ocean currents, oil spills, etc. Moreover, solving these models numerically in realistic scenarios imposes huge demands on computing power. In this paper, we tackle the acceleration of these simulations on triangular meshes by exploiting the combined power of several CUDA-enabled GPUs in a GPU cluster. For that purpose, an improvement of a path-conservative Roe-type finite volume scheme which is especially suitable for GPU implementation is presented, and a distributed implementation of this scheme which uses CUDA and MPI to exploit the potential of a GPU cluster is developed. This implementation overlaps MPI communication with CPU–GPU memory transfers and GPU computation to increase efficiency. Several numerical experiments, performed on a cluster of modern CUDA-enabled GPUs, show the efficiency of the distributed solver.

Highlights

► Acceleration of two-layer shallow water system simulations on GPU clusters.
► Improvement of a Roe-type finite volume scheme, suitable for CUDA-enabled GPUs.
► Overlapping communication and computation with MPI and CUDA.
► Management of unstructured meshes on a distributed multi-GPU platform.
► Weak and strong scaling close to perfect.

Introduction

The two-layer shallow water system has been used as the numerical model to simulate several phenomena related to stratified geophysical flows, such as ocean currents, oil spills or tsunamis generated by underwater landslides. Simulating these phenomena requires long-lasting simulations over large domains, and implementing efficient numerical schemes to solve these problems on parallel platforms seems a suitable way of achieving the required performance in realistic applications.

A cost-effective way of substantially improving the performance of these applications is the use of Graphics Processing Units (GPUs). These platforms make it possible to achieve speedups of an order of magnitude over a standard CPU in many applications and are growing in popularity [32], [29]. Moreover, several programming toolkits such as CUDA [28] have been developed to facilitate the programming of GPUs for general purpose applications.

There are previous works that port finite volume one-layer shallow water solvers to a GPU by using a graphics-specific programming language [18], [24], [25], but currently most of the proposals to simulate shallow flows on a single GPU are based on the CUDA programming model. A CUDA solver for the one-layer system based on the first order finite volume scheme presented in [5] is described in [9] to deal with structured regular meshes. The extension of this CUDA solver to the two-layer shallow water system is presented in [10]. There also exist proposals to implement, on CUDA-enabled GPUs, high order schemes to simulate one-layer systems [7], [3], [15] and to implement first-order schemes for one and two-layer systems on triangular meshes [6].

Although the use of single GPU systems makes it possible to satisfy the performance requirements of several applications, many applications require huge meshes, large numbers of time steps and even real time accurate predictions (for instance, to approximate the effect of an unexpected oil spill). These characteristics suggest combining the power of multiple GPUs.

One approach to using several GPUs is based on programming shared memory multi-GPU desktop systems. These platforms have been used in fluid dynamics [35] and shallow water [16], [33] simulations by combining shared memory programming primitives to manage threads on the CPU with CUDA to program the GPU. However, this cost-effective approach only offers a reduced number of GPUs (2–8), and more flexible systems are desirable.

A more flexible approach involves the use of clusters of GPU-enhanced computers where each node is equipped with a single GPU or with a multi-GPU system. Computation on GPU clusters makes it possible to scale the runtime reduction according to the number of GPUs (which can be easily increased). Thus, this approach is more flexible than using a multi-GPU desktop system, and the memory limitations of a GPU-enhanced node can be overcome by suitably distributing the data among the nodes, enabling us to simulate significantly larger realistic models with greater precision. The use of GPU clusters to accelerate intensive computations is gaining in popularity [23], [22], [11], [36], [1], [21]. In [2], a one-layer shallow water solver is implemented on a GPU cluster for tsunami simulation. Most of the proposals to exploit GPU clusters use MPI [26] to implement interprocess communication and CUDA [28] to program each GPU. A common way to reduce the remote communication overhead in these systems consists of using non-blocking MPI communication functions to overlap the data transfers between nodes with GPU computation and CPU–GPU data transfers.
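The overlap pattern is easy to state in code. The following sketch shows only the ordering, in plain Python: boundary cells are processed first so their data can travel while the interior is computed. The callbacks `exchange_start` and `exchange_wait` are stand-ins for the non-blocking MPI calls (`MPI_Isend`/`MPI_Irecv` and `MPI_Waitall`); they and the +1 "computation" are illustrative, not the paper's actual kernels.

```python
import numpy as np

def time_step_with_overlap(state, exchange_start, exchange_wait, log):
    """One time step following the overlap pattern: boundary cells first,
    then the halo exchange proceeds while the interior is computed."""
    n = len(state)
    boundary = [0, n - 1]          # cells adjacent to another submesh
    interior = range(1, n - 1)

    # 1. Compute the contributions of boundary cells (needed by neighbours).
    for i in boundary:
        state[i] += 1
    log.append("boundary")

    # 2. Start the non-blocking halo exchange (MPI_Isend/MPI_Irecv in the
    #    real code; mocked here by a callback returning a "request").
    req = exchange_start(state[boundary])
    log.append("exchange started")

    # 3. Overlap: compute interior cells while the exchange is in flight.
    for i in interior:
        state[i] += 1
    log.append("interior")

    # 4. Wait for the halo data before it is used (MPI_Waitall).
    halo = exchange_wait(req)
    log.append("exchange completed")
    return state, halo
```

The essential point is that step 3 runs between the start and the completion of the exchange, so communication time is hidden behind useful work.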

In this work, an implementation of an improved finite volume scheme is developed for a GPU cluster by using MPI and CUDA. This implementation incorporates an efficient management of the distributed unstructured mesh and mechanisms to overlap computation with communication.

The outline of the article is as follows: the next section describes the underlying mathematical model and presents an improvement of a first order Roe-type finite volume scheme, called the IR-Roe scheme. Section 3 describes a data parallel version of the IR-Roe scheme. Several implementations of the IR-Roe scheme and the classical Roe scheme [5], [4] are compared in Section 4. In the next two sections we describe a single-GPU and a multi-GPU distributed implementation, respectively, of the method for triangular meshes. Section 7 analyses the experimental results obtained when the implementations are applied to solve an internal dam break problem on a cluster of 4 GPUs. Finally, Section 8 summarizes the conclusions and presents future work.

Section snippets

The two-layer shallow water system

Let us consider the system of equations governing the 2D flow of two superposed immiscible layers of shallow fluids in a subdomain $\Omega \subset \mathbb{R}^2$:

$$\frac{\partial W}{\partial t} + \frac{\partial F_1(W)}{\partial x} + \frac{\partial F_2(W)}{\partial y} = B_1(W)\frac{\partial W}{\partial x} + B_2(W)\frac{\partial W}{\partial y} + S_1(W)\frac{\partial H}{\partial x} + S_2(W)\frac{\partial H}{\partial y},$$

where

$$W = \left(h_1,\; q_{1,x},\; q_{1,y},\; h_2,\; q_{2,x},\; q_{2,y}\right)^T,$$

$$F_1(W) = \left(q_{1,x},\; \frac{q_{1,x}^2}{h_1} + \frac{1}{2}g h_1^2,\; \frac{q_{1,x} q_{1,y}}{h_1},\; q_{2,x},\; \frac{q_{2,x}^2}{h_2} + \frac{1}{2}g h_2^2,\; \frac{q_{2,x} q_{2,y}}{h_2}\right)^T,$$

$$F_2(W) = \left(q_{1,y},\; \frac{q_{1,x} q_{1,y}}{h_1},\; \frac{q_{1,y}^2}{h_1} + \frac{1}{2}g h_1^2,\; q_{2,y},\; \frac{q_{2,x} q_{2,y}}{h_2},\; \frac{q_{2,y}^2}{h_2} + \frac{1}{2}g h_2^2\right)^T,$$

$$S_k(W) = \left(0,\; g h_1 (2-k),\; g h_1 (k-1),\; 0,\; g h_2 (2-k),\; g h_2 (k-1)\right)^T, \quad k = 1, 2,$$

$$B_k(W) = \begin{pmatrix} 0 & P_{1,k}(W) \\ r\,P_{2,k}(W) & 0 \end{pmatrix}, \qquad
P_{l,k}(W) = \begin{pmatrix} 0 & 0 & 0 \\ g h_l (2-k) & 0 & 0 \\ g h_l (k-1) & 0 & 0 \end{pmatrix}, \quad l = 1, 2.$$

Index 1 in the
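As a sanity check of the notation, the x-direction flux $F_1$ above can be evaluated directly from a state vector $W$. A minimal numpy sketch follows; the value g = 9.81 is an assumption for illustration, not fixed by the text.

```python
import numpy as np

G = 9.81  # gravitational acceleration (assumed value)

def flux_F1(W):
    """Evaluate the x-direction flux F1(W) of the two-layer system,
    with W = (h1, q1x, q1y, h2, q2x, q2y)."""
    h1, q1x, q1y, h2, q2x, q2y = W
    return np.array([
        q1x,
        q1x**2 / h1 + 0.5 * G * h1**2,   # momentum flux + pressure, layer 1
        q1x * q1y / h1,
        q2x,
        q2x**2 / h2 + 0.5 * G * h2**2,   # momentum flux + pressure, layer 2
        q2x * q2y / h2,
    ])
```

The flux $F_2$ is obtained analogously by swapping the roles of the x and y discharge components.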

Parallelization of the scheme

Fig. 1(a) shows a graphical description of the parallel algorithm, obtained from the description of the IR-Roe numerical scheme given in Section 2. The main calculation phases are identified with circled numbers, and the main sources of data parallelism are represented with overlapping rectangles.

Initially, the finite volume mesh is built. Then the time stepping process is repeated until the final simulation time is reached. At the (n+1)-th time step, Eq. (4) must be evaluated to update the
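The time-stepping process just described can be sketched generically. The actual update formula, Eq. (4), is not reproduced in this snippet (the section is truncated here), so `compute_dt` and `advance` are hypothetical placeholders for the CFL-based time step computation and the per-volume update.

```python
def run_simulation(W, compute_dt, advance, t_end):
    """Generic explicit time-stepping skeleton: at each step a stable
    time step dt is computed (CFL condition over all edges) and every
    finite volume is updated; repeat until the final simulation time."""
    t = 0.0
    steps = 0
    while t < t_end:
        dt = min(compute_dt(W), t_end - t)  # clamp so we do not overshoot
        W = advance(W, dt)                  # evaluate the update formula
        t += dt
        steps += 1
    return W, steps
```

In the parallel algorithm, both `compute_dt` (a reduction over edges) and `advance` (an update over volumes) expose the data parallelism represented in Fig. 1(a).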

Roe schemes comparison

In this section we will compare the efficiency of several implementations of the IR-Roe method and the classical Roe scheme introduced in [5].

We consider an internal circular dam break problem in the $[-5,5] \times [-5,5]$ rectangular domain. The depth function is $H(x,y) = 5$, and the initial condition is $W_i^0(x,y) = \left(h_1(x,y),\, 0,\, 0,\, h_2(x,y),\, 0,\, 0\right)^T$, where

$$h_1(x,y) = \begin{cases} 4 & \text{if } x^2 + y^2 > 1.5 \\ 0.5 & \text{otherwise} \end{cases}, \qquad h_2(x,y) = 5 - h_1(x,y).$$

The numerical scheme is run for several triangular meshes (see Table 1). The simulation time interval is
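The initial condition above translates directly into code; a small numpy sketch:

```python
import numpy as np

def initial_condition(x, y):
    """Initial layer depths for the internal circular dam break:
    h1 = 4 where x^2 + y^2 > 1.5 and 0.5 inside, h2 = 5 - h1
    (so the free surface h1 + h2 is initially flat)."""
    h1 = np.where(x**2 + y**2 > 1.5, 4.0, 0.5)
    h2 = 5.0 - h1
    # W0 = (h1, q1x, q1y, h2, q2x, q2y); both layers start at rest
    zeros = np.zeros_like(h1)
    return np.stack([h1, zeros, zeros, h2, zeros, zeros])
```

Since the total depth h1 + h2 equals H everywhere, the discontinuity is purely internal: only the interface between the layers is perturbed.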

CUDA implementation of the IR-Roe method

The CUDA implementation of the algorithm exposed in Section 3 is a variant of the implementation described in [6, Section 7.3]. The general steps of the implementation are depicted in Fig. 1(b). Each step executed on the GPU is assigned to a CUDA kernel and corresponds to a calculation phase described in Section 3. Next, we briefly describe each step:

Build data structure: Volume data is stored in two arrays of float4 elements as 1D textures, where each element contains the data (state, depth
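A host-side analogue of this packing can be sketched with numpy, where each row plays the role of one float4 element. The exact field layout below is an assumption (the snippet above is truncated): here the first array holds the upper-layer state plus the depth, and the second the lower-layer state plus the cell area.

```python
import numpy as np

def pack_volume_data(h1, q1x, q1y, h2, q2x, q2y, depth, area):
    """Pack per-volume data into two float32 arrays of 4-component
    elements, mirroring the two float4 textures on the GPU.
    The assignment of fields to components is illustrative only."""
    vol1 = np.stack([h1, q1x, q1y, depth], axis=1).astype(np.float32)
    vol2 = np.stack([h2, q2x, q2y, area], axis=1).astype(np.float32)
    return vol1, vol2
```

Grouping four 32-bit values per element matches the GPU's vectorized texture fetches, so each volume's data is read in one access per array.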

Implementation on a GPU cluster

In this section a multi-GPU extension of the CUDA implementation detailed in Section 5 is proposed. Basically, the triangular mesh is divided into several submeshes and each submesh is assigned to a CPU process, which, in turn, uses a GPU to perform the computations related to its submesh. We use MPI [26] for the communication between processes. Next we describe how the data of a particular submesh is created and stored in GPU memory.
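The classification of cells at submesh interfaces can be sketched as follows. `find_ghost_cells` is a hypothetical helper, not part of the paper's code: given the partition, it identifies the locally owned cells whose data must be sent to neighbours and the remotely owned "ghost" cells whose data must be received.

```python
def find_ghost_cells(edges, partition, rank):
    """Given the mesh edges (pairs of cell indices), a partition map
    (cell -> owning process) and this process' rank, return the set of
    local boundary cells (owned here, adjacent to another submesh) and
    the set of ghost cells (owned elsewhere, but needed here)."""
    boundary, ghost = set(), set()
    for a, b in edges:
        if partition[a] == rank and partition[b] != rank:
            boundary.add(a)
            ghost.add(b)
        elif partition[b] == rank and partition[a] != rank:
            boundary.add(b)
            ghost.add(a)
    return boundary, ghost
```

The boundary cells are exactly those whose updated states must be exchanged at each time step, which is what makes the communication/computation overlap of the previous sections possible.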

Experimental results

In this section we will test the single and multi-GPU implementations described in Sections 5 and 6, respectively. The test problem and the parameters are the same as those used in Section 4.

We have used the Chaco software [20] to divide a mesh into equally sized submeshes, the OpenMPI implementation [14] and the GNU compiler. All the programs were executed on a cluster formed by four Intel Xeon servers with 8 GB RAM each

Conclusions and future work

In this paper we have presented an improvement of a first order well-balanced Roe-type finite volume solver for a two-layer shallow water system. This numerical scheme has proved to be computationally more efficient than the classical Roe scheme and more suitable for implementation on modern CUDA-enabled GPUs. A multi-GPU distributed implementation of this scheme that works on triangular meshes and overlaps remote communications with GPU computation has been

Acknowledgments

The authors acknowledge partial support from the DGI-MEC projects MTM2008-06349-C03-03, MTM2009-11923 and MTM2009-07719.

Marc de la Asunción received his B.S. degree in computer science in 2003, and two M.S. degrees in 2007 and 2009, respectively, from the University of Granada, Spain. He is currently working towards his Ph.D. degree in shallow water simulations on graphics processing units. His research interests include parallel computing and general-purpose computing on graphics processing units (GPGPU).

References (36)

  • M. Acuña et al.

    Real-time tsunami simulation on multi-node GPU cluster

    The Journal of Supercomputing

    (2009)
  • A. Brodtkorb et al.

    Simulation and visualization of the Saint-Venant system using GPUs

    Computing and Visualization in Science

    (2011)
  • M. Castro et al.

    High order extension of Roe schemes for two dimensional nonconservative hyperbolic systems

    Journal of Scientific Computing

    (2009)
  • M.J. Castro et al.

    GPU computing for shallow water flow simulation based on finite volume schemes

    Boletín de la Sociedad Española de Matemática Aplicada

    (2010)
  • B. Chapman et al.

    Using OpenMP: portable shared memory

    International Journal of Parallel Programming

    (2007)
  • M. de la Asunción et al.

    Simulation of one-layer shallow water systems on multicore and CUDA architectures

    The Journal of Supercomputing

    (2010)
  • M. de la Asunción et al.

    Programming CUDA-based GPUs to simulate two-layer shallow water flows

  • Z. Fan et al.

    GPU cluster for high performance computing



    José M. Mantas is a Senior Lecturer in Computer Science at the Software Engineering Department of the University of Granada (Spain) where he teaches several subjects (undergraduate and postgraduate) related to parallel programming. He has participated in ten research projects about parallel processing, distributed systems and scientific computing. He has published numerous papers in international journals and conference proceedings in the areas of high performance computing and scientific computing.

Manuel J. Castro is a researcher in the field of Numerical Analysis of Partial Differential Equations. His contributions to this field and, in particular, to the numerical analysis of non-conservative hyperbolic systems (with special emphasis on shallow water systems) are showing an increasing impact within the community of applied mathematicians. He won the ECCOMAS J.L. Lions Award for Young Scientists in Computational Mathematics (2008). He has co-directed three Ph.D. theses as well as several Master's theses and student projects. His research interests are: numerical analysis of balance laws and non-conservative hyperbolic systems; geophysical flows, ….

E.D. Fernández-Nieto is Assistant Professor at the University of Seville, Spain, in the Department of Applied Mathematics I. In 2003, the city of Seville recognized his Ph.D. thesis as the best of the University of Seville in the scientific area. He received the Young Researcher Award 2009 from the Spanish Society of Applied Mathematics (SEMA). He has co-directed two Ph.D. theses, with professors M.J. Castro Díaz (University of Málaga) and D. Bresch (U. Chambéry, France), respectively. He has participated in five research projects. To date he has published 24 articles in international journals with an impact factor in the JCR.
