Performance analysis of Volna-OP2 -- massively parallel code for tsunami modelling

The software package Volna-OP2 is a robust and efficient code capable of simulating the complete life cycle of a tsunami whilst harnessing the latest High Performance Computing (HPC) architectures. In this paper, a comprehensive error analysis and scalability study of the GPU version of the code is presented. A novel decomposition of the numerical errors into the dispersion and dissipation components is explored. Most tsunami codes exhibit amplitude smearing and/or phase lagging/leading, so the decomposition shown here is a new approach and novel tool for explaining these occurrences. It is the first time that the errors of a tsunami code have been assessed in this manner. To date, Volna-OP2 has been widely used by the tsunami modelling community. In particular its computational efficiency has allowed various sensitivity analyses and uncertainty quantification studies. Due to the number of simulations required, there is always a trade-off between accuracy and runtime when carrying out these statistical studies. The analysis presented in this paper will guide the user towards an acceptable level of accuracy within a given runtime.


Introduction
The software package Volna-OP2 is used for the simulations of tsunami modelling by a number of research groups around the world since its first introduction in 2011 [1]. The driving force for developing the code was a need within the tsunami research community for a solver which was applicable for analysis of realistic tsunami events and aimed to aid operational tsunami research [1,2]. The code solves the depth-averaged Nonlinear Shallow Water Equations (NSWE) in two horizontal dimensions (x, y) using modern numerical methods for solution of hyperbolic systems. Volna-OP2 can efficiently simulate the complete life of a tsunami from generation induced by bathymetry displacement, propagation and inundation onshore. It can be used for cases of a simplified bathymetry represented by a mathematical formula but also for the complex bathymetry and topography of the examined geographical region. Owing to the use of an unstructured triangular mesh, irregular bathymetric and topographic features can be efficiently captured and represented.
The first operational use of Volna-OP2 for a realistic scenario was for the modelling of sliding and tsunami generation in the St. Lawrence estuary in Canada [3]. The code has been used to model varying tsunamigenic episodes [4,5]; in several cases it has been used in conjunction with statistical modelling to perform comprehensive sensitivity analysis tests and uncertainty quantification [6,7,8,9,10].
Both the originally developed and the newly parallelised version of Volna-OP2 have been carefully validated against well known benchmarks available to the tsunami community [1,2]. However, in the present paper, we pay particular attention to the accuracy of the new GPU version of the code with special emphasis on dispersion and dissipation errors as well as its computational efficiency on the general purpose GPU cluster. Gaining a deeper understanding of the numerical errors present can inform the user towards more accurate real case simulations. The manuscript is organised as follows: in the next section (2), we describe the mathematical model and numerical schemes implemented in the parallel version of the code. Section (3) is dedicated to an analytic benchmark used to explore the accuracy of the code and it is followed by the analysis of the dispersion and dissipation errors in Section (4). Section (5) discusses application of the code to real cases and the scalability of its GPU implementation. The paper is wrapped up by concluding remarks and perspective developments in Section (6).

Nonlinear shallow water equations
The shallow water theory is efficiently used to describe the physics of long waves like tsunamis, which are characterised by very large wavelengths in comparison to the depth of the basin over which they propagate. Neglecting dispersion the NSWE yield: where H = (h + η) is the total water depth, described as the sum of the time-dependent bathymetry h(x, y, t) and the free surface elevation η(x, y, t), u(u, v) is the fluid velocity in the x and y horizontal directions, I is the identity matrix and g is the acceleration due to gravity. Provided that H > 0 the system is strictly hyperbolic.
In the wet/dry transition the system starts to become non-hyperbolic since H = 0 in a dry region. To deal with that an algorithm that solves the shoreline Riemann problem developed by [11] is implemented in the code.

Spatial discretisation
A cell-centered Finite Volume (FV) numerical method is used for the spatial discretization in Volna-OP2 [1].
Other tsunami codes based on the finite volume approach include Tsunami-HySEA [12] and GeoClaw [13]. The numerical flux implemented in a numerical algorithm has to ensure that some standard conservation and consistency properties are satisfied: the fluxes from adjacent control volumes sharing an interface exactly cancel when summed and the numerical flux with identical state arguments reduces to the true flux of the same state. In Volna-OP2 the Harten-Lax-van Leer (HLL) numerical flux was selected to ensure these conditions are met [1]. The HLL approximate Riemann solver was proposed by Harten et al. in 1983 and assumes a twowave configuration for the exact solution [14]. The wave speed chosen according to [15] yields a very robust approximate Riemann solver. The Riemann solver models two waves that travel with speeds s L and s R , the larger signal velocity is represented by s R and the smaller by s L ; three states are identified (Fig 1). The subscripts R and L are used to represent the right and left cell values respectively. The intermediate state is denoted by w * , where w is a vector of the conserved variables (H, Hu, Hv). The numerical flux function of the scheme can be described by: where w L , w R are the two interface states and F L, * ,R denotes the true flux at state w L, * ,R respectively. The right and left states are known. The intermediate state can be determined by applying the Rankine-Hugoniot conditions twice [1]. It then derives that: the wave speeds are computed as: where u nL = u L · n LR , u nR = u R · n LR and u * n and c * are equal to: where c R = √ gH R and c L = √ gH L are the gravity wave speeds for the right and left state of the system respectively and n LR denotes the vector along the shared face between the right and left states. The shortcoming of the HLL scheme is that it cannot resolve isolated contact discontinuities. It can thus become quite dissipative.

Temporal discretisation
A Strong Stability-Preserving (SSP) method is used in conjunction with a Runge-Kutta method for the temporal discretization in VOLNA. In the current version of the code the optimal second order two stage Runge-Kutta scheme SSP-RK(2,2) is used, with optimal Courant-Friedrichs-Lewy (CFL) condition equal to 1. The scheme is given as follows: Where L( w) is defined as the finite volume space discretization operator. The stability of the scheme is guaranteed if the CFL condition is satisfied. The Runge-Kutta scheme is very robust, especially in handling discontinuities. However, the scheme is both dissipative and dispersive [16]. Dissipativity causes a leak of energy from the system while dispersion leads to either phase lag or phase lead. A full explanation of these phenomena is given in Section (4).

Second order extension
The classical finite volume schemes are only first order accurate in space, which is insufficient for most modern computational simulations. Simulating a real tsunami case with a first order accurate scheme would require an unfeasible mesh resolution to obtain meaningful results. So in order to yield second order accuracy in space, a reconstruction technique is implemented. One must ensure that the scheme is total variation diminishing (TVD), i.e no artificial maxima and minima are introduced. Thus, within Volna-OP2 a MUSCL (Monotone Upstream-centered Scheme for Conservation Laws) scheme is implemented. The second order spatial accuracy is achieved by reconstructing the conserved variables on the cell interfaces. The reconstruction relies on calculating the gradient of a conserved variable over a cell and projecting the reconstructed value on the interface. Within Volna-OP2, a least squares gradient reconstruction and Barth-Jesperson limiter [17] are implemented. The reconstructed values given below (3) are then used in the numerical flux calculation: where w is a vector of conserved variables, w( x f ) is the conserved variable evaluated at the interface, α K is the cell specific conserved variable limiter, (∇ w) K is the cell centred gradient and x f − x 0 is a vector pointing from the cell-centre to the face centre. To avoid large gradients being calculated in the wet/dry region of the domain a threshold depth has been introduced to ensure that the code remains stable. When the depth goes below this threshold the scheme doesn't carry out a reconstruction. This threshold depth has been set to be H threshold = 10 −6 m. However, it has been found that a conservative value can ensure greater stability, for example in Section (3), H threshold = 10 −5 m. For the real case (Section (5)), H threshold = 10 −3 m. At present these values are found through a trial and error approach but more research is required on the optimisation of this threshold depth.

Boundary Conditions -Wall/Solid Boundary
For a full explanation on the treatment of boundary conditions the reader is referred to [1]. However, the case of a wall/solid boundary is given here. For all boundary conditions a ghost cell technique is used. This approach allows one to reconstruct the conserved variables on the boundary and thus preserve second order accuracy.
Values of the conserved variables on the ghost cells are defined based on the type of boundary condition needed.
In the following, cell L is defined to be inside and cell R (ghost cell) is outside of the domain. The boundary of the domain is the common edge between cell L and cell R.
For a wall/solid boundary, u · n = 0, where u is the flow velocity and n is the normal vector to the boundary edge. To ensure that this is satisfied, the tangential ( u ) and normal ( u ⊥ ) velocities for the ghost cell are set to be equal and opposite to the those of the interior cell respectively:

Benchmark Test with Analytical Solution
The two dimensional case of a radially symmetric paraboloid is implemented following the analytic solution initially proposed by Thacker [18]. This solution is available in the SWASHES (Shallow-Water Analytic Solutions for Hydraulic and Environmental Studies) library [19]. The major aim of SWASHES is to aid numerical modellers to validate shallow water equation solvers. The oscillatory motion of the paraboloid is described by a periodic solution in which damping is assumed to be negligible. The morphology of the domain is a paraboloid of revolution given by: where L is the length of the domain, h 0 is the water depth at the central point of the domain when the shoreline elevation is zero and α is the horizontal distance from the central point to the shoreline with zero elevation (Fig 2). 0 0 x L α r0 h0 Figure 2: Geometry of the domain used for the analytic solution following [19] The free surface elevation h(r, t) and the velocity components u(x, y, t) and v(x, y, t) are then given by: where ω = √ 8gh 0 /α is the frequency, r 0 is the distance from the central point of the domain to the initial shoreline location and A = (α 2 − r 2 0 )/(α 2 + r 2 0 ). To model the solution we follow the values proposed in [19], where α = 1m, r 0 = 0.8m, h 0 = 0.1m, L = 4m, and T = 3(2π/ω).
We record the free surface elevation in three positions ( Fig (3) highlights a top-down view of the bathymetry and the locations of the wave gauges. In numerical simulations with Volna-OP2 we model the free surface elevation at the three positions with various spatial resolutions as a function of time up to t f in = 10s. An analytic solution at time t = 0 is chosen as an initial condition and ∆t = 0.45∆x for the simulations. This was chosen as it was found to be stable for all the mesh resolutions. Bathymetry (m) Figure 3: Plot of the bathymetry (parabolic bowl) using the parameters outlined by [19]. The colour coding matches the height of the bathymetry. The locations of the virtual gauges are marked by the black stars.
The results of the numerical simulations are shown in (Fig 4) -(Fig 6). where the analytic solution -black dashed, ∆x = ∆y = 0.024m -green, ∆x = ∆y = 0.012m -blue, and ∆x = ∆y = 0.006m -red. In the numerical simulations, ∆t = 0.45∆x. The dashed box is the boundaries of (Fig 7).   The finest mesh (∆x = ∆y = 0.006m) yields a representation closest to the surface elevation given by the analytic solution. The discrepancies between the meshes can only be seen by zooming in on the plots (Fig 7). Focusing on the centre of the domain, we plot the absolute difference between the analytic and the numerical free surface elevation over time (Fig 8). As expected, the numerical error is always larger for the coarser meshes. In order to gain an idea on the numerical order of the scheme, a convergence study was carried out. The L ∞ norm is calculated at 10s for various mesh resolutions (0.048 -0.003)m and then plotted versus the characteristic mesh size (Fig 9).  The results shown in (Fig 8) and (Fig 9) highlight that the scheme is second order accurate in space. However, the role of numerical dispersion and dissipation has not been explored and they should be accounted for when running long time tsunami simulations. We come back to this discussion in section (4).
To check the temporal discretization error we keep the mesh size constant at ∆x = ∆y = 0.006m and  (Fig 11a and Fig 11b).  Changes in time-step within the stability limits while keeping the same spatial resolution almost do not affect the accuracy of the solution (Fig 10). This is highlighted in Fig 11 where the plots of the numerical solution with various time steps overlap each other. However, comparing the subplots of Fig 11 as time goes on, one observes a damping of the numerical signal and a phase shift. The Runge-Kutta scheme implemented in the code is dissipative and we could expect a leak of the energy from the system, thus we can expect the damping of the numerical signal. Reasons for the phase shift in the signal will be explained in section (4). Overall, the results from this section show that the space discretisation has a strong influence on the solution, while reducing the time step has no visible effect.

Error analysis
The exact solution of the discretized equations satisfies a PDE which is generally different from the one to be solved. The original equation is replaced with the modified equation Au n+1 = Bu n , or, in other words The even-order derivatives on the right-hand side produce an amplitude error, or numerical dissipation. The odd-order derivatives on the right-hand side produce a wave-number-dependent phase error known as numerical dispersion. In the long time simulations, the numerical behaviour of the scheme largely depends on the role played by the dispersive and dissipative effects also known as "wiggles" (phase errors) and "smearing" (amplitude errors) respectively. A negative dispersion coefficient corresponds to phase lagging (i. e. harmonics travel too slowly), while positive dispersion coefficients yield phase leading with spurious oscillations occurring ahead of the wave. According to the Lax-Richtmyer Equivalence Theorem [20], if a scheme has a truncation error of order (p, q) and the scheme is stable, then the difference between the analytic solution and the numerical solution in an appropriate norm is of the order (∆t) p + h q for all finite time. It has been observed numerically (see Fig 9) that the numerical solution is second order accurate. Taking into account that the time step is proportional to the spatial resolution that we call h, we can write that the total error is of the order O(h 2 ). To analyse the role played by the dissipation and dispersion errors, we rewrite the error as where C p−1 (t) incorporates all the constants included in the error formula. We choose the three leading terms of this expansion: The first term in this expansion corresponds to the truncation error (also the leading dissipation error). It is followed by the leading dispersion and the secondary dissipation error terms. We assume that the remaining terms are significantly smaller and can be neglected. Next we go back to the simulations with various spatial resolution discussed in Section (3). For each of the three grids, we have the absolute error as a function of time. If we define the spatial resolution h = 0.024m, two other grids have the resolution h/2 and h/4. The system of equations has the form: and its solution is The plots below show the composition of the total error for each of the grids (Fig 12 -Fig 14) , with a scaled surface elevation overlaid to give an idea on when the errors occur.    Figure 14: Decomposition of the absolute error at the centre of the domain (x = y = 0m) for a spatial grid with resolution of ∆x = ∆y = 0.006m and CFL=0.45. The colours correspond to: the truncation error -green, the third order error -red, the fourth order error -blue, the total error -black. The black dashed line is a scaled plot of the surface elevation at the centre of the domain over time.
In all the error decompositions, the components tend to cancel each other, which results in the total error being less than the individual components. However, the leading dispersion error is a dominant component for all the total errors. The leading dispersion error exhibits large negative spikes, this points towards phase lagging as the surface elevation changes from negative to positive at the center of the domain. These large negative spikes in the leading dispersion error coincide with positive spikes in either the truncation or 4 th order dissipative errors. For finer grids the truncation error (leading dissipation error) plays a dominant role. For the larger grid the negative spikes are balanced by the 4 th order dissipative error (Fig 12).
This error analysis is important when considering real cases -see section (5) -as any error could be dominated by either the leading dispersion or dissipation terms. However, this numerical dispersion term could compensate for the fact that physical dispersion is neglected in the nonlinear shallow water equations, as shown by [21].

Real cases
In this section we discuss an actual tsunami simulation done with Volna-OP2 running on the general purpose GPU cluster. The cluster consists of two NVIDIA ® Tesla ® V100 cards, with 5,120 CUDA cores, 16GB max memory size each.
The domain size is 800×1000km and the physical simulation time is 1 hour. The simulation corresponds to a hypothetical scenario of edge volume collapse (765 km 3 ) at the Rockall Bank Slide Complex. A geophysical study of the event taking into account volumetric, rheological and multiphase collapse considerations has been done under a different framework [22]. Convergence using a simple approach of material with visco-plastic rheology sliding in one go is demonstrated. For this study we have chosen four gauges marked by the red dots on  Overall, the numerical simulations with varying spatial discretisation behave similarly. Notable differences can be seen at gauge 4 (Fig 16d), where unresolved bathymetric features of the continental shelf play a role. To investigate the numerical differences we will focus on the output at gauge 3 (Fig 16c). The reason for choosing this gauge is to minimise the effects of these unresolved bathymetric features as the bathymetry between this gauge location and the landslide source is relatively flat. The maximum wave amplitude of the initial tsunami wave at gauge 3 (Fig 16c) has been highlighted above. As there is no true solution to compare with for this real case we will take the simulation results from the finest mesh ∆x = ∆y = 450m as the ground truth. Relative differences between this ground truth and the other simulations are presented in Fig 17 and table 1. When comparing the signals (Fig 17), the coarser meshes exhibit phase lagging and/or damping of the signal, i.e. the maximum tsunami wave arrives later and its amplitude is diminished. This behaviour was highlighted in the previous error analysis -Section 4 -and is thus expected.
It should be noted that for cases which utilize non-uniform mesh resolutions the same error analysis findings will hold true. If the mesh is non-uniform (i.e the characteristic length scale of the cells vary across the domain), the analysis addresses the worst possible scenario and scales all cells by the largest for error computations. Thus, one would expect to see similar behaviour regarding numerical dissipation and dispersion.  Figure 17: Zoom in on the maximum wave amplitudes of the initial tsunami wave at gauge 3 (Fig 16c) Table 1: Relative differences in wave height and arrival time of initial wave at gauge 3 between the coarser meshes and finest one (Fig 16). δη = the difference in maximum wave height and δt = the time delay between the arrival of the maximum wave.
Turning to the performance of Volna-OP2 on the GPU cluster, the table (2) and plot (Fig 18) summarise the runtimes for the various mesh resolutions. One can see that we get a linear speed up of the runtimes. Those interested in the scalability of the code on other HPC architectures are referred to [2]. This analysis of the relative errors (1) and computational efficiency informs the user on what resolution will provide an acceptable level of accuracy within a given time constraint. log(Runtime (s)) 1 Gpu 2 Gpus Figure 18: Runtimes of the various mesh resolutions with Volna-OP2 on a GPU cluster.

Concluding remarks
Based on the current study we can conclude that Volna-OP2 is a robust and efficient parallel solver for the NSWE. It is based on the finite volume scheme for spatial integration, implementing a MUSCL reconstruction and using the 2nd order Runge-Kutta scheme for integration in time. The scheme is conditionally stable with experimentally confirmed CFL=1.0. The code can handle complex geometries and simulate real-life cases. The error analysis shows that it scales quadratically with refining the spatial mesh. However, reducing the time-step does not have a visible effect on the error. The solution amplitude decays in time, and one can observe both damping and phase shifts. So there is an energy leak from the system, which is expected due to the non-conservative Runge-Kutta scheme. However, the error can be minimised by reducing the spatial resolution. The phase error is mostly negative, which corresponds to the phase lag and the artificial wiggles behind the wave front. However, it changes sign occasionally and this leads to the phase lead.
The real case simulations show the efficiency and scalability of the code run on a GPU cluster. The 1hour realistic size tsunami model is simulated in ∼ 24 mins on the finest (450m resolution, ∼ 5.1M cells) mesh using two GPUs. However, the simulations have shown that comparable results can be achieved using a coarser mesh and reduced runtime. Therefore the users should be familiar with code pros and cons before setting up their simulation in order to get the physically meaningful and numerically accurate results. This acceptable level of accuracy and runtime trade off is an important decision when using Volna-OP2 to perform comprehensive sensitivity analysis tests and uncertainty quantification.
To conclude the paper we want to mention that Volna-OP2 is still under development. Therefore its analysis and benchmark testing play an important role for the code's continuous enhancement and improvement. The code is already an important tool for tsunami modelling community. However, we hope that our work will help to its wider adoption and will lead to discussion on the most suitable algorithms and software platforms for realistic tsunami modelling and prediction of its effects.
Guillas acknowledges support from the Alan Turing Institute project "Uncertainty Quantification of multiscale and multiphysics computer models: applications to hazard and climate models" as part of the grant EP/N510129/1 made to the Alan Turing Institute by EPSRC.