Domain-specific implementation of high order Discontinuous Galerkin methods in spherical geometry

We assess two domain-specific languages included in the GridTools ecosystem as tools for implementing a high-order Discontinuous Galerkin discretization of the shallow water equations. Equations in spherical geometry are considered, thus providing a blueprint for the application of domain-specific languages to the development of global atmospheric models. The results demonstrate that domain-specific languages designed for finite difference/volume methods can be successfully extended to implement a Discontinuous Galerkin solver.


Introduction
It has always been challenging for numerical mathematicians to implement new algorithms in a way that attains the best possible performance of the underlying platform. The task has become more difficult with the evolution of new computing architectures, such as Graphics Processing Units (GPUs) or Field-Programmable Gate Arrays (FPGAs). Scientific programmers are forced to learn computing concepts well outside of their original domain.
Domain-specific languages (DSLs) [8] represent an attempt to separate the concerns of the domain scientist from the complexities of the underlying computer science issues. In our case, the designer formulates her problem in terms of numerical operations, including time-stepping algorithms, linear algebra operations, or Partial Differential Equation (PDE) formulations, while a "backend" takes care of generating the appropriate code for the architecture. While programming languages such as C++ or Fortran attempt to abstract away the hardware, they have not kept pace with the emerging hardware complexity and frequently require compiler directives or add-ons to properly exploit the architecture.
Extensive effort has been made in developing frameworks to ease the challenge of solving PDEs in given geometries. On the far end of the spectrum, software frameworks like FEniCS [2] and Firedrake [21] offer a descriptive language to define the PDE and the given domain, along with initial and boundary conditions. These frameworks then generate the code to solve the problem using finite element methods. The former gives the user little latitude to test a new numerical technique, making it more appropriate for domain scientists relying on standard, widely supported methods. Firedrake, on the other hand, has a different underlying implementation allowing more flexibility to employ code optimizations, various finite element mesh topologies, as well as parallelization features, such as support for message passing, multithreading or GPUs. Firedrake utilizes PyOp2 [20], a full-fledged DSL for the parallel execution of computational kernels on unstructured meshes or graphs, to allow these interventions. More generally, PyOp2 can be viewed as a DSL for finite element calculations. PyOp2 was subsequently refined into the Psyclone [24] code-generation system, which was specifically designed to extend Fortran codes. The latter was then used to implement the LFRic [15] atmospheric model. Psyclone is essentially a source-code generator and could conceptually address the algorithms we propose here, but we have chosen to evaluate tools which have been developed internally at the Swiss National Supercomputing Centre (CSCS).
Our interests are in the area of weather and climate simulation, where the mesh is often built by extrusion of a two-dimensional horizontal mesh covering, for example, the whole globe. Our goal is to enable climate and numerical weather prediction (NWP) applications to leverage a variety of architectures by utilizing software backends. A first attempt at a C++-embedded DSL specific to the climate and weather domain was STELLA [11], with which the COSMO dynamical core (the solver of the non-hydrostatic equations of atmospheric motion) was implemented [23]. This prototype was completely replaced by GridTools [1], in which the COSMO and NICAM [16] models have been implemented. The classic implementation of GridTools assumes a Cartesian grid, which explains the choices described in Section 2; however, the newest version, released in 2023, also allows for a fully unstructured mesh.
In this paper we evaluate two DSLs included in the GridTools ecosystem, namely the C++-based Galerkin-for-GridTools (G4GT) and the Python-based GridTools-for-Python (GT4Py) [22], as tools for implementing a high-order Discontinuous Galerkin (DG) method for time-dependent problems, see e.g., [9,13]. The GridTools framework was originally designed to support finite difference/volume methods on rectangular grids and, more recently, on unstructured grids. G4GT was a first prototype to extend these tools for finite element problems. GT4Py is a Python layer above GridTools and thus targets the same FD/FV problems, see for example [5], but we found that with some extensions it can be reused to implement DG solvers. More specifically, we consider an explicit time discretization and a modal DG spatial discretization for a system of conservation laws. Common, but non-trivial, benchmarks, namely linear advection and the shallow water equations, are used to validate our DSL implementations of a DG method. Extensive literature is available on the results obtained for these benchmarks by a wide range of numerical methods, allowing us to assess the numerical and performance results. One of the two implementations deals with equations in spherical geometry, showing that DSLs can be successfully applied to the development of prototype codes for atmospheric modeling.
The structure of the paper is as follows. The flux formulation of the shallow water equations in latitude-longitude coordinates is presented in Section 2. A preliminary implementation in an early C++-based DSL prototype called G4GT is presented in Section 3.1, while the new, Python-based implementation in GT4Py is discussed at length in Section 3.2. The validation of the resulting implementations is discussed in Section 4 and performance benchmarks are presented in Section 5. In Section 6, we recount our experiences using the DSLs and make suggestions on how to better support finite element codes in these frameworks.

The mathematical model and numerical discretization approach
We are concerned with demonstrating the capabilities of DSLs for implementing numerical solutions of conservation laws. A system of conservation laws can be written as

∂u/∂t + ∇ · F(u) = S(u), (1)

where u is the vector of conserved variables, F is the flux function and S is the source term. Equation (1) becomes well posed once complemented with appropriate initial conditions and boundary conditions, see, for example, the discussion in [17]. Since our goal is the application of DSL tools to models for weather and climate, we choose as main model equations the shallow water equations (SWEs) on the sphere, which are a common benchmark for numerical models in this area. Various formulations of these equations can be found in [29]. We consider the Earth's surface as a sphere of radius R, parameterized in latitude-longitude (lat-lon) coordinates as a rectangular domain such that the latitude θ ∈ [−π/2, π/2] and the longitude λ ∈ [0, 2π]. Denote by Ω = 7.292 × 10⁻⁵ s⁻¹ the Earth's rotation rate, by f = 2Ω sin θ the Coriolis parameter and by g = 9.81 m s⁻² the Earth's gravitational acceleration. Furthermore, let î, ĵ denote the longitudinal and latitudinal unit vectors, respectively. For a generic scalar function φ and vector field w = w₁î + w₂ĵ, the spherical gradient and divergence are defined as

∇φ = (1/(R cos θ)) ∂φ/∂λ î + (1/R) ∂φ/∂θ ĵ,  ∇ · w = (1/(R cos θ)) [∂w₁/∂λ + ∂(w₂ cos θ)/∂θ].

Denoting then by h the thickness of the fluid over the spherical surface (assuming flat orography, h_b = 0) and by v = uî + vĵ the velocity field, which is a tangent vector field to the sphere, the continuity equation of the shallow water system in flux form reads

∂h/∂t + ∇ · (hv) = 0.

It is well known that lat-lon coordinates entail a number of numerical difficulties. Indeed, for a Cartesian lat-lon mesh such as the one depicted in Figure 1, the elements become increasingly distorted as they approach the poles. Moreover, the elements precisely neighboring the poles have a singular edge in physical space (these elements reduce to spherical triangles instead of rectangles). While several alternatives have been considered in the literature,
see, e.g., the review in [4], this setting is sufficient for the present purpose of validating GT4Py, which, as discussed in Section 1, supports non-Cartesian meshes only in its latest release. Furthermore, degree adaptivity techniques as in [26] can help reduce the numerical problems of this extremely simple setting. The equations for the spherical components of the velocity field can then be derived taking into account the non-inertial nature of the rotating reference frame on the sphere, see again [29] (Equations 6-7). We then obtain:

∂(hu)/∂t + ∇ · (hu v) = (f + (u tan θ)/R) hv − (g/(R cos θ)) h ∂h/∂λ,
∂(hv)/∂t + ∇ · (hv v) = −(f + (u tan θ)/R) hu − (g/R) h ∂h/∂θ,

where the conserved quantities, fluxes and sources of the general form (1) are thus defined as follows: u collects h, hu and hv, F collects the corresponding advective fluxes, and S collects the Coriolis, curvature and pressure-gradient terms. (5) We then present an overview of the classical DG method chosen for the demonstration of a DSL implementation. A complete description can be found, among many others, in [9,13]. We consider for simplicity the discretization of a scalar conservation law, defined on a two-dimensional rectangular domain. This domain is subdivided into K elements, denoted by D_k, for k = 1, …, K.
We restrict our attention to conforming structured meshes composed of rectangular elements. In each element, let V_h^k be the finite-dimensional space of multivariate polynomials of degree up to r in each spatial dimension. Consequently, the finite-dimensional space in which we seek the solution is the space of discontinuous piecewise polynomials, and the numerical solution can be viewed as the direct sum of local approximations:

u_h = ⊕_{k=1}^{K} u_h^k, (8)

where u_h^k ∈ V_h^k. Due to (8), we can restrict our attention to a single mesh element, dropping the superscript k for simplicity when necessary. We define the local residual as r_h^k := ∂u_h^k/∂t + ∇ · f(u_h^k) − s(u_h^k) and impose that it vanishes locally in a Galerkin sense, i.e., when tested against any suitably defined test function φ_h^k ∈ V_h^k. After integration by parts, the weak DG formulation is given by:

∫_{D_k} (∂u_h^k/∂t) φ_h^k dΩ − ∫_{D_k} f(u_h^k) · ∇φ_h^k dΩ + ∮_{∂D_k} (f* · n^k) φ_h^k dσ = ∫_{D_k} s(u_h^k) φ_h^k dΩ. (9)

In Equation (9), the physical flux at the element boundary has been replaced by a numerical approximation, denoted by f*. This guarantees that the flux is single-valued at each edge, enforcing conservation across any edge. Note that all terms in Equation (9) are local to the k-th element, except the numerical flux, which depends on the neighboring elements. The choice of f* plays a crucial role in the numerical solver's consistency, accuracy and stability. A popular choice of f* is the Rusanov flux, defined as:

f*(u_h^k, u_h^{k′}) · n^k = ½ (f(u_h^k) + f(u_h^{k′})) · n^k − (α/2) (u_h^{k′} − u_h^k), (10)

where k′ is the index of the neighbor element of k across a given edge, and α ≥ 0 is a large enough stabilization parameter, usually chosen to be an estimate of the largest eigenvalue of the hyperbolic system associated to the conservation law. Finally, n^k is the unit normal vector, pointing outwards from D_k. The local solution u_h^k in element k is then written as a linear combination of a polynomial basis:

u_h^k = Σ_{i=1}^{n_φ} û_i^k φ_i^{(k)}, (11)

where we have dropped the suffix h in the notation for the polynomial basis. Furthermore, û_i^k denotes the polynomial expansion coefficients, and n_φ = (p + 1)² represents the cardinality of the polynomial basis, where p represents the maximum degree of the polynomials employed. Multiple
choices exist for the basis set used for the local polynomial spaces. In this study, following e.g., [26], we employ a modal DG approach, which relies on bivariate Legendre polynomials. After inserting the basis expansion from Equation (11) in Equation (9) and using the φ_i^{(k)} as test functions, we obtain the following semi-discrete form:

M dû/dt = h(û). (12)

Here, the vector û collects the polynomial expansion coefficients for all elements, and the matrix M has a block diagonal structure, where the diagonal blocks are the local mass matrices M^{(k)}, with entries M_ij^{(k)} = ∫_{D_k} φ_i^{(k)} φ_j^{(k)} dΩ, associated with each element. Notice that all the terms on the right-hand side have been grouped in the vector function h(û), thus obtaining a spatial semi-discretization that can be fully discretized by the method of lines approach described below. Thanks to the use of a DG discretization, the resulting mass matrix can be inverted locally for each element. Furthermore, in the case of spherical coordinates, we also simplify the definition of the conserved variables in Equations (5) by including the cos θ metric terms directly in the local mass matrix, so that the conserved variables are given by h, hu and hv. Note that the mass matrices are then identical for all elements sharing the same latitude band.
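The metric-weighted local mass matrix just described can be written out as follows; this is a sketch consistent with the definitions above, in which constant factors such as R² are omitted:

```latex
% Local mass matrix with the cos(theta) metric term absorbed,
% so that the conserved variables reduce to h, hu and hv
% (constant factors such as R^2 are omitted here).
M^{(k)}_{ij} \;=\; \int_{D_k} \varphi^{(k)}_i \,\varphi^{(k)}_j \,\cos\theta \,\mathrm{d}\lambda \,\mathrm{d}\theta ,
\qquad i, j = 1, \dots, n_\varphi .
```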
For the time discretization of Equation (12), we follow the classical method of lines approach employing Runge-Kutta (RK) methods. More precisely, we use explicit Strong Stability Preserving (SSP) methods of orders 1 to 4, denoted later as RK1-RK4, see e.g., [10], which can be defined by means of their Butcher tableaux listed in Table 1. These explicit time discretization methods are only conditionally stable. Their stability depends on the value of the non-dimensional parameter known as the Courant number, which is usually defined as c∆t/H, where c denotes some estimate of the largest eigenvalue of the underlying hyperbolic system and H is the minimum element diameter. For DG methods and other high-order finite element techniques, however, it is customary to redefine the Courant number by taking into account the presence of internal degrees of freedom in each element, see e.g., [18,19,26], so that a more appropriate definition is in this case pc∆t/H, where p denotes the maximum polynomial degree. Due to the reduction of the effective element size at the poles and the use of high-order elements, rather small values of the time step have to be chosen to allow for stable simulations.
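As an illustration of the method of lines update, the three-stage SSP-RK3 member of this family can be sketched as follows. This is a generic sketch, not code from either implementation; `rhs` stands for the assembled right-hand side M⁻¹h(û):

```python
import numpy as np

def ssprk3_step(u, rhs, dt):
    """One step of the classical three-stage SSP-RK3 scheme (see e.g. [10])
    for du/dt = rhs(u), written as convex combinations of Euler steps."""
    u1 = u + dt * rhs(u)
    u2 = 0.75 * u + 0.25 * (u1 + dt * rhs(u1))
    return u / 3.0 + (2.0 / 3.0) * (u2 + dt * rhs(u2))

# Sanity check on a scalar ODE du/dt = -u with exact solution exp(-t):
u = np.array([1.0])
for _ in range(100):
    u = ssprk3_step(u, lambda v: -v, 0.01)
# u[0] is now close to exp(-1)
```

The convex-combination form is what makes the scheme strong stability preserving: each stage is a convex combination of forward Euler steps, so any stability bound of forward Euler carries over under the same Courant restriction.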

Implementations in G4GT and GT4Py
We have implemented DG solvers for conservation laws in separate projects with distinct DSLs for planar and spherical geometry. Galerkin-for-GridTools (G4GT) and GridTools-for-Python (GT4Py) are both part of the GridTools (GT) [22] ecosystem, which offers an efficient C++ library that is agnostic to the underlying architecture. It makes extensive use of template meta-programming and has backend optimizations for both CPU and GPU architectures. GT was initially designed to target numerical simulations of PDEs using regular grids and finite-difference schemes. Although the latest version of GT supports unstructured grids, the presented implementation relies on its original version.
We remark that G4GT was simply a proof of concept and is no longer supported. Indeed, the popularity of Python among modern-day programmers led the GT team to switch to the Python-based GT4Py layer, which is actively developed and open source. However, the ideas illustrated by G4GT, in terms of both supported PDE models and computational performance, remain instructive. Thus, before discussing in detail the GT4Py implementation, which should be regarded as the main tool employed, we emphasize a number of features of the G4GT framework that are complementary to those of GT4Py.

Galerkin for GridTools (G4GT) implementation
G4GT is a C++-based extension of the GridTools library that supports finite element codes. It relies on GT for the underlying implementation of the computation kernels, but also on the Trilinos [12] libraries Intrepid [14] and Epetra [7], which provide the numerical support for finite element discretizations and specific linear algebra tools. The G4GT framework provides the link between these libraries, adding a higher-level, user-friendly layer on top of GT and, in particular, support for finite element discretizations, which is not present in GT itself.
C++ templating enables this extension by composing the stages of the flux computation on the appropriate grid. As the main subject of this work is the GT4Py implementation, we do not delve into additional technicalities of G4GT; a more detailed discussion can be found in [6].

GT4Py Implementation
The pipeline of GT4Py is illustrated in Figure 2. The domain scientist expresses the stencils in a user-friendly Python syntax called GTScript, and this code is then processed through a series of toolchain stages that apply optimizations and generate a high-performance executable targeting a specific architecture.

Backends
GT4Py can compile using various backends; see Table 2 for a complete list of those supported at the time of the evaluation. Several others are planned or under development. Three of the seven backends compatible with GT4Py rely on the GT framework to compile and optimize the stencil computations. They are all characterized by the prefix gt:. The gt:cpu_ifirst and gt:cpu_kfirst backends both target the CPU architecture, while the gt:gpu backend produces code for GPUs. Alternatively, there is a naive CUDA backend which only utilizes GT utilities, but not its DSL. Finally, a NumPy backend exists, which can be used to inspect the generated code for debugging purposes.

Stencils
Stencils are special GT4Py functions that operate on fields in a specific domain.Fields store the values of variables at each grid point of the domain.
Declaration. In the following example, we compute the discretized 2-dimensional Laplacian operator via the standard five-point stencil, lap[i,j] = field[i+1,j] + field[i−1,j] + field[i,j+1] + field[i,j−1] − 4 field[i,j], written as a stencil in GT4Py. In the function decorator, we provide the target backend as well as potential backend options. The function expects fields as arguments, on which the stencil computations are executed. GT4Py uses the Python type-hinting system to specify the data type of each field, which in this case is np.float64.
The body of the function requires two context managers which define the execution of the stencil in the vertical direction, the first being computation, which accepts the arguments PARALLEL, FORWARD or BACKWARD. This defines the scheduling of the stencil execution. The keyword PARALLEL, which we use exclusively in our implementation, indicates that there is no dependence between subsequent vertical levels, and hence they can all be solved in parallel. The keywords FORWARD and BACKWARD declare this dependence and indicate the direction in which the vertical levels must be solved. The second context manager, interval, allows the user to specify the vertical indices for which the stencil will be applied. The '...' is a shorthand notation to select the entire vertical domain.
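Putting these pieces together, the laplacian stencil might look as follows; this is a hedged reconstruction based on the GT4Py API available at the time of the evaluation, so the import path, decorator options and type annotations may differ in other versions:

```python
import numpy as np
from gt4py import gtscript

@gtscript.stencil(backend="gt:cpu_ifirst")
def laplacian(field: gtscript.Field[np.float64], out: gtscript.Field[np.float64]):
    with computation(PARALLEL), interval(...):
        # Five-point Laplacian using relative offsets; [0, 0, 0] is implicit
        out = (field[1, 0, 0] + field[-1, 0, 0]
               + field[0, 1, 0] + field[0, -1, 0]
               - 4.0 * field[0, 0, 0])
```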
Finally, we note that the stencil computation is applied at each grid point; hence, relative offsets are used as indices. Note that, if omitted, the offset is assumed to be [0, 0, 0].

Invocation. The above laplacian stencil can be called using the following commands:

nx, ny, nz = field.shape
origins = {"field": (1, 1, 0), "out": (1, 1, 0)}  # or {"_all_": (1, 1, 0)}
laplacian(field, out, origin=origins, domain=(nx - 2, ny - 2, nz))

We provide the fields relevant to the stencil computation as arguments to the function. In addition, we pass two optional keyword arguments, namely domain, which specifies the domain of execution of the stencil, and origin, which defines the origin for each field. In the case of the laplacian stencil, we set these to ensure that the stencil only operates on the inner part of the domain.
The origin argument indicates the relative offset between the different fields. The keyword _all_ can be utilized to set the same origin for all fields that have not been specified separately. Note that the keyword names inside the origins dictionary refer to the names of the fields in the stencil definition and not to the names of the fields in the call to the stencil. Upon invocation of a stencil, GT4Py searches for a cached version and relies on just-in-time (JIT) compilation in case none is found.
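In NumPy terms, the origin and domain settings above restrict the laplacian update to the interior points, which the following runnable sketch makes explicit (plain NumPy arrays stand in for GT4Py storages):

```python
import numpy as np

field = np.random.default_rng(1).standard_normal((8, 8, 1))
out = np.zeros_like(field)

# Equivalent of calling the stencil with origin (1, 1, 0) and
# domain (nx - 2, ny - 2, nz): only interior points are written,
# so no out-of-bounds neighbor access can occur.
out[1:-1, 1:-1, :] = (field[2:, 1:-1, :] + field[:-2, 1:-1, :]
                      + field[1:-1, 2:, :] + field[1:-1, :-2, :]
                      - 4.0 * field[1:-1, 1:-1, :])
```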

Storages
In GT4Py, fields are variables on which stencils can be applied. They store values at each grid point inside a 3-dimensional domain. The DSL provides a storage format for these fields, called gt4py.storages, which is a wrapper over the numpy.ndarray/cupy.ndarray array types and ensures that the memory layout of the data is compatible with the requested backend. The interface provides several methods for instantiating storages, including empty(), ones() and zeros(), as well as from_array(), which builds a storage directly from an existing NumPy array. All of these functions require several additional parameters: shape defines the size of the storage in the three dimensions, and default_origin specifies the default origin to be used in case none is specified during a stencil call. Finally, dtype not only defines the data type of the field but can also be used to assign higher-order tensors to each grid point instead of simple scalar values. These fields are subsequently referred to as higher-dimensional fields.
As an example, each grid point can store a matrix of size 3×2 by passing a tensor dtype at instantiation. Moreover, suppose a field has identical values along one or more spatial dimensions. In that case, GT4Py provides a feature called 'masking', which avoids the storage of unnecessary copies of the identical values while still giving the appearance of a full 3-dimensional field. This can lead to a substantial reduction in memory consumption, which is crucial for large problem sizes. Such a field can, for instance, be masked in the vertical direction. Note that when using a GPU backend, the fields need to be explicitly synchronized from the device back to the host to obtain the results of a stencil computation. In addition, it is recommended to cast the gt4py storage to a numpy.ndarray to ensure that the data has indeed been copied from the device.

Frontend
In this section, we describe the structure of the GT4Py frontend and our contribution to expanding the functionality of higher-dimensional fields.
Abstract Syntax Tree (AST). The Python language uses an interpreter which converts the source code of a program into a representation called an Abstract Syntax Tree (AST) before compiling the program to the bytecode that is executed by the computer. As the name suggests, the AST represents the logic of the program as a tree structure stripped of the specific syntax used in the source code. This representation provides a versatile way to inspect and modify Python applications.
In the case of GT4Py, the frontend parses the Python AST and converts it into a series of custom ASTs through the pipeline (Figure 2), which provide additional information necessary for the backends to produce well-optimized executables.
Limited support for higher-dimensional fields. For our implementation, we represent each DG element by a grid point in GT4Py. Each grid point is thus assigned a vector that stores its polynomial expansion coefficients, which yields a higher-dimensional field as a data structure. We refer to this additional dimension of the field as data_dims.
Initially, the support for these vector-valued fields in GT4Py was limited. In particular, there was no functionality for performing element-wise operations between fields with respect to the data_dims dimension. Indeed, these operations needed to be explicitly written out for each vector component, which made them far less convenient than their scalar-field counterparts. For example, an element-wise multiplication between two higher-dimensional fields had to be spelled out component by component, as in out[0, 0, 0][0] = field1[0, 0, 0][0] * field2[0, 0, 0][0], and similarly for every other component. The first set of indices represents the relative offsets between the fields, while the second set of indices refers to the actual components of the data_dims dimension. Note that in this case, the relative offsets cannot be omitted and need to be specified explicitly.
Loop unroller. We have implemented automatic element-wise operations for vector-valued fields in order to facilitate their use. This has been accomplished by modifying an intermediate GT4Py AST called definition_ir. In this representation, we have access to the size of data_dims for each field, which is essential to determine whether an operation between two fields should be performed element-wise. Our contribution allows rewriting the previous multiplication stencil using the simple syntax out = field1 * field2 inside the usual computation(PARALLEL), interval(...) context. The goal is to modify the GT4Py frontend so that this concise form produces the same AST as the explicitly unrolled stencil. To implement this functionality, we created two helper classes which apply the necessary transformations to the definition_ir. The first one, called UnRoller, receives as argument an AST node representing the right-hand side of an assignment operation and returns a list of AST nodes, where each element of the list pertains to a different index of the higher-dimensional field. The second one is called UnVectorisation. It invokes the UnRoller and checks that the dimensions of the returned list match the dimensions of the field on the left-hand side of the assignment operation. If so, it creates the list of AST assignment nodes with the corresponding indices.
Our implementation supports not only chaining together multiple operations on higher-dimensional fields but also broadcasting of scalar values for scalar-vector operations. The syntax and functionality should be intuitive for anyone familiar with the NumPy package.
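In NumPy terms, which our syntax deliberately mirrors, the element-wise and broadcast semantics behave as follows (a runnable analogy, not GT4Py code):

```python
import numpy as np

# Two "higher-dimensional" fields: a 2x2 horizontal grid with a
# length-3 vector (the data_dims dimension) at every grid point.
field1 = np.arange(12, dtype=np.float64).reshape(2, 2, 3)
field2 = np.full((2, 2, 3), 2.0)

# Element-wise product over data_dims, as produced by the loop unroller
out = field1 * field2

# Scalar broadcasting for scalar-vector operations
scaled = 0.5 * out
```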
Matrix multiplication. An additional operation that we required for our DG scheme was a matrix-vector multiplication between higher-dimensional fields. This was incorporated into our existing framework and can be invoked using the "@" operator. Also, the multiplication of a vector by the transpose of a matrix can be achieved by appending the "T" attribute to the matrix, leading to the syntax out = matrix.T @ vec inside the stencil body.

DG solver: precomputation. At the start of the execution of the program, the GT4Py solver precomputes on the CPU certain variables that remain constant during the whole simulation. This includes the computation of the inverse mass matrix, as well as the Gauss-Legendre quadrature points and weights for numerical integration. A helper class called Vander, defined in vander.py, contains all the Vandermonde matrices required to evaluate the polynomials stored as modal expansion coefficients at nodal values in the domain (see e.g., [13]). These matrices are instantiated as fields using gt4py.storages.
DG solver: stencils. All subsequent computations are carried out using stencils in GT4Py. An example stencil related to our DG solver is presented below. Applying the theory derived in Section 2 to the linear, constant-coefficient advection problem with e.g., β = [1, 1]ᵀ, we need to evaluate the volume integral of the flux tested against the gradients of the basis functions. This integral can be computed using the stencil below:

#...
with computation(PARALLEL), interval(...):
    u_qp = phi @ u_modal
    fx = u_qp * 1
    fy = u_qp * 1
    rhs = determ * (phi_grad_x.T @ (fx * w) / bd_det_x
                    + phi_grad_y.T @ (fy * w) / bd_det_y)

In line 3, the modal expansion coefficients are mapped to nodal values at the quadrature points. In lines 4 and 5, the flux function in the x and y directions is applied. In this simple case, the flux function is the identity due to the constant velocity field β = [1, 1]ᵀ. Finally, in lines 6 and 7, the numerical integration is performed. The scalar field w represents the quadrature weights, while the matrix-valued fields phi_grad_x/y contain the spatial derivatives of the basis functions. The terms determ and bd_det_x/y denote the Jacobians arising from the mapping of the element in physical space onto a reference element. Although not reported here, this description easily generalizes to the SWE case.
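Per element, the computation above amounts to the following NumPy operations; the small matrices here are hypothetical stand-ins for the Vandermonde-type fields, with names mirroring the stencil:

```python
import numpy as np

n_qp, n_modes = 4, 4
rng = np.random.default_rng(42)

phi = rng.standard_normal((n_qp, n_modes))         # modal -> nodal map at quadrature points
phi_grad_x = rng.standard_normal((n_qp, n_modes))  # x-derivatives of the basis functions
w = np.full(n_qp, 0.25)                            # quadrature weights
u_modal = rng.standard_normal(n_modes)             # modal coefficients of one element

u_qp = u_qp = phi @ u_modal      # evaluate the solution at the quadrature points
fx = u_qp * 1.0                  # identity flux for beta = [1, 1]
rhs_x = phi_grad_x.T @ (fx * w)  # quadrature of the flux against basis gradients
```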

Code validation
Firstly, several unit tests have been performed to assess the correctness of our implementations. For the GT4Py implementation, they rely on the existing testing infrastructure of GT4Py, which verifies the success of the code generation as well as of the code execution on all backends, compared against a reference NumPy implementation.
Subsequently, both the G4GT and GT4Py implementations were validated on benchmarks derived from the shallow water test suite [29], as well as on tests in planar geometry presented in [27]. These include a convergence test on linear advection of a smooth profile and on a geostrophic zonal flow, the simulation of geostrophic adjustment on the plane, and that of a Rossby-Haurwitz wave in spherical geometry. Although our final goal is the simulation of the SWE on the sphere, all the presented tests provide useful insight from a numerical point of view and contribute to a progressive increase in the complexity of the solutions.

Linear advection convergence of a smooth initial condition
Since both implementations are essentially solving the same problem, albeit with slightly different flux calculations and numerical implementations, we summarize the convergence results for both in this section. Specifically, we apply our DG discretization to planar linear advection on the unit square, assuming periodic boundary conditions and a constant velocity field β, see Equation (13). We consider the smooth initial condition u₀(x, y) = sin(2πx) sin(2πy), which allows us to achieve optimal convergence rates. The analytic solution of Equation (13) evolves without changing shape in the direction of the velocity field. Due to the periodic boundary conditions, the solution coincides with the initial condition after one full revolution, i.e., at time T = 1. We use a uniform mesh with K elements obtained from the tensor product of √K elements in each of the coordinate directions. We measure the error of the numerical approximation using the L² norm at the final simulation time, and we denote it by ε. The expected spatial convergence behavior for DG methods is ε = O(h^{p+1}) [13], where h is the characteristic mesh size and p is the degree of the local polynomials. To estimate the convergence rate, we compute the discretization errors using two different meshes with characteristic sizes h₁, h₂, and denote them by ε₁, ε₂, respectively. Then, the estimated rate, denoted by r, is computed as r = ln(ε₁/ε₂)/ln(h₁/h₂). The results obtained with the G4GT implementation are reported in Table 3 and agree with the theoretical expectations, thus validating the implementation. Not surprisingly, the GT4Py implementation achieves nearly identical convergence results to the G4GT version, as shown in Table 4.
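The rate estimate can be computed with a short helper like the following; the numbers in the check are illustrative, not values taken from Tables 3-4:

```python
import math

def convergence_rate(err1, err2, h1, h2):
    """Estimated convergence order r from L2 errors on two meshes
    with characteristic sizes h1 > h2: r = ln(e1/e2) / ln(h1/h2)."""
    return math.log(err1 / err2) / math.log(h1 / h2)

# Halving h should reduce the error by a factor 2**(p+1) for a
# degree-p DG method; e.g. for p = 2 the estimated rate is close to 3.
r = convergence_rate(1e-3, 1.25e-4, 1.0 / 16, 1.0 / 32)
```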

G4GT implementation: geostrophic adjustment for planar SWE
For the G4GT implementation, we consider the SWE discretized on a Cartesian mesh in a planar domain. Specifically, we want to show that the scheme is able to reproduce the geostrophic adjustment process, see, for example, the discussion in [27]. Starting from a perturbation of the equilibrium state corresponding to a constant water height, gravitational and rotational forces interact, so that only part of the energy is transported away from the center, leading to a nontrivial stationary solution profile. Consider a square domain Ω = [0, L]² with L = 10⁷ m and a final time of T = 36000 s. The initial velocities and momenta are set to zero, while the height h is initialized as a perturbation of amplitude h₁ = 5 m and characteristic width σ = L/20, superposed on a constant background height h₀ = 1000 m. Assuming an f-plane approximation, the Coriolis parameter f is chosen to be constant and equal to 10⁻⁴ s⁻¹. The problem is completed with periodic boundary conditions. The simulation has been run using 50×50 spatial elements, a polynomial degree r = 3 and the RK4 scheme in time with step ∆t = 100 s. The results are reported in Figure 3. The solution is consistent with the results reported in [27].

GT4Py implementation: geostrophic zonal flow for SWE on the sphere
After validating the planar version of the GT4Py implementation, we consider two of the classical test cases in spherical geometry introduced in [29] for the shallow water equations. Periodic boundary conditions were applied in the longitudinal direction, while in the latitudinal direction the fluxes were set to zero. Indeed, since the edges become singular at the poles, the flux through them must vanish.
In the benchmark denoted as test case 2 in [29], a stationary zonal flow in geostrophic equilibrium is considered. We perform a convergence test for the spatial discretization, using for all polynomial degrees the RK4 scheme for time discretization, with time steps chosen for each resolution so as to keep the Courant number constant and very small. The test case has been run until time T = 2 days on meshes of increasing resolution. The results are reported in Table 5 and show a convergence behavior entirely analogous to that of the linear advection case in planar geometry.

GT4Py implementation: Rossby-Haurwitz wave for SWE on the sphere
The Rossby-Haurwitz wave (denoted as test case 6 in [29]) consists of a large-scale planetary wave that mimics the high/low-pressure systems typical of mid-latitude weather patterns. The test case considers initial data that would result in a stable solution for the barotropic vorticity equation, evolving from west to east without changing shape. It is known that this configuration is ultimately unstable (see, for example, the discussion in [25]), but the instability only arises on a relatively long time scale. Therefore, it is customary to assess the quality of numerical methods by their capability to reproduce a stable eastward-moving pattern for several days. In Figure 4, we show the results of an 8-day simulation of the Rossby-Haurwitz wave on a 40x20 grid using the RK4 method in time with ∆t = 4 s and p = 3 in space. The numerical solution indeed evolves from west to east while maintaining a close resemblance to the initial shape, and the simulated pattern is in good agreement with reference solutions, see e.g. [26].

Performance
In this section, we present performance benchmarks of the G4GT and GT4Py implementations of the SWE in both planar and spherical geometry. In order to provide an adequate comparison, we break the G4GT implementation down into three computing blocks, which allows us to better analyze the complexity and to compare the execution time of the temporal loop across the various backends supported by GT4Py. In both cases, the time spent in the precomputation steps is ignored, as it becomes negligible for long simulation periods.
The G4GT simulations have been run on the compute nodes of Piz Daint at CSCS, using a 12-core Intel® Xeon® E5-2690 v3 processor (single node), characterized by a peak memory bandwidth of 68 GB/s. The GT4Py benchmarks, on the other hand, were performed on a different partition of Piz Daint, with the CPU code executed on two 18-core Intel® Xeon® E5-2695 v4 @ 2.10 GHz processors (each with 77 GB/s peak memory bandwidth) and the GPU code on an NVIDIA® Tesla® P100 with 16 GB of memory (540 GB/s peak memory bandwidth).

G4GT performance evaluation
The geostrophic adjustment setup presented in the previous section can also be used to evaluate the performance of the method and of G4GT in general. Unless stated otherwise, the physical and numerical parameters are therefore kept unchanged, including a grid consisting of 50x50 elements and a time step of 100 s.
The performance evaluation is done using the Roofline model [28]. This is based on the operational intensity, i.e., the number of floating-point operations (flops) per byte of DRAM traffic, and the attainable Gflops per second, i.e., the concrete performance measure. Here, the DRAM traffic accounts for the bytes that are read from or written to main memory after being filtered by the cache hierarchy. Because of hardware limits, the attainable flops per second cannot exceed a fixed threshold, determined by the peak memory bandwidth and the peak floating-point performance. In practice, the actual threshold is determined by running benchmark cases, such as the (bandwidth-limited) STREAM or the (compute-limited) LINPACK benchmark. In our case, these give a memory bandwidth upper bound of 44 GB/s and a peak performance limit of 318 GFlops/s [?]. Thus, for a given operational intensity, an efficient implementation should attain values close to this limit. In our analysis, we decided to ignore cache effects; in other words, every access to a variable is counted towards the required bytes. This is in contrast with the definition given by the model, but a precise estimate of the DRAM traffic is far from an easy task and, in the G4GT framework, no tool is available to measure it appropriately. The matrix-vector multiplication, which is the central operation in the DG implementation, is bandwidth-limited and achieves a performance [6] somewhat below the leftmost (rising) roofline. Based on the way in which the code is structured [6], we can recognize three different kernels, described below. The results for varying polynomial degree p are reported in Figure 5, which compares the performance of the global program and of the kernels separately. As a complement to Figure 5, Table 6 breaks down the computational times. Looking at the overall performance, we observe no significant variations in the operational intensities. This is because variations in the polynomial order lead to similar changes in the number of floating-point operations and in the memory traffic. However, the performance obtained with linear basis functions appears slightly lower than with higher-order polynomials; the total number of operations might not be large enough to attain the expected asymptotic values, causing deviations from the optimal performance.
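The Roofline bound described above can be written as min(peak flops, bandwidth × operational intensity). A minimal sketch using the measured limits quoted in the text (44 GB/s and 318 GFlops/s) as defaults:

```python
def attainable_gflops(intensity, peak_gflops=318.0, bandwidth_gbs=44.0):
    # Roofline bound: the code is either bandwidth-limited (rising part
    # of the roof) or compute-limited (flat part), whichever cap is
    # reached first at the given operational intensity (flops/byte).
    return min(peak_gflops, bandwidth_gbs * intensity)

# Ridge point separating the bandwidth-limited and compute-limited regimes
ridge = 318.0 / 44.0  # ~7.2 flops per byte
```

Kernels whose intensity falls below the ridge point, such as the matrix-vector products of the DG update, are bounded by the rising (bandwidth) part of the roof.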
Looking at the kernels independently, for high values of p the third kernel has the most significant influence on the overall performance. This is expected, since it includes the majority of the computations; specifically, the assembly of the internal integral is the most intensive part, both in terms of resources and of time. On the other hand, the second kernel always has a low computational cost, which is not surprising, as only boundary quantities are involved. The performance of this kernel in terms of floating-point operations per second is consistently low and does not vary with p. Since it is the only phase that involves exchanges between neighboring elements, it is reasonable to assume that cache misses or inefficient memory accesses are present. No particular trend is observed for the first kernel, except for p = 1, where this kernel has the dominant effect on the global performance.

GT4Py performance evaluation
For the GT4Py implementation, we consider both the performance scaling with increasing horizontal problem size and the scaling with an increasing number of vertical layers at constant horizontal size.

Horizontal scaling
We study the scaling of the runtime of the application with increasing horizontal resolution. In the benchmark, we use a 4th-order scheme in space, which corresponds to a vector of size 16 stored at each grid point (data dims = 16). Figure 6 compares the runtimes of the various backends. The gt:cpu ifirst backend was used as the baseline, since it is the best-performing CPU backend.
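The "data dims = 16" layout corresponds to the (p + 1)^2 = 16 nodal coefficients of a degree-3 element stored at each grid point. A hypothetical NumPy sketch of such a storage (the shape convention here is our illustrative assumption, not GT4Py's actual memory layout):

```python
import numpy as np

p = 3                  # polynomial degree of the 4th-order scheme
dofs = (p + 1) ** 2    # 16 nodal coefficients per element
nx = ny = 640          # largest horizontal grid in the benchmark

# One conserved variable: a 2-D grid of elements, each carrying
# its 16 degrees of freedom as an extra (data) dimension.
h = np.zeros((nx, ny, dofs))
```

Doubling nx and ny quadruples the total storage, which is the quadratic scaling observed in Figure 6.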
Each successive data point doubles the number of grid points in both horizontal directions, so the expected asymptotic scaling is quadratic: doubling the grid in both directions results in a fourfold increase in the total number of grid points. In this experiment, the two CPU backends, powered by the GridTools framework, have virtually identical execution times. For small problem sizes, the two GPU backends (namely cuda and gt:gpu) perform worse than the CPU backends. This is due to the under-utilization of the GPU for small-scale problems, which are not able to fully saturate the device; it manifests itself in the delayed onset of asymptotic scaling of the GPU code compared to the CPU, with better GPU performance only for larger problems. Comparing the two GPU backends, we observe that the optimizations provided by the GridTools framework (included in the gt:gpu backend) yield significantly better-performing code than the naive CUDA backend. Note that we were limited to a maximum problem size of 640x640 due to memory constraints on the GPU. One observation from Table 7 is that the fastest GPU backend results in a speedup factor of ∼3.2 over the fastest CPU backend. Since we know from Figure 5a that the performance is limited by memory bandwidth, we expect the speedup to mirror the ratio of the bandwidths listed in Section 5 for the P100 GPU and the two Intel Broadwell processors, namely 560 GB/s : 2x77 GB/s ≈ 3.6.
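The bandwidth argument above amounts to a simple ratio of usable memory bandwidths. A one-line sketch with the figures quoted in the text:

```python
def expected_speedup(gpu_bw_gbs, cpu_bw_gbs_per_socket, sockets=2):
    # For memory-bound code, the expected GPU-over-CPU speedup is
    # roughly the ratio of the aggregate memory bandwidths.
    return gpu_bw_gbs / (sockets * cpu_bw_gbs_per_socket)
```

With the values above, expected_speedup(560.0, 77.0) gives roughly 3.6, close to the measured factor of ∼3.2.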

Vertical scaling
In this section, we try to assess the potential performance of GT4Py on a 3-dimensional problem by considering a set of decoupled 2-dimensional SWE problems, copied in the vertical direction and solved in parallel. This configuration increases the computational load and resembles to some extent that of low-order finite difference/finite volume discretizations of 3-dimensional problems in atmospheric modeling. However, it is substantially different from a full 3-dimensional DG discretization, since the local operators that arise correspond to 2-dimensional rather than 3-dimensional elements. The resulting algorithm scales linearly with the number of vertical levels. Considering that all levels can be solved independently, this problem is embarrassingly parallel, and we might expect performance benefits on the GPU compared to the CPU. Figure 7 illustrates the execution time of a 4th-order DG scheme on a 300x300 grid with an increasing number of vertical levels. Surprisingly, we do not observe any scaling benefits for this experiment on the GPU. Indeed, all backends exhibit asymptotic scaling from the first data point, indicating that the hardware resources are fully saturated. This is most likely due to the 2-dimensional problem solved in each level being too large and fully occupying the memory bandwidth of the GPU. We observe that the CUDA backend performs even worse than the GridTools CPU backend. Moreover, it suffers from poor memory utilization compared to the GridTools GPU implementation, as the last data point could not be gathered because the memory capacity of the GPU was reached.
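The decoupled-levels construction can be sketched as one vectorized update over a vertically stacked state. The array shapes and the placeholder right-hand side below are illustrative assumptions, not the actual GT4Py stencils:

```python
import numpy as np

def advance_levels(u, dt, rhs):
    # u has shape (nx, ny, nlev): nlev independent 2-D states stacked
    # vertically. A single vectorized update advances every level at
    # once; no data is exchanged between levels (embarrassingly parallel).
    return u + dt * rhs(u)

# Placeholder level-wise right-hand side (illustration only).
u0 = np.ones((300, 300, 8))
u1 = advance_levels(u0, 0.1, lambda u: -u)
```

Since the cost per step is proportional to the number of stacked levels, the runtime of this construction grows linearly with nlev, as observed in Figure 7.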

GPU profiling
In this section, we present the results of a brief profiling exercise carried out for the GridTools and CUDA GPU backends using the NSight® code analysis tools from NVIDIA®.
We profile the 4th-order DG scheme for the 2-dimensional SWE on a 640x640 grid. Figure 8 shows an extract of the profile timeline for the first two time steps of the simulation. The green sections indicate memory transfers from the host to the device, while the blue areas highlight kernel executions. To achieve good performance on the GPU, communication between the device and the host needs to be reduced as much as possible. Considering the memory row in Figure 8, we observe communication (more precisely, host-to-device transfers) only in the first time step; these transfers are necessary to copy the initial fields to the GPU. Subsequently, the GPU executes kernels continuously, without having to wait for unnecessary communication from the host.

Conclusions
We have presented two implementations of a high-order DG method in the frameworks of the G4GT and GT4Py domain-specific languages, respectively. These implementations illustrate that a DSL designed for finite difference/volume methods on rectangular grids can, with some extensions, be reused to implement a DG solver. The related performance analysis clearly shows that the DG method is limited by the memory bandwidth, similar to the sparse matrix-vector (SpMV) stencil discussed in [28].
Despite being only a proof of concept, the G4GT DSL has several advantages for the end user over GT, the main one being its simplicity of use. Most of the backend optimizations and the definition of appropriate data structures are hidden in the underlying GT implementation. Therefore, the user can exploit the full capability of GT for a broader range of discretization schemes without delving into the details required to achieve computational efficiency, in complete agreement with the requirements of a good DSL. Several technical reasons, including the need for external libraries and the inability to construct nested functors, as well as the emergence of Python as a programming language for scientific computing, led to the termination of the development of G4GT and the migration to GT4Py.
Despite the user-friendly nature of GT4Py and its increasing support and functionality, we had to expand the support for higher-dimensional fields by supplementing the frontend with a new, clean syntax. This allowed us to port an original Matlab implementation to GT4Py with relative ease. Moreover, we could automatically exploit available accelerators without having to modify the source code. Nonetheless, due to the limited functionality of higher-dimensional fields, we were left writing a considerable amount of boilerplate code in GT4Py. At the time of the implementation, slicing was not supported, which would have allowed us to condense the conserved variables into one large matrix instead of a series of separate vectors. In addition, there was no method for calling functions on higher-dimensional fields, which would have eliminated the need to copy the same functionality into different stencils.
In conclusion, the new GridTools support for higher-dimensional fields opens the door for implementing advanced numerical schemes like DG in a DSL environment.However, our front-end changes should be complemented with corresponding back-end optimizations to achieve better performance.

Figure 3 :
Figure 3: Numerical results for the geostrophic adjustment test case.
Figure 4: Rossby-Haurwitz wave test case: h, u and v at t = 4 days and at t = 8 days.

The three kernels identified in the G4GT implementation are:
1. Common part: the nodal values for the solution and the flux function are computed.
2. Rusanov fluxes: the boundary fluxes are computed and the boundary conditions are applied; this requires communication among neighboring elements.
3. Main computation: the right-hand side is assembled and the solution is updated.
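The Rusanov (local Lax-Friedrichs) flux used in the second kernel has the standard form of a central flux plus a dissipation term scaled by the maximum local wave speed; a minimal scalar sketch (argument names are our own):

```python
def rusanov_flux(uL, uR, fL, fR, smax):
    # Central average of the physical fluxes from the two neighboring
    # elements, plus upwind-type dissipation proportional to the
    # largest local wave speed smax acting on the interface jump.
    return 0.5 * (fL + fR) - 0.5 * smax * (uR - uL)
```

For linear advection f(u) = a*u with a > 0 and smax = a, the formula reduces to the pure upwind flux a*uL.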

Figure 5 :
Figure 5: Performance evaluation of the G4GT implementation of the SWEs in a Cartesian geometry with different spatial degrees p, RK4 scheme.

Figure 6 :
Figure 6: Benchmark of the execution time of the GT4Py backends with increasing problem size. Each subsequent data point doubles the grid points in x and y and thus quadruples the total number of grid points.

Figure 7 :
Figure 7: Benchmark of the best-performing CPU and GPU backends in addition to the CUDA backend. The plot depicts execution time with respect to the number of identical vertical problems solved in parallel. The final data point for the CUDA backend is unavailable due to the memory limit reached on the GPU.

Figure 8 :
Figure 8: Extract of the NSight Systems profile of the gt:gpu backend. The profile indicates consistent execution of kernels on the GPU and no communication between host and device after the first time step.

Table 1 :
Butcher tableaux of the SSP Runge-Kutta methods of order 1 to 4.

Table 2 :
List of supported GT4Py backends. In addition, two backends utilize the Data Centric (DaCe) parallel programming framework [3], developed by the Scalable Parallel Computing Lab at the Swiss Federal Institute of Technology Zurich, namely dace:cpu and dace:gpu, targeting CPUs and GPUs, respectively. At the time of the DG-GT4Py implementation, only prototype implementations of these backends were available, and thus we decided not to include them in the subsequent performance evaluation.

Table 4 :
L 2 errors and estimated rate of convergence r for the linear advection problem, GT4Py implementation.

Table 5 :
L 2 errors and estimated rate of convergence r for the Williamson test case 2, GT4Py implementation.

Table 6 :
Computational times of the G4GT implementation of the SWEs in a Cartesian geometry with different spatial degrees p, RK4 scheme.

Table 7:
Speedup observed versus the gt:cpu ifirst baseline on the largest problem size, for the gt:cpu kfirst, gt:cpu ifirst, cuda and gt:gpu backends.