ExWave: A high performance discontinuous Galerkin solver for the acoustic wave equation

AhighperformanceimplementationofadiscontinuousGalerkindiscretizationwithexplicitRunge–Kutta and arbitrary derivative (ADER) time integration schemes is presented to solve the acoustic wave equation. ForADER, botha globalanda localtime steppingvariant issupplied. Theimplementation isbased on the matrix-free framework of the deal.II finite element library providing efficient evaluation routines for quadrilaterals and hexahedra. The implementation is generic and its applicability is demonstrated for academic examples as well as real world problems like urban acoustics. We present the physical and numerical problem description, the general code structure, and the design principles. (http://creativecommons.org/licenses/by/4.0/).


Motivation and significance
The acoustic wave equation is used in a wide range of applications. Amongst others it allows to predict room acoustics for construction projects or to optimize urban planning and city design in terms of noise. Countless numerical tools exist to solve the acoustic wave equation each with its advantages and disadvantages. Most challenging is the accurate simulation of high frequency waves because they require high spatial and temporal resolution. One group of solution techniques circumventing the difficulties induced by high frequencies and applicable to room acoustics are geometric methods based on ray-tracing, which however are not sufficiently accurate for low-frequencies and diffraction [1]. Another approach is to assume that acoustic waves propagate diffusely in rooms after a sufficient number of reflections and a diffusion equation model is solved (for example, see [2]).
The prediction of acoustics over a wide frequency range requires to solve the acoustic wave equation either with purely numerical schemes with a sufficiently fine discretization or semianalytic schemes relying partly on fundamental solutions of the wave equation. Straightforward finite difference time domain (FDTD) methods are widely used but due to limited computational resources only applied to lower frequencies [3]. More recently, adaptive rectangular decomposition has been proposed, which is a domain decomposition technique relying on the analytic solution of the wave equation in rectangles and additional interface handling [4,5]. They are currently limited to homogeneous sound speed distributions. Besides finite difference schemes, also conventional finite elements, mixed elements, spectral elements, discontinuous Galerkin (DG) methods and finite volume methods have been applied to the acoustic wave equation, see [6][7][8][9][10]  ADER ADER as in (4) or [12] k + 1 -ADERLTS ADER LTS as in [12] k + 1 -ADERADCONFULL adjoint consistent ADER [12] k + 1 -The basis for this work are the publications [11,12] on DG methods for the acoustic wave equation. In this contribution, we present a high performance high-order DG solver for the acoustic wave equation that is general in terms of geometries, frequency range, and speed of sound distributions and is combined with explicit Runge-Kutta as well as arbitrary derivative (ADER) time stepping schemes [13]. Our code supports adaptive mesh refinement in space as well as local time stepping in time, to concentrate the computational work in the regions where the most interesting physics happen. The implementation is based on the deal.II finite element library [14] supplying efficient matrix-free evaluation routines for quadrilaterals and hexahedra [15,16]. The algorithm provides a very high performance on modern hardware in terms of throughput, degrees of freedom processed per second, and scalability [13].
ExWave is valuable for the scientific computing community because it balances high performance and code optimizations with applicability to real world problems and available hardware. For the community of computational acoustics, this code allows to approach currently unexplored scenarios because of its high flexibility in geometry, mesh generation, or in terms of adaptivity (as in [12]) but also delivers solutions of predictable accuracy in reasonable time with moderate computational resources.

Theoretical background
The acoustic wave equation written as first order system in terms of the pressure p and the particle velocity v is given as with the speed of sound c and the mass density ρ. This equation holds on a d-dimensional spatial domain Ω ∈ R d and in the time interval [0, T ] with final time T . The first equation represents the conservation of momentum while the second enforces the conservation of mass. The acoustic wave equation is accompanied by initial conditions on the pressure and the velocity as well as boundary conditions. Currently, Dirichlet boundary conditions for the pressure, homogeneous Neumann boundary conditions for the normal component of the velocity, and a first-order absorbing boundary condition are implemented in ExWave. Spatial discretization is carried out by the DG method using the accurate fluxes from the hybridized DG method as described in [11]. Temporal discretization is carried out by explicit Runge-Kutta schemes as presented in [11] or by ADER time stepping as derived in [12,13]. The solution is expressed by time-dependent unknowns summarized in the vectors V , P. The update rule for an s-stage Runge-Kutta scheme reads and the coefficients a jl , b j from the Butcher tableau of the respective scheme. Therein, ∆t is the time step and the matrices Q and K are the mass and stiffness matrix resulting from the spatial discretization as shown in [12]. For ADER time discretization, the update rule is where k is the polynomial degree of the utilized shape functions, N is the matrix holding the shape functions, the operator S comprises the spatial derivatives representing the wave equation, see [12] for details. Both methods (3) and (4) yield optimal convergence of order k + 1 in the pressure p and the velocity v. By means of reconstruction and postprocessing, even superconvergent results of order k + 2 in p can be obtained.

Numerical example
While the implementation is quite general (see e.g. the example in Section 3.3), we will explain it along with an academic example for which the parameter file is provided in the current code version. The example is a vibrating membrane problem for which analytical solutions are available and implemented, therefore allowing for straightforward convergence tests. On the d-dimensional cube with Ω = [0, 1] d , a vibrating membrane with m modes per coordinate obeys the exact solution with homogeneous Dirichlet boundary conditions. adaptations to more general setups can be easily introduced by extensions of the code as shown in Section 3.3.
The parameter file allows to adapt the spatial dimension (d = 2,3), the spatial discretization (i.e., mesh, mesh refinement, mesh transformation). Additionally, parameters concerning the temporal discretization can be adapted. For time integration, one can choose between the schemes listed in Table 1. Also, the user inputs the final time, output time steps, and the Courant number from which the time step is derived. The parameter m in Eq. (5) is specified by membrane_modes. Other than that, ADER and ADER LTS specific parameters may be set with use_ader_post enabling the reconstruction and spectral_evaluation referring to the fast evaluation introduced in [13].

Software description
The presented code is based on version 9.1-pre 1 of the deal.II finite element library [14]. In order to ensure the functionality and correctness of the code and to ease further developments, unit tests checking the convergence orders for several setups are incorporated using the ctest testing framework.
1 Any commit after 102d808 from Nov. 28, 2018. The main class of ExWave is WaveEquationProblem. Its method run() executes the time loop. The main components are a time integrator derived from ExplicitIntegrator and a spatial operator derived from WaveEquationOperationBase, as shown in Fig. 1. The time integrators execute the vector updates and call the spatial operator application. For ADER, spatial and temporal evaluation are strongly interlinked and the entire evaluation takes place in WaveEquationOpeationADER. The LTS requires a complex update call, which is handled by a ClusterManager which in turn is called by WaveEquationOperationADERLTS.
The class WaveEquationOperation is templated on the dimension d and the polynomial degree k of the shape functions.
It relies heavily on the MatrixFree class of the deal.II library and uses the optimized evaluation routines FEEvaluation and FEFaceEvaluation. Matrix-free operator evaluation allows for a much higher performance compared to classical matrix-based schemes, which is due to a higher arithmetic density. Also, fast integration techniques relying on sum factorization utilizing the tensor product structure of the shape functions are used and explicit cross-cell vectorization is enabled. Details on matrix-free methods with sum factorization techniques and performance in the context of DG for the acoustic wave equation can be found in [15,16] and [13], respectively. WaveEquationOperationBase can be flexibly switched between single and double precision by a typedef value_type.
In the file time_integrators.h, not only the time integrators but also an optimized vector updater are implemented. The vector updater RKVectorUpdater merges several vector updates into a single loop over the entries and thereby reduces the number of required vector reads from five to two per stage. This contribution allows one to increase performance by a factor of 1.7× for LSRK45R2 on modern processors where performance is typically memory-bandwidth limited [13]. One aspect of the code we want to mention explicitly is that it allows to run an iterative CFL stability analysis based on a stability criterion in terms of L 2 pressure errors to find a tight fit of the critical Courant number for a certain problem configuration.

Illustrative examples
In this chapter, one academic example is presented, followed by two performance tests, and last a representative for real world problems.

A convergence test
The correctness of the methods and code is demonstrated with convergence tests. To run a convergence test based on the vibrating membrane, a basic set of input parameters must be specified and then several simulations are run varying the parameter n_refinements. The L 2 pressure error at the final time is output.
Results are shown for two different exemplary configurations. The first setup is a two-dimensional geometry and the temporal discretization relies on ADER LTS. Furthermore, the parameter use_ader_post=true is set to obtain superconvergent results [12]. The second setup is three-dimensional and temporal discretization is based on LSRK33R2. For both setups, the number of membrane modes in the analytic solution is set to the polynomial degree of the shape functions and cfl_number=0.1 with n_initial_intervals = 5 . All other parameters are set as in the default parameter file. Fig. 2 summarizes the results for the tested polynomial degrees k = 1, 2, 3, 4, 5, 6. The expected convergence orders of k + 1 for the pressure p and k + 2 for the postprocessed pressure p * are obtained for all tested polynomial degrees with ADER LTS. For LSRK33R2, the expected convergence rates are only obtained for coarse discretizations. For fine discretizations, the temporal error dominates because LSRK33R2 is only of order three.

Performance evaluation
We compare ADER and LSRK45R2 in terms of the throughput on 28 cores of a dual-socket Intel Xeon Broadwell E5-2690v4 machine operating at 2.6 GHz, compiled with the g++ compiler, version 6.2, at optimization level -march=haswell -03 -funroll-loops.
The setup is as in Section 3.1 with a three dimensional geometry meshed with 80 3 elements for k = 1, 2, 3 and 40 3 elements for k = 4, . . . , 12. Fig. 3 plots the results. A throughput up to 6.33 · 10 8 degrees of freedom per second for ADER with k = 3 is reached, which is 3.9 times higher than the respective throughput for LSRK45R2. For higher polynomial degrees, the advantage of ADER decreases slowly due to the fact that ADER is of order k + 1 with additional computations for the higher orders while LSRK45R2 is of order four for all polynomial degrees.
To demonstrate parallel capabilities, we perform a strong scaling experiment with h-adaptivity based on a three-dimensional  setup as in [13], Section 4.7, incorporating a material inhomogeneity with an impedance mismatch of four and an unstructured mesh capturing a curved interface between the two materials with a high-order mapping. An adaptive update is carried out every 500 time steps at a Courant number of 0.1. Fig. 4 shows the results obtained on the SuperMUC Phase 2 system using 112 to 7168 cores of Intel Haswell E5-2697 v3, running at a frequency of 2.1 GHz. The computational time scales almost ideally while a slowdown can be observed for the adaptation as explained in [19][20][21] and due to the fact that data structures for h-adaptivity in deal.II are not as optimized as the matrix-free implementations. The weak scaling is better since the main cost of the adaptivity is due to the cost of deal.II's generic transfer routines between mesh levels, the geometry evaluation at ghost cells, and cost of data transfer while repartitioning. Writing output scales good for small numbers of cores but degrades after 1792 cores.

Urban acoustics
This example examines sound propagation in a village representing outdoor acoustics, which is relevant for urban planning and city design [22] or useful for gun shot localization in the context of crime control [23]. The geometry under consideration is based on the artificial village presented in [24,25], where a FDTD method and an adaptive rectangular decomposition approach are  ing to walls, roofs, and the ground are assumed to be perfectly reflecting, whereas the first order absorbing boundary condition is applied on all other boundaries. A sound source is located at (62, 104, 1) corresponding to SP1 from [25]. To read in the externally generated mesh and set corresponding boundary conditions and initial pressure fields, adaptations to input_parameters.h are necessary as shown in the branch urbanacoustics of the supplied software. In [25], simulations were run on a mesh consisting of 1.1 · 10 7 grid points and a time step size of ∆t = 3.85 · 10 −4 with 2000 time steps which corresponds to a simulation frequency of 450 Hz, taking 20 min on a single core CPU machine. Here, we run simulations on a discretization of average grid spacing 1.2, k = 3, resulting in 1.15 · 10 7 grid points with ∆t = 2.45 · 10 −5 and 19704 time steps to reach the same final time. Fig. 6 plots several pressure snapshots to give an impression of the sound propagation patterns. Simulation on 28 cores of a two socket Intel Xeon E5-2690 v4 Broadwell 2.6 GHz system requires 2.5 · 10 3 seconds which corresponds to 76 CPU minutes for 2000 time steps. Hence, a computation on the same geometry with the same number of grid points and time steps is only 3.8 times slower compared to the adaptive rectangular decomposition. This is a very good result, considering that the adaptive rectangular decomposition relies on a semi-analytic approach using the discrete cosine transform in the decomposed rectangles and considering that their time integration is not of order five but a two step type. A comparison of the computational performance considering accuracy and temporal stability should be addressed by future work.

Impact
The presented algorithm was used in several research papers [11][12][13]. The code impacts two communities. First, for the scientific computing community, this code is a representative of a very efficient implementation of matrix-free operator evaluation with throughput close to much simpler finite difference methods, despite support for complex meshes, including local mesh refinement and curved meshes. The code allows to run conclusive performance tests as shown in [13] and Section 3.2, and therefore enables researchers concerned with high performance computing to test and validate new methods to reduce computational time. This new concept can be applied in a wide variety of problems beyond the acoustic wave equation, such as other wave propagation problems in electromagnetics or seismics. Furthermore, the Runge-Kutta related code part can be easily adapted to nonlinear problems, such as the compressible Navier-Stokes equations with explicit time stepping schemes. As demonstrated by results in [26], the present concepts are faster by a factor three to five over the state of the art in the DG literature. In addition, it is an ideal test bed for introducing advanced features, such as cut finite element technology for representing complicated geometries on unfitted meshes, e.g. by the ongoing work [27].
Secondly, this code will have an impact on computational acoustics. Historically dominated by finite difference and semianalytic methods, we now have a code at hand which is not only high performing but also applicable to problems of practical relevance. It offers more flexibility compared to finite difference solvers, e.g. in terms of meshes and adaptivity.

Conclusions
We presented the high performance code ExWave to solve the acoustic wave equation based on higher order DG spatial discretization in combination with explicit Runge-Kutta and ADER global and local time stepping relying on the deal.II finite element library, or, more precisely, the matrix-free framework supplying fast quadrature with sum factorization. Also, ExWave allows to validate implementations in terms of spatial and temporal convergence, to determine temporal stability limits in terms of a CFL stability analysis, as well as to measure and compare computational performance for different discretizations. Urban acoustic simulations, as exemplary shown in Section 3.3, enable city planning or street canyon design minimizing noise exposure. A potential extension is the implementation of a variety of boundary conditions that are commonly used in room and urban acoustics. A future study should compare this solver to adaptive rectangular decomposition methods not only in terms of computational time but also accuracy.