A high-order cross-platform incompressible Navier–Stokes solver via artificial compressibility with application to a turbulent jet ✩

incompressible Navier–Stokes formulations compressibility. combined with explicit dual time stepping and a high-order Flux Reconstruction discretisation, the of operations be as compute bound matrix–matrix multiplications that are well-suited for GPU acceleration and manycore processing. In this work, we develop a high-order cross-platform incompressible Navier–Stokes solver, via artificial compressibility and dual time stepping, in the PyFR framework. The solver runs on a range of computer architectures, from laptops to the largest supercomputers, Additional comments including restrictions and unusual features: The algorithm targets modern massively parallel hardware platforms. Cross-platform capability is achieved via runtime code generation.


Introduction
The Computational Fluid Dynamics (CFD) community benefits from a wide range of methods for solving flow problems. Currently, nominally second-order accurate Finite Volume (FV) methods underpin the majority of industrial CFD, due to their robustness, and ability to work on unstructured meshes of complex geometries. However, over the past three decades, significant research has been undertaken into developing methods with increased spatial orders-of-accuracy. These high-order methods have been shown to offer superior accuracy to low-order methods, with the same or lower computational cost [1,2], and have demonstrated potential for performing scale-resolving turbulent simulations more efficiently than their low order counterparts. However, issues remain, especially in terms of industrial adoption. These include a high memory footprint for implicit time stepping, and the challenge of generating high-order curved element meshes [2].
Early attempts at developing high-order accurate spatial discretisations were based on finite difference and pure spectral approaches [3,4], and were hence limited to simple geometries. The first spatially compact high order approach suitable for unstructured grids was the Discontinuous Galerkin (DG) formulation by Reed and Hill [5] for simulating neutron transport. Later, it was popularised and applied to CFD by Cockburn et al. [6,7]. DG is formulated using the integral representation of the governing system by combining a Finite Element type polynomial approximation inside the element, and a FV type interface flux. Other popular high-order schemes with a resemblance to nodal DG are the Spectral Difference (SD) methods by Liu et al. [8,9], which are a generalised version of the staggered grid Chebyshev multidomain method by Kopriva and Kolias [10]. A more recent approach is the Flux Reconstruction (FR) method proposed by Huynh [11] which unifies nodal DG and SD schemes within a single framework. When combined with explicit time stepping, all schemes in the FR framework retain a compact stencil. This spatial locality results in minimal communication between elements, which is of particular importance for modern hardware platforms that are characterised by an excess of compute capability relative to memory bandwidth.
Leveraging the full potential of modern hardware platforms with a pressure based algorithm for solving the incompressible Navier-Stokes equations is difficult since they exhibit significant indirect communication during the elliptic projection stage. The elliptic projection can be bypassed by adopting the Artificial Compressibility Method (ACM) of Chorin [12], which was developed in the late 1960s as an alternative way of solving steady incompressible fluid flow problems. Rather than projecting the pressure with a Poisson equation, a coupling pressure term is added to the continuity equation that drives the system towards a solenoidal velocity field in the limit of steady state. The original formulation preserves the hyperbolic nature of the system, but destroys time accuracy. A common procedure for recovering time accuracy is dual time stepping [13]. In this approach, real time is discretised with a backward difference scheme, whose solution is found by marching the governing equation in pseudo time. In the context of the ACM, the divergence free velocity constraint is satisfied at each physical time step. Adopting explicit dual time stepping as a temporal discretisation and using high-order FR in space, the majority of operations can be cast as compute bound matrixmatrix multiplications that are well-suited for GPUs and manycore processors.
There have been various attempts to apply the ACM in a highorder context. Bassi [14] first succeeded in applying the approach with DG to solve steady incompressible flow problems. Later, Liang et al. [15] extended the method to SD schemes in 2D, providing support for unsteady flows via dual time stepping. Recently, Cox et al. [16] successfully applied the method with FR in 2D, and introduced fully implicit inner iterations. In the current study, we extend the cross-platform high-order CFD framework PyFR (www.pyfr.org) [17,18] to include an incompressible Navier-Stokes solver based on FR, the ACM, dual time stepping, and P-multigrid.
The paper is structured as follows. Section 2 gives a brief overview of the high-order FR approach for mixed element unstructured grids. Sections 3 and 4 detail the ACM and how it can be advanced in time using P-multigrid accelerated dual time stepping. Section 5 describes our cross-platform implementation in the PyFR framework, that can run on a wide range of hardware architectures. Section 6 verifies the platform independence on Nvidia Tesla P100 GPUs and Intel Xeon Phi 7210 KNL manycore processors with a 3D Taylor-Green vortex test case. Additionally, the utility of P-multigrid for convergence acceleration is demonstrated. In Section 7, we apply the solver to a 3D turbulent jet problem at Re = 10,000, which has relevance to many industrial application areas and natural flow phenomena, such as hydrojet propulsion, cooling systems, and seafloor plumes. Excellent agreement with available experimental data is obtained. Finally, conclusions are drawn in Section 8.

Flux reconstruction
A summary of the FR method for mixed unstructured grids is given below. For further details see [19][20][21]. To begin, consider a finite solution domain in Euclidean space Ω ∈ R N D with a coordinate system x = x i ∈ R N D in N D dimensions. In this domain, we wish to solve an advection-diffusion problem written in conservative form as where the conservative field variables are defined as u α = u α (x, t) and the corresponding flux is f α = f α (u α , ∇u α ). The subscript α refers to a single field variable or its flux, where N v is the total number of field variables. A conforming mesh comprising of a set of suitable element types ε defines the discrete representation of the solution domain. Such elements are for example line elements in N D = 1, quadrilaterals and triangles in N D = 2, and hexahedra in N D = 3. The elements in the discrete domain must satisfy where the subscript e refers to all elements of a certain type and |Ω e | denotes their total count such that 0 ≤ n ≤ |Ω e |.
All calculations are performed in a transformed space by mapping each element Ω e into its respective standard elementΩ e ∈ R N D , represented in a coordinate systemx =x i ∈ R N D . The mapping functions relating the two element spaces are defined as The Jacobian matrices and determinants related to the mapping are where∇ = ∂/∂x i . Using the expressions above, the transformed flux and transformed gradient of the solution can be expressed as Furthermore, Eq. (1) can be written in terms of the transformed divergence of the transformed flux as To proceed, consider defining a nodal solution basis within a standard element. Takex u ei to be a set of standard element solution points associated with a given element type, where 0 ≤ i < N e . Examples of such point distributions are Gauss-Legendre or Gauss-Lobatto points, when N D = 1, Witherden-Vincent points [22] for triangles, when N D = 2, and Shunn-Ham points [23] for tetrahedra, when N D = 3. The set of solution pointsx u ei within the standard element Ω e can be used to define a nodal basis set l ei (x) of order P(N e ) that spans a polynomial space P . The nodal basis must satisfy the property l ei (x u ej ) = δ ij . Following the conventions in [24], we first form any modal basis L ei , typically using Jacobi polynomials that span P , and calculate the entries in the generalised Vandermonde matrix V eij . Second, we construct the nodal basis as l ei (x The first step of the FR approach involves computing the discontinuous solution at the flux points from the interpolating solution polynomial These values can be used to find a common interface solution at the flux points via where C (u R , u L ) is a scalar function, analogous to a Riemann solver that returns the common solution, where the subscripts R and L denote the right and left states respectively. The next step is to compute the gradient of the solution at the solution points. For this purpose, we form a vector correction function g f ei (x) which satisfieŝ and subsequently define the transformed gradient of the solution at the solution points as which transforms into physical space as where J −T ein = J −T en (x ei ). Further, the transformed flux at the solution points can be evaluated as and the normal component of the transformed flux at the flux points as Common normal fluxes at the flux point pairs are found with a Riemann solver F (u R , ∇u R , u L , ∇u L , n). The common values are assigned as The normal common interface fluxes transform into the standard element space via The divergence of the continuous flux can be computed with a procedure analogous to Eq. (14), as ] . (21) This serves as the right-hand side in Eq. (9) and the solution can be advanced in time using a suitable time-integration method.

Artificial compressibility for the Euler equations
In the ACM formulation of the Euler equations, the conservative variables in three dimensions x = {x y z} T are where p is the pressure and v = {v x v y v z } T are the velocity components. The total flux is defined as ζ is the artificial compressibility relaxation factor, andî,ȷ andk are unit vectors in the x, y and z directions, respectively. This formulation of the governing equations has a hyperbolic nature; artificial pressure waves traveling at a finite speed are introduced. These waves distribute the pressure and disappear in the limit of pseudo steady state. Eigenvalues of the inviscid flux Jacobian matrices where compressibility relaxation factor ζ which can be adjusted to reduce the system stiffness. The common interface flux F only contains the inviscid term such that F = F e , and the inviscid interface flux is obtained using a Rusanov Riemann solver [25] as where max(λ n ) = |n · v| + c n , with c n = √ (n · v) 2 + ζ the maximum wave speed in the interface normal direction.

Artificial compressibility for the Navier-Stokes equations
In the ACM formulation of the Navier-Stokes equations, the total flux is defined as and ν is the kinematic viscosity. The total normal interface flux is now defined as F = F e − F v , where the viscous part F v is computed using the local discontinuous Galerkin (LDG) approach as presented in [24], The common interface solution needed for the gradient is computed with The two free parameters in the expressions above, β and τ , control the degree of upwinding/downwinding and the solution jump penalisation at the interface. Common practice is to select β = ± 1 2 and 0 ≤ τ ≤ 1.

Overview
A dual time stepping formulation for the artificial compressibility Euler/Navier-Stokes equations can be written as ∂u ∂τ where the first term is a pseudo time derivative which is marched towards zero, resulting in a pseudo steady state. A cancellation ma- When the solution is integrated with respect to pseudo time, the real time derivative can be considered as a source term for the right-hand side that is updated at every physical time step. Table 1 Coefficients for the BDFs.
Discretising the real time derivative with an M-stage BDF, and the pseudo time derivative with a k max -stage explicit multistage scheme, we can write a single pseudo iteration as where the superscripts n and m denote real and pseudo time levels, k is the stage index and α is the stage coefficient. Different real time levels are needed for the BDF expansion in the right hand side computation according to where B σ are the coefficients of the BDFs as listed in Table 1. In this study, the leading term in the BDF is expressed at the first RK stage, which is a common practice in so-called point implicit source term treatment. The treatment results in a scaling coefficient α PI = 1 + α k ∆τ B 0 /∆t which limits ∆τ if ∆τ ≫ ∆t. Being explicit in the pseudo time, the coefficient was found to have no advantage, and in this study we default to α PI = 1.
The subiteration procedure defined by Eq. (32) is repeated until the solution is deemed to have converged. Specific stopping criteria include the following: residuals being driven to a prescribed tolerance, or a fixed number of pseudo iterations being reached. Subsequently, the pseudo steady-state solution is stored as the physical solution and assigned as the initial value for the next time level u n+2,0 = u n+1 . Furthermore, the remaining terms in S t are updated accordingly, u n−(σ +1) = u n−σ for σ = 1...M. Algorithm 1 illustrates the dual time stepping approach with a BDF2 expansion.

P-Multigrid
The efficiency of the dual time stepping algorithm depends on fast convergence of the pseudo steady-state process. The disadvantage of using explicit schemes in pseudo time is that their stability region is limited by the CFL number making them inefficient at eliminating low frequency error modes on fine grids [26]. The issue can be addressed by adopting implicit schemes to allow much larger pseudo time steps [16]. However, the trade-off with implicit schemes is that their storage requirements are significantly higher due to the size of the flux Jacobian matrices which so far appears to have prohibited their use in large scale 3D applications at high orders. Moreover, many methods for solving the resulting linear system, such as LU-SGS or ILU-GMRES, are not scale invariant and increasing the amount of parallelism reduces the numerical effectiveness of the preconditioner.
The P-multigrid technique (perhaps better referred to as multi-P) attempts to circumvent the CFL condition by correcting the solution at different polynomial levels. Without altering the computational grid, low order representations of the solution, which exhibit a less restrictive CFL limit, can be used to propagate information with a larger ∆τ . Another quality of P-multigrid is that when the solution is projected to a lower order space, the low frequency error modes appear as higher frequencies respective to the resolution, which allows explicit steppers to eliminate them more effectively.
In the P-multigrid pseudo time formulation, the following equation is solved where r l is a multigrid source term arising from lower order representations of the solution and the subscript denotes a P-multigrid level l corresponding to any polynomial order between 0 and P.
At the highest P-multigrid level l = P, the source term r l = 0.
Following the methodologies in [15] and [26], a single P-multigrid V-cycle can be formulated according to Algorithm 2. In contrast to [15], our formulation performs the residual defect restriction and P-multigrid source evaluation only on the divergence term R and the real-time derivative term is only accounted for during the smoothing iterations. This was found to modestly accelerate convergence with respect to wall-time, since for each restriction operation it reduces the number of pointwise kernel calls by two. Restriction and prolongation operations can be performed in several ways. In our work, the restriction operator is constructed using the generalised Vandermonde matrix as whereĨ is an non-square matrix of zeros, except on its main diagonal, where it has entries of one. The approach corresponds to transforming the nodal solution into a modal representation, removing the highest order mode and transforming back to a nodal basis. Prolongation is simply performed as Lagrangian interpolation according to

PyFR overview
PyFR is a cross-platform framework for solving advectiondiffusion problems using the FR approach. One of the main advantages of PyFR is that it produces platform portable code with a single implementation using Python and Mako. Fig. 1 provides an overview of the framework. PL is the hardware independent Python layer which constitutes the main body of the framework. PL initialises the data layout, precomputes all necessary matrices, and defines the chain of kernel calls that drive the simulation. Distributed memory parallelism and input/output are also handled by PL. Mesh partitioning, when required, is outsourced to METIS [27].  The kernels called by PL can be divided into matrix multiplication kernels MK and pointwise kernels PK. The MK are used for all operations where a time-wise constant operator matrix multiplies a large state matrix, such as interpolation, extrapolation, and antialiasing. An example is interpolation of field variables from the solution points onto the flux points, which is performed for a given element type via where All MK are off-loaded to GEMM (GEneral Matrix Multiply) subroutines from vendor supplied BLAS (Basic Linear Algebra Subprograms) libraries or bespoke GiMMiK [28] and LIBXSMM [29] kernels. The latter two kernel libraries leverage the a priori known structure/sparsity of the operator matrices and can considerably reduce wall-clock time in certain circumstances [28]. PK are used for operations that require pointwise calculations, such as computing the non-linear fluxes, Riemann solves, and boundary conditions. They are built at runtime by passing platform-unified Mako kernel templates to a Mako derived templating engine that produces low level platform specific source code. The low level code is then compiled as shared libraries and linked to the PL at runtime. The templating engine automatically generates appropriate indexing for vectorisation and efficient memory accesses which vary across platforms.  which are registered in Internal and MPIInt, respectively. This approach allows the data exchange and rank-local computation to happen simultaneously. The third interface class BoundaryInt registers all boundary condition PKs. Additionally, kernels for computing the common interface solution C u f are needed for all interface classes that require the viscous flux calculation. Fig. 3 demonstrates a two step procedure for registering a PK to the PL. As an example, we only consider a PK for computing the discontinuous fluxf u that is registered to the ACNavierStoke-sElements PL class. All other classes in Fig. 2 are built using an analogous approach. First, the kernel must be registered to the backend by providing the location of the Mako template, which is done on line 6. Second, the kernel must be added to the pointwise kernels dictionary as a lambda function, which is done on line 11. The lambda function must specify keyword arguments for all variables that are passed from PL to PK, which include pointers to the input and output matrices, u (state), smats (mapping), f(flux), and auxiliary constant variables to facilitate templating tplargs, dims.

Boundary conditions
All boundary conditions were implemented as separate boundary PKs. These PKs impose the boundary conditions via ghost states for the underlying pseudo time system, which converge to physical boundary conditions in the limit of pseudo steady-state. Specifically, three ghost states per boundary condition are required. In the Riemann solver, an inviscid ghost state B Rie u replaces the right-hand-side solution state u R . In the LDG approach, a viscous ghost state B LDG u replaces right-hand-side solution state u R , and a solution gradient ghost state B LDG ∇u replaces the right-hand-side gradient ∇u R .
The following boundary conditions are available for the ACM solvers. A velocity inlet condition is imposed via ghost states defined as where the superscript b denotes a user-specified free-stream value and the subscript L denotes the domain-side state. A pressure outlet condition is imposed via ghost states defined as A no-slip wall is imposed via ghost states defined as where the superscript w denotes the wall-velocities which are zero for stationary wall. A slip-wall is imposed via ghost states defined as (53)

P-multigrid accelerated dual time stepping
Dual time stepping was implemented as a new DualIntegrator composite class that comprises of DualController (DC), DualPseudoStepper (DPS), DualStepper (DS) and MultiP (MP) subclasses as per Fig. 5. To begin, the DualIntegrator/MultiP class launches an instance of a systemcls for each polynomial order in the P-Multigrid cycle, which have the ability to compute the divergence of the flux R. Without P-multigrid acceleration, a single instance at the solution order P is created. In case of multiple systems, only one instance can be active at a time. The active system is a MultiP class property which returns a system corresponding to a class variable self.level. For example, the system ACSystem 1 is activated by changing self.level = 1. Additionally, MultiP pre-computes and stores the prolongation and restriction operators. The MultiP implementation also supports having different GEMM providers (vendor BLAS, GiMMiK, LIBXSMM) and anti-aliasing options across P-multigrid levels.
System activation and time-integration are managed by the DualController subclass. Specifically, it advances the solution to an arbitrary physical time by calling a sequence of step(), restrict(l1, l2) and prolongate(l1, l2) methods, which form the multigrid cycle. The arguments l1/l2 are the multigrid levels before/after application of the operator. The cycle can be arbitrary between the levels P and 0, and it is repeated until the solution has converged. The convergence is monitored using platform specific reduction kernels. These kernels were implemented independently for each backend, allowing efficient per-field-variable reduction in the context of an AoSoA (Array of Structures of Arrays) data layout.
The step() method, which sets the explicit smoother scheme, is defined in the DualPseudoStepper subclass. Current options are the forward-Euler, TVD-RK3 and RK4 schemes. The physical time discretisation method is defined in the DualStepper class.
The BDF source term is automatically added to appropriate entries of the right-hand-side state matrix during the step() call. Options for the physical time discretisations are backward-Euler, BDF2 and BDF3. Both stepper subclasses were structured to facilitate the addition of further schemes using simple Python syntax.

Overview
A 3D Taylor-Green vortex test case at a Reynolds number Re = 1600 (based on the diameter and peak velocity of the initial vortices) was studied to verify platform independence and investigate the effect of P-multigrid acceleration on performance. In the test case, a set of large vortices interact with each other, transition to turbulence, and finally decay via viscosity. Initial conditions defining vortices with a diameter of 1.0 and a peak velocity of 1.0 were prescribed as v x = sin(x) cos(y) cos(z) , in a periodic domain −π ≤ x, y, z ≤ π . Fig. 6 shows volume renderings of the vorticity magnitude for the initial condition and during the enstrophy peak at t = 8.
A total of three double precision simulations were performed on two state-of-the-art platforms. First, the case was run with the CUDA backend on two Nvidia Tesla P100 GPUs with and without P-multigrid. Additionally, the multigrid accelerated simulation was repeated with the OpenMP backend on two Intel Xeon Phi 7210 KNL manycore processors. All simulations were launched with an optimal kernel configuration which was found a priori by systematically studying the effect of GiMMiK and LIBXSMM cutoff parameters on the performance. On P100 GPUs, the optimal performance was achieved by offloading all matrix multiplications with number-of-non-zero elements less than 1800 to GiMMiK, which means that only the restriction and prolongation operator between P = 4 and P = 3 were computed by cuBLAS. On Intel Xeon Phi 7210 KNLs, LIBXSMM outperformed any combination of MKL and GiMMiK and was thereby globally enforced by setting its cut-off based on matrix size to 100,000. Moreover, to achieve optimal performance, the MCDRAM of the Intel Xeon Phi 7210 KNL processor was configured to work in flat mode, and an  OpenMP/MPI approach with two ranks per node to was used to saturate the interconnects.
All simulations were performed on a hexahedral mesh of 52 × 52 × 52 equisized elements using PyFR v1.7.5. The solution polynomial order used in this study was P = 4 with a Gauss-Legendre solution point distribution. No subgrid-scale turbulence model or spatial filtering was applied, and the simulation can be considered as implicit LES or under-resolved DNS. The physical time was integrated using a BDF2 scheme with a time step size ∆t = 0.006. The classical RK4 scheme was used as the pseudo time scheme/smoother in all simulations, and the artificial compressibility factor was fixed as β = 3. To ensure impartial comparison of the P-multigrid and the single-level simulation, all simulations where performed with a pseudo-time step size ∆τ = 0.0014 that was optimised for the single level case. The optimal ∆τ was found by studying the rate of convergence with a constant number of RK4 iterations at the enstrophy peak t = 8 via a bisection approach. A 5-level cycle 1-1-1-1-2-1-1-1-3 with the RK4 smoother corresponding to polynomial orders 4-3-2-1-0-1-2-3-4 was used in the P-multigrid accelerated simulations, and the pseudo time step sizes at lower levels were increased according to ∆τ l = 1.85 4−l ∆τ .
To exclude the effects of convergence monitoring on the measured wall-times, all simulations were performed with a fixed number of pseudo-iterations per physical time step, specifically three cycles for the P-multigrid accelerated cases, and 75 RK4 iterations for the single-level case. These values were selected heuristically to ensure both the P-multigrid and single-level cases achieved similar levels of convergence in velocity field divergence, which is directly proportional to the pressure residual. Furthermore, it was verified a posteriori that the velocity field divergence was on average 1.25 times lower for the P-multigrid accelerated cases compared with the single level case, leading to conservative estimates for any Pmultigrid speed-up. Table 2 shows a summary of the simulations. The mesh and input files are provided as Electronic Supplementary Material to this paper. Fig. 7 plots the temporal evolution of D, the solution-point-wise L2 norm of the velocity field divergence evaluated at the end of each physical time step, for Cases 1 and 2. It was found that D was on average 1.25 times lower for the P-multigrid accelerated cases compared with the single level case. It can also be seen that the effectiveness of P-multigrid is most apparent at the beginning of the simulation, when lengths scales are larger, due to more effective low wave number smoothing. When the large vortices break into smaller structures near the enstrophy peak at t = 8, the low wave number smoothing associated with P-multigrid becomes less important, and both Case 1 and Case 2 converge to the same level. Fig. 8 shows the dissipation of the total kinetic energy

Results
and the temporal evolution of enstrophy  where ω is the vorticity vector. The volume integrals were computed using quadrature rules, the quadrature nodes being the 4th order Gauss-Legendre solution points. Fig. 8 also includes reference results from van Rees et al. [30] who used an incompressible pseudo-spectral code with 512 3 degrees of freedom, and compressible PyFR results at Ma = 0.1 from Vermeire et al. [1] with identical resolution to the current setup. Three observations can be made. First, results of Case 2 and Case 3 are identical which verifies the platform and backend independence. Second, the P-multigrid accelerated Case 2 produces slightly more accurate results than Case 1, which is in line with the better convergence observed in Fig. 7. Third, the P-multigrid accelerated Case 2 results are more accurate than the low-Mach compressible solution from Vermeire et al. [1] at the enstrophy peak at t = 8, and homogeneous turbulence decay phase t > 15, which suggest that the ACM formulation captures the incompressible physics better than a low-Mach approach. Table 3 shows the wall-time for each case. The results confirm that P-multigrid yields an over 3.5 times speed-up compared to single level pseudo time stepping, while maintaining slightly better solution accuracy and convergence. The results also show that the OpenMP backend on Intel Xeon Phi 7210 KNLs is approximately 2.5 times slower than the CUDA backend on Nvidia Tesla P100s.
To put the wall-times in context, the compressible Ma = 0.1 simulation of Vermeire et al. was repeated on the same Nvidia Tesla P100 hardware as Cases 1 and 2, using the configuration file provided in the Electronic Supplementary Material of [1]. The walltime of the compressible Ma = 0.1 simulation with P = 4 and adaptive explicit RK45 time stepping was 05:44:02 on two Nvidia Tesla P100 GPUs. This is 1.6 times longer than the time required for Case 2 to complete, further demonstrating the utility of an ACM approach with P-multigrid convergence acceleration. of the jet) was studied to validate the solver, and to assess strong scalability on massively parallel systems. The test case was chosen since experimental data are available for comparison [31], and it has relevance to many industrial application areas and natural flow phenomena, such as hydrojet propulsion, cooling systems, and seafloor plumes. Influential experiments and a general theory of incompressible turbulent round jets are discussed in the review of Lipari and Standsby [32]. Round jets at various Reynolds numbers have also been studied numerically: closest to our setup being DNS by Boersma [33] at Re = 5000 and explicitly filtered LES by Bogey and Bailly [34] at Re = 11,000. However, both of these studies have been performed with compressible codes at higher Mach numbers, 0.6 and 0.9, respectively. Fig. 9 shows the computational grid in the y − z plane at x = 0 together with a schematic of the simulation setup in the y − x plane at z = 0. The diameter of the jet was 0.5. The origin was located at the center of the jet as it enters the domain. A two dimensional unstructured circular grid of diameter 24 was extruded in x for a distance of 50 in 250 equally sized steps. The resulting 3D mesh contained 247,250 hexahedral elements in the center of the domain 0 < r < 2.5, where r = √ y 2 + z 2 , and 596,500 prismatic elements elsewhere. The virtual origin is located at (x 0 ,0,0), which is the starting point of the self-similar region associated with a linear velocity decay and spreading rate [32].

Incompressible jet at
The jet inflow profile with a peak velocity of 1.0 was imposed as V jet (r) = 0.5 − 0.5 tanh [20 (r − 0.25)] .
(60) A no-slip condition was applied at the boundary surrounding the jet inflow zone, a slip-wall condition was applied at the cylindrical far-field boundary, and a pressure of 10 was imposed at the outlet. A sponge layer was found to be necessary to dissipate the jet before it impinged on the outlet. Specifically, it was imposed via a spatially dependent source term S defined as where u out = {10 0 0 0} T . The mesh and input files are provided as Electronic Supplementary Material to this paper.

Strong scaling
A strong scaling study for the jet test case was undertaken to demonstrate the scalability of the solver and to find optimal runtime parameters for the full physics run. The solution polynomial order was selected as P = 4, and the physical time was advanced using BDF2 with ∆t = 0.005. A 5-level P-multigrid with an RK4 smoother cycle 1-1-1-1-2-1-1-1-3 was identified to yield good performance in pseudo time. Additionally, due to small and highly curved elements near the center of the domain, flux divergence anti-aliasing of orders between 7 and 4 (7-7-6-5-4-5-6-7-7) was added to increase the stability of the smoothing iterations. The time step sizes between the P-multigrid levels were varied with ∆τ l = 1.7 4−l ∆τ , where ∆τ = 0.0007, and the artificial compressibility parameter was kept constant at β = 2.5.
The simulation was performed with three P-multigrid cycles per physical time step, leading to D ≈ 8 × 10 −4 as the simulation progressed. The GiMMiK number-of-non-zero cut-off parameter was specified as 3560, which outsourced the restriction/prolongation between levels P = 3 and P = 4 to cuBLAS, while the rest of the matrix multiplications were performed by GiMMiK. In addition, strong scaling was studied without P-multigrid acceleration using single level P = 4 pseudo time stepping using 75 iterations, but an otherwise identical setup. Double precision was used throughout. Fig. 10 shows strong scaling from 9 through 144 Nvidia P100 GPUs with and without P-multigrid acceleration. Strong scaling in both cases is almost linear up to 36 Nvidia P100 GPUs after which both cases start to tail off. The P-multigrid accelerated case experiences quicker decline due to the presence of low-order iterations with fewer degrees of freedom. For example, on 144 Nvidia P100 GPUs, there are only 1.64 P = 0 solution points per CUDA core, whereas the corresponding number at P = 4 is over 146. Nevertheless, P-multigrid still improves the time to solution on 144 Nvidia P100 GPUs by over a factor of 2.5 compared to single level pseudo time stepping.

Results
The full turbulent jet case was simulated using PyFR v.1.7.5 with P-multigrid acceleration and the runtime parameters described in Section 7.2 on 96 Nvidia Tesla P100 GPUs with double precision. The simulation was performed as implicit LES so that no subgridscale model or spatial filtering was applied. The simulation was first run to t = 750, at which time spurious transients were observed to have been dissipated by the sponge region. Subsequently, the simulation was restarted and turbulent statistics were gathered until t = 1800, when the time-averaged quantities were found to be visually converged.  9. Computational grid in the y − z plane at x = 0 together with a schematic of the simulation setup in the y − x plane at z = 0. The virtual origin is located at (x 0 ,0,0).

Fig. 10.
Strong scaling of the incompressible jet on Nvidia Tesla P100 GPUs with Pmultigrid (P-MG) and without P-multigrid (RK4). The red line indicates ideal strong scaling. Fig. 11 shows the volumetrically rendered instantaneous velocity magnitude field at t = 750, illustrating the shape of the jet and the resolution of the turbulent scales. In addition, we can see that the outlet sponge region does not have noticeable adverse effects on the jet development. Figs. 12 and 13 provide a comparison between the computational results and experimental data at Re = 11,000 from Panchapakesan and Lumley [31]. Fig. 12a shows 1/v c as a function of a shifted coordinate x − x 0 , where v c is the time-averaged streamwise midline velocity and x 0 is set as 2.5 to allow comparison with the experimental data. Fig. 12b shows v x /v c as a function of the self-similarity coordinate η defined as where v x is the stream-wise velocity averaged both in time, and over conical planes of constant η in the region 10 ≤ x ≤ 30. The profiles are in very good agreement with the experimental data and demonstrate that the correct average decay rate is achieved not only in the midline but also elsewhere in the jet. Fig. 13 shows v ′ x /v c and v ′ r /v c as a function of η, where v ′ x and v ′ r are the stream-wise and radial root-mean-square velocity fluctuations, respectively, averaged over conical planes of constant η in the region 10 ≤ x ≤ 30. Again, the simulation results show very good agreement with experimental data. These results demonstrate that the ACM solver is able to simulate fully turbulent incompressible flows at scale on massively parallel systems.

Conclusions
A high-order cross-platform incompressible Navier-Stokes solver has been implemented in the PyFR framework using the ACM and P-multigrid accelerated dual time stepping. The extensibility of the cross-platform templating framework defined within PyFR has been clearly demonstrated. Platform independence was verified on Nvidia Tesla P100 GPUs and Intel Xeon Phi 7210 KNL manycore processors with a 3D Taylor-Green vortex test case. Additionally, the utility of P-multigrid for convergence acceleration was demonstrated; reducing time-to-solution by a factor of 3.5 compared to single level pseudo time stepping. Finally, the solver was applied to a 3D round jet test case at Re = 10,000, and excellent agreement with experimental data was obtained.
The new software constitutes the first high-order accurate cross-platform implementation of an incompressible Navier-Stokes solver via the ACM and P-multigrid accelerated dual time stepping to be published in the literature. The technology has   13. Plot of (a) v ′ x /v c as a function of η, where v ′ x is the root-mean-square stream-wise velocity fluctuation averaged over conical planes of constant η in the region 10 ≤ x ≤ 30. Plot of (b) v ′ r /v c as a function of η, where v ′ r is the root-mean-square radial velocity fluctuation averaged over conical planes of constant η in the region 10 ≤ x ≤ 30. applications in a range of sectors, including the maritime and automotive industries. Moreover, due to its cross-platform nature, the technology is well placed to remain relevant in an era of rapidly evolving hardware architectures.