PyFR: An Open Source Framework for Solving Advection-Diffusion Type Problems on Streaming Architectures using the Flux Reconstruction Approach

High-order numerical methods for unstructured grids combine the superior accuracy of high-order spectral or finite difference methods with the geometric flexibility of low-order finite volume or finite element schemes. The Flux Reconstruction (FR) approach unifies various high-order schemes for unstructured grids within a single framework. Additionally, the FR approach exhibits a significant degree of element locality, and is thus able to run efficiently on modern streaming architectures, such as Graphical Processing Units (GPUs). The aforementioned properties of FR mean it offers a promising route to performing affordable, and hence industrially relevant, scale-resolving simulations of hitherto intractable unsteady flows within the vicinity of real-world engineering geometries. In this paper we present PyFR, an open-source Python based framework for solving advection-diffusion type problems on streaming architectures using the FR approach. The framework is designed to solve a range of governing systems on mixed unstructured grids containing various element types. It is also designed to target a range of hardware platforms via use of an in-built domain specific language based on the Mako templating engine. The current release of PyFR is able to solve the compressible Euler and Navier-Stokes equations on grids of quadrilateral and triangular elements in two dimensions, and hexahedral elements in three dimensions, targeting clusters of CPUs, and NVIDIA GPUs. Results are presented for various benchmark flow problems, single-node performance is discussed, and scalability of the code is demonstrated on up to 104 NVIDIA M2090 GPUs. The software is freely available under a 3-Clause New Style BSD license (see www.pyfr.org).

Unusual features Code makes extensive use of symbolic manipulation and runtime code generation through a domain specific language.
Running time Many small problems can be solved on a recent workstation in minutes to hours.

Nomenclature
Throughout we adopt a convention in which dummy indices on the right hand side of an expression are summed. For example C i jk = A i jl B ilk ≡ l A i jl B ilk where the limits are implied from the surrounding context. All indices are assumed to be zero-based. A quantity in transformed spacê A vector quantity of unit magnitude T Transpose (u) A quantity at a solution point ( f ) A quantity at a flux point ( f ⊥ ) A normal quantity at a flux point Common solution at an interface F α Common normal flux at an interface

Introduction
There is an increasing desire amongst industrial practitioners of computational fluid dynamics (CFD) to undertake high-fidelity scale-resolving simulations of transient compressible flows within the vicinity of complex geometries. For example, to improve the design of next generation unmanned aerial vehicles (UAVs), there exists a need to perform simulations-at Reynolds numbers 10 4 -10 7 and Mach numbers M ∼ 0.1-1.0-of highly separated flow over deployed spoilers/air-brakes; separated flow within serpentine intake ducts; acoustic loading in weapons bays; and flow over entire UAV configurations at off-design conditions. Unfortunately, current generation industry-standard CFD software based on first-or second-order accurate Reynolds Averaged Navier-Stokes (RANS) approaches is not well suited to performing such simulations. Henceforth, there has been significant interest in the potential of high-order accurate methods for unstructured mixed grids, and whether they can offer an efficient route to performing scale-resolving simulations within the vicinity of complex geometries. Popular examples of high-order schemes for unstructured mixed grids include the discontinuous Galerkin (DG) method, first introduced by Reed and Hill [1], and the spectral difference (SD) methods originally proposed under the moniker 'staggered-gird Chebyshev multidomain methods' by Kopriva and Kolias in 1996 [2] and later popularised by Sun et al. [3]. In 2007 Huynh proposed the flux reconstruction (FR) approach [4]; a unifying framework for high-order schemes for unstructured grids that incorporates both the nodal DG schemes of [5] and, at least for a linear flux function, any SD scheme. In addition to offering high-order accuracy on unstructured mixed grids, FR schemes are also compact in space, and thus when combined with explicit time marching offer a significant degree of element locality. As such, explicit high-order FR schemes are characterised by a large degree of structured computation.
Over the past two decades improvements in the arithmetic capabilities of processors have significantly outpaced advances in random access memory. Algorithms which have traditionally been compute bound-such as dense matrix-vector products-are now limited instead by the bandwidth to/from memory. This is epitomised in Figure 1. Whereas the processors of two decades ago had FLOPS-per-byte of ∼0.2 more recent chips have ratios upwards of ∼4. This disparity is not limited to just conventional CPUs. Massively parallel accelerators and co-processors such as the NVIDIA K20X and Intel Xeon Phi 5110P have ratios of 5.24 and 3.16, respectively.
A concomitant of this disparity is that modern hardware architectures are highly dependent on a combination of high speed caches and/or shared memory to maintain throughput. However, for an algorithm to utilise these efficiently its memory access pattern must exhibit a degree of either spatial or temporal locality. To a first-order approximation the spatial locality of a method is inversely proportional to the amount of memory indirection. On an unstructured grid indirection arises whenever there is coupling between elements. This is potentially a problem for discretisations whose stencil is not compact. Coupling also arises in the context of implicit time stepping schemes. Implementations are therefore very often bound by memory bandwidth. As a secondary trend we note that the manner in which FLOPS are realised has also changed. In the early 1990s commodity CPUs were predominantly scalar with a single core of execution. However in 2013 processors with eight or more cores are not uncommon. Moreover, the cores on modern processors almost always contain vector processing units. Vector lengths up to 256-bits, which permit up to four double precision values to be operated on at once, are not uncommon. It is therefore imperative that compute-bound algorithms are amenable to both multithreading and vectorisation. A versatile means of accomplishing this is by breaking the computation down into multiple, necessarily independent, streams. By virtue of their independence these streams can be readily divided up between cores and vector lanes. This leads directly to the concept of stream processing. We will refer to architectures amenable to this form of parallelisation as streaming architectures.
A corollary of the above discussion is that compute intensive discretisations which can be formulated within the stream processing paradigm are well suited to acceleration on current-and likely future-hardware platforms. The FR approach combined with explicit time stepping is an archetypical of this.
Our objective in this paper is to present PyFR, an open-source Python based framework for solving advection-diffusion type problems on streaming architectures using the FR approach. The framework is designed to solve a range of governing systems on mixed unstructured grids containing various element types. It is also designed to target a range of hardware platforms via use of an in-built domain specific language derived from the Mako templating engine. The current release of PyFR is able to solve the compressible Euler and Navier-Stokes equations on unstructured grids of quadrilateral and triangular elements in two-dimensions, and unstructured grids of hexahedral elements in three-dimensions, targeting clusters of CPUs, and NVIDIA GPUs. The paper is structured as follows. In section 2 we provide a overview of the FR approach for advection-diffusion type problems on mixed unstructured grids. In section 3 we proceed to describe our implementation strategy, and in section 4 we present the Euler and Navier-Stokes equations, which are solved by the current release of PyFR. The framework is then validated in section 5, single-node performance is discussed in section 6, and scalability of the code is demonstrated on up to 104 NVIDIA M2090 GPUs in section 7. Finally, conclusions are drawn in section 8.
Consider the following advection-diffusion problem inside an arbitrary domain is the flux of this conserved quantity and x = x i ∈ R N D . In defining the flux we have taken u in its unscripted form to refer to all of the N V field variables and ∇u to be an object of length N D × N V consisting of the gradient of each field variable. We start by rewriting Equation 1 as a first order system where q is an auxiliary variable. Here, as with ∇u, we have taken q in its unsubscripted form to refer to the gradients of all of the field variables. Take E to be the set of available element types in R N D . Examples include quadrilaterals and triangles in two dimensions and hexahedra, prisms, pyramids and tetrahedra in three dimensions. Consider using these various elements types to construct a conformal mesh of the domain such that and Ω e = where Ω e refers to all of the elements of type e inside of the domain, |Ω e | is the number of elements of this type in the decomposition, and n is an index running over these elements with 0 ≤ n < |Ω e |. Inside each element Ω en we require that It is convenient, for reasons of both mathematical simplicity and computational efficiency, to work in a transformed space. We accomplish this by introducing, for each element type, a standard elementΩ e which exists in a transformed space, x =x i . Next, assume the existence of a mapping function for each element such that along with the relevant Jacobian matrices These definitions provide us with a means of transforming quantities to and from standard element space. Taking the transformed solution, flux, and gradients inside each element to bẽ and letting∇ = ∂/∂x i , it can be readily verified that as required. We note here the decision to multiply the first equation through by a factor of J −1 en . Doing so has the effect of takingũ en → u en which allows us to work in terms of the physical solution. This is more convenient from a computational standpoint.
We next proceed to associate a set of solution points with each standard element. For each type e ∈ E take {x (u) eρ } to be the chosen set of points where 0 ≤ ρ < N (u) e (℘). These points can then be used to construct a nodal basis set { (u) eρ (x) } with the property that (u) eρ (x (u) eσ ) = δ ρσ . To obtain such a set we first take { ψ eσ (x) } to be any basis which spans a selected order ℘ polynomial space defined insideΩ e . Next we compute the elements of the generalised Vandermonde matrix V eρσ = ψ eρ (x (u) eσ ). With these a nodal basis set can be constructed as (u) eρ (x) = V −1 eρσ ψ eσ (x). Along with the solution points inside of each element we also define a set of flux points on ∂Ω e . We denote the flux points for a particular element type as {x . Let the set of corresponding normalised outward-pointing normal vectors be given by {n A pictorial illustration of this can be seen in Figure 2.
The first step in the FR approach is to go from the discontinuous solution at the solution points to the discontinuous solution at the flux points where u (u) eρnα is an approximate solution of field variable α inside of the nth element of type e at solution pointx (u) eρ . This can then be used to compute a common solution where C α (u L , u R ) is a scalar function that given two values at a point returns a common value. Here we have taken eρn to be the element type, flux point number and element number of the adjoining point at the interface. Since grids in FR are permitted to be unstructured the relationship between eρn and eρn is indirect. This necessitates the use of a lookup table. As the common solution function is permitted to perform upwinding or downwinding of the solution it is in general the case that Hence, it is important that each flux point pair only be visited once with the same common solution value assigned to both C α u Further, associated with each flux point is a vector correction function g with a divergence that sits in the same polynomial space as the solution. Using these fields we can express the solution to Equation 5b as where the term inside the curly brackets is the 'jump' at the interface and the final term is an order ℘ − 1 approximation of the gradient obtained by differentiating the discontinuous solution polynomial. Following the approaches of Kopriva [15] and Sun et al. [3] we can now compute physical gradients as where J −T (u) eσn = J −T en (x (u) eσ ). Having solved the auxiliary equation we are now able to evaluate the transformed flux where J (u) eρn = det J en (x (u) eρ ). This can be seen to be a collocation projection of the flux. With this it is possible to compute the normal transformed flux at each of the flux pointsf Considering the physical normals at the flux points we see that which is the outward facing normal vector in physical space where n ( f ) eσn > 0 is defined as the magnitude. As the interfaces between two elements conform we must eσn . With these definitions we are now in a position to specify an expression for the common normal flux at a flux point pair as The eσnα arises from the desire for the resulting numerical scheme to be conservative; a net outward flux from one element must be balanced by a corresponding inward flux on the adjoining element. It follows that The common normal fluxes in Equation 15 can now be taken into transformed space via F αf where J . It is now possible to compute an approximation for the divergence of the continuous flux. The procedure is directly analogous to the one used to calculate the transformed gradient in Equation 9 which can then be used to obtain a semi-discretised form of the governing system where eρn . This semi-discretised form is simply a system of ordinary differential equations in t and can be solved using one of a number of schemes, e.g. a classical fourth order Runge-Kutta (RK4) scheme.

Overview
PyFR is a Python based implementation of the FR approach described in section section 2. It is designed to be compact, efficient, and platform portable. Key functionality is summarised in table Table 1.
The majority of operations within an FR step can be cast in terms of matrixmatrix multiplications, as detailed in Appendix A. All remaining operations (e.g. flux evaluations) are point-wise, concerning themselves with either a single solution point inside of an element or two collocating flux points at an interface. Hence, in broad terms, there are five salient aspects of an FR implementation, specifically i.) definition of the constant operator matrices detailed in Appendix A, ii.) specification of the state matrices detailed in Appendix A, iii.) implementation of matrix multiply kernels, iv.) implementation of point-wise kernels, and finally v.) handling of distributed memory parallelism and scheduling of kernel invocations. Details regarding how each of the above were achieved in PyFR are presented below.

Definition of Constant Operator Matrices
Setup of the seven constant operator matrices detailed in Appendix A requires evaluation of various polynomial expressions, and their derivatives, at solution/flux points within each type of standard element. Although conceptually simple, such operations can be cumbersome to code. To keep the codebase compact PyFR makes extensive use of symbolic manipulation via SymPy [16], which brings computer algebra facilities similar to those found in Maple and Mathematica to Python. SymPy has built-in support for most common polynomials and can readily evaluate such expressions to arbitrary precision. Efficiency of the setup phase is not critical, since the operations are only performed once at start-up. Since efficiency is not critical, platform portability is effectively achieved by running such operations on the host CPU in all cases.

Specification of State Matrices
In specifying the state matrices detailed in Appendix A there is a degree of freedom regarding how the field variables of each element are packed along a row. The packing of field variables can be characterised by considering the distance, ∆ j (in columns) between two subsequent field variables for a given element. The case of ∆ j = 1 corresponds to the array of structures (AoS) packing whereas the choice of ∆ j = |Ω e | leads to the structure of arrays (SoA) packing. A hybrid approach wherein ∆ j = k with k being constant results in the AoSoA(k) approach. An implementation is free to chose between any of these counting patterns so long as it is consistent. For simplicity PyFR uses the SoA packing order across all platforms.

Matrix Multiplication Kernels
PyFR defers matrix multiplication to the GEMM family of sub-routies provided a suitable Basic Linear Algebra Subprograms (BLAS) library. BLAS is available for virtually all platforms and optimised versions are often maintained by the hardware vendors themselves (e.g. cuBLAS for Nvidia GPUs). This approach greatly facilitates development of efficient and platform portable code. We note, however, that the matrix sizes encountered in PyFR are not necessarily optimal from a GEMM perspective. Specifically, GEMM is optimised for the multiplication of large square matrices, whereas the constant operator matrixes in PyFR are 'small and square' with 10-100 rows/columns, and the state matrices are 'short and fat' with 10-100 rows and 10 000-100 000 columns. Moreover, we note that the constant operator matrices are know a priori, and do not change in time. This a priori knowledge could, in theory, be leveraged to design bespoke matrix multiply kernels that are more efficient than GEMM. Development of such bespoke kernels will be a topic of future research -with results easily integrated into PyFR as an optional replacement for GEMM.

Point-Wise Kernels
Point-wise kernels are specified using a domain specific language implemented in PyFR atop of the Mako templating engine [17]. The templated kernels are then interpreted at runtime, converted to low-level code, compiled, linked and loaded. Currently the templating engine can generate C/OpenMP to target CPUs, and CUDA (via the PyCUDA wrapper [18]) to target Nvidia GPUs. Use of a domain specific language avoids implementation of each point-wise kernel for each target platform; keeping the codebase compact and platform portable. Runtime code generation also means it is possible to instruct the compiler to emit binaries which are optimised for the current hardware architecture. Such optimisations can result in anything up to a fourfold improvement in performance when compared with architectural defaults.
As an example of a point-wise kernel we consider the evaluation of the right hand side of Equation 19, which reads −J −1 (u) eρn (∇ ·f) (u) eρnα . The operation consists of a point-wise multiplication between the negative reciprocal of the Jacobian and the transformed divergence of the flux at each solution point. Figure 3 shows how such a kernel can be expressed in the domain specific language of PyFR. There are several points of note. Firstly, the kernel is purely scalar in nature. This is by design; in PyFR point-wise kernels need only prescribe the point-wise operation to be applied. Important choices such as how to vectorise a given operation or how to gather data from memory are all delegated to templating engine. Secondly, we note it is possible to utilise Python when generating the main body of kernels. This capability is showcased on lines four, five and six where it is used to unroll a for loop over each of the field variables. Finally, we also highlight the use of an abstract data type fpdtype_t for floating point variables which permits a single set of kernels to be used for both single and double precision operation. Generated CUDA source  for this kernel can be seen in Figure 4, and the equivalent C kernel can be found in Figure 5.

Distributed Memory Parallelism and Scheduling
PyFR is capable of operating on heigh performance computing clusters utilising distributed memory parallelism. This is accomplished through the Message Passing Interface (MPI). All MPI functionality is implemented at the Python level through the mpi4py [19] wrapper. To enhance the scalability of the code care has been taken to ensure that all requests are persistent, point-to-point and non-blocking. Further, the format of data that is shared between ranks has been made backend independent. It is therefore possible to deploy PyFR on heterogeneous clusters consisting of both conventional CPUs and accelerators.
The arrangement of kernel calls required to solve an advection-diffusion problem can be seen in Figure 6. Our primary objective when scheduling kernels was to maximise the potential for overlapping communication with computation. In order to help achieve this the common interface solution, C α , and common interface flux, F α , kernels have been broken apart into two separate kernels; suffixed in the figure by int and mpi. PyFR is therefore able to perform a significant degree of rank-local computation while the relevant ghost states are being exchanged.
Our secondary objective when scheduling kernels was to minimise the amount of temporary storage required during the evaluation of −∇ · f. Such optimisations are critical within the context of accelerators which often have an order of magnitude less memory than a contemporary platform. In order to help achieve this U (u) ,R (u) , and R (u) are allowed to alias. By permitting the same storage location to be used for both the inputted solution and the outputted flux divergence it is possible to reduce the storage requirements of the RK schemes. Another opportunity for memory reuse is in the transformed flux function where the incoming gradients, Q (u) , can be overwritten with the transformed flux,F (u) . A similar approach can be used in the common interface flux function whereby U ( f ) can updated in-place with the entires ofD ( f ) which holds the transformed common normal flux. Moreover, C ( f ) is also able to utilise the same storage as the somewhat larger Q ( f ) array. These optimisations allow PyFR to process over 100 000 curved, unstructured, hexahedral

Overview
PyFR is a framework for solving various advection-diffusion type problems. In the current release of PyFR two specific governing systems can be solved, specifically the Euler equations for inviscid compressible flow, and the compressible Navier-Stokes equations for viscous compressible flow. Details regarding both are given below.

Euler Equations
Using the framework introduced in section 2 the three dimensional Euler equations can be expressed in conservative form as with u and f together satisfying Equation 1. In the above ρ is the mass density of the fluid, v = (v x , v y , v z ) T is the fluid velocity vector, E is the total energy per unit volume and p is the pressure. For a perfect gas the pressure and total energy can be related by the ideal gas law with γ = C p /C v . With the fluxes specified all that remains is to prescribe a method for computing the common normal flux, F α , at interfaces as defined in Equation 15. This can be accomplished using an approximate Riemann solver for the Euler equations. There exist a variety of such solvers as detailed in [20]. A description of those implemented in PyFR can be found in Appendix B.

Compressible Navier-Stokes Equations
The compressible Navier-Stokes equations can be viewed as an extension of the Euler equations via the inclusion of viscous terms. Within the framework outlined above the flux now takes the form of In the above we have defined ∆ = µC p /P r where µ is the dynamic viscosity and P r is the Prandtl number. The components of the stress-energy tensor are given by Using the ideal gas law the temperature can be expressed as with partial derivatives thereof being given according to the quotient rule.
Since the Navier-Stokes equations are an advection-diffusion type system it is necessary to both compute a common solution (C α of Equation 7) at element boundaries and augment the inviscid Riemann solver to handle the viscous part of the flux. A popular approach is the LDG method as presented in [5,13]. In this approach the common solution is given ∀α according to where β controls the degree of upwinding/downwinding. The common normal interface flux is then prescribed, once again ∀α, according to where F (inv) is a suitable inviscid Riemann solver (see Appendix B) and with τ being a penalty parameter, f (vis) . We observe here that if the common solution is upwinded then the common normal flux will be downwinded. Generally, β = ±1/2 as this results in the numerical scheme having a compact stencil and 0 ≤ τ ≤ 1.

Presentation in Two Dimensions
The above prescriptions of the Euler and Navier-Stokes equations are valid for the case of N D = 3. The two dimensional formulation can be recovered by deleting the fourth rows in the definitions of u, f (inv) and f (vis) along with the third columns of f (inv) and f (vis) . Vectors are now two dimensional with the velocity being given by v = (v x , v y ) T .

Euler Equations: Euler Vortex Super Accuracy
Various authors [4,10] have shown FR schemes exhibit so-called 'super accuracy' (an order of accuracy greater than the expected ℘ + 1). To confirm PyFR can achieve super accuracy for the Euler equations a square domain Ω = [−20, 20] 2 was decomposed into four structured quad meshes with spacings of h = 1/3, h = 2/7, h = 1/4, and h = 2/9. Initial conditions were taken to be those of an isentropic Euler vortex in a free-stream where f = (1 − x 2 − y 2 )/2R 2 , S = 13.5 is the strength of the vortex, M = 0.4 is the free-stream Mach number, and R = 1.5 is the radius. All meshes were configured with periodic boundary conditions along boundaries of constant x. Along boundaries of constant y the dynamical variables were fixed according to which are simply the limiting values of the initial conditions. Strictly speaking these conditions, on account of the periodicity, result in the modelling of an infinite array of coupled vortices. The impact of this is mitigated by the observation that the exponentially decaying vortex has a characteristic radius which is far smaller than the extent of the domain. Neglecting these effects the analytic solution of the system is a time t is simply a translation of the initial conditions. Using the analytical solution we can define an L 2 error as where ρ δ (x, t) is the numerical mass density, ρ(x, t = 0) the analytic mass density, and ∆ y (t) is the ordinate corresponding to the centre of the vortex at a time t and accounts for the fact that the vortex is translating in a free stream velocity of unity in the y direction. Restricting the region of consideration to a small box centred around the origin serves to further mitigate against the effects of vortices coupling together. The initial mass density along with the [−2, −2] × [2, 2] region used to evaluate the error can be seen in Figure 7. At times, t c , when the vortex is centred on the box the error can be readily computed by integrating over each element inside the box and summing the residuals where, ρ δ i (x, t c ) is the approximate mass density inside of the ith element, and J i (x) the associated Jacobian. These integrals can be approximated by applying Gaussian quadrature where {x j } are abscissa and { ω j } the weights of a rule determined for integration inside of a standard quadrilateral. So long as the rule used is of a suitable strength then this will be a very good approximation of the true L 2 error. Following [10] the initial conditions were laid onto the mesh using a collocation projection with ℘ = 3. The simulation was then run with three different flux reconstruction schemes: DG, SD, and HU as defined in [10]. Solution points were placed at a tensor product construction of Gauss-Legendre quadrature points. Common interface fluxes were computed using a Rusanov Riemann solver. To advance the solutions in time a classical fourth order Runge-Kutta method (RK4) was used. The time step was taken to be ∆t = 0.00125 with t = 0..1800 with solutions written out to disk every 32 000 steps. The order of accuracy of the scheme at a particular time can be determined by plotting log σ against log h and performing a least-squares fit through the four data points. The order is given by the gradient of the fit. A plot of order of accuracy against time for the three schemes can be seen in Figure 8. We note that the order of accuracy changes as a function of  Figure 8. Spatial super accuracy observed for a ℘ = 3 simulation using DG, SD and HU as defined in [10].
time. This is due to the fact that the error is actually of the form σ(t) = σ p + σ so (t) where σ p is a constant projection error and σ so is a time-dependent spatial operator error. The projection error arises as a consequence of the forth order collocation projection of the initial conditions onto the mesh. Over time the spatial operator error grows in magnitude and eventually dominates. Only when σ so (t) σ p can the true order of the method be observed. The results here can be seen to be in excellent agreement with those of [10].

Compressible Navier-Stokes Equations: Couette Flow
Consider the case in which two parallel plates of infinite extent are separated by a distance H in the y direction. We treat both plates as isothermal walls at a temperature T w and permit the top plate to move at a velocity v w in the x direction with respect to the bottom plate. For simplicity we shall take the ordinate of the bottom plate as zero. In the case of a constant viscosity µ the Navier-Stokes equations admit an analytical solution in which where φ = y/H and p c is a constant pressure. The total energy is given by the ideal gas law of Equation 21. On a finite domain the Couette flow problem can be modelled through the imposition of periodic boundary conditions. For a two dimensional mesh periodicity is enforced in x whereas for three dimensional meshes it is enforced in both x and z. To validate the Navier-Stokes solver in PyFR we take γ = 1.4, P r = 0.72, µ = 0.417, C p = 1005 J K −1 , H = 1 m, T w = 300 K, p c = 1 × 10 5 Pa, and v w = 69.445 m s −1 . These values correspond to a Mach number of 0.2 and a Reynolds number of 200. The plates were modelled as no-slip isothermal walls as detailed in subsection C.4 of Appendix C. A plot of the resulting energy profile can be seen in Figure 9. Constant initial conditions are taken as ρ = ρ(φ) , v = v wx , and p = p c . Using the analytical solution we again define an L 2 error as where Ω is the computational domain, E δ (x, t) is the numerical total energy, and E(x) the analytic total energy. In the third step we have approximated each integral by a quadrature rule with abscissa {x e j } and weights { ω e j } inside of an element type e. Couette flow is a steady state problem and so in the limit of t → ∞ the numerical total energy should converge to a solution. Starting from a constant initial condition the L 2 error was computed every 0.1 time units. The simulation was said to have converged when σ(t)/σ(t + 0.1) ≤ 1.01 where σ is the L 2 error. We will denote the time at which this occurs by t ∞ .
Once the system has converged for a range of meshes it is possible to compute the order of accuracy of the scheme. For a given ℘ this is the slope (plus or minus a standard error) of a linear least squares fit of log h ∼ log σ(t ∞ ) where h is an approximation of the characteristic grid spacing. The expected order of accuracy is ℘ + 1. In all simulations inviscid fluxes were computed using the Rusanov approach and the LDG parameters were taken to be β = 1/2 and τ = 0.1. All simulations were performed with DG correction functions and at double precision. Inside tensor product elements Gauss-Legendre solution and flux points were employed. Triangular elements utilised Williams-Shunn solution points and Gauss-Legendre flux points.
Two dimensional unstructured mixed mesh. For the two dimensional test cases the computational domain was taken to be [−1, 1] × [0, 1]. This domain was then meshed with both triangles and quadrilaterals at four different refinement levels. The Couette flow problem described above was then solved on each of these meshes. Experimental L 2 errors and orders of accuracy can be seen in Table 2. We note that in all cases the expected order of accuracy was obtained.  where N E is the total number of elements in the mesh. Order 2.21 ± 0.12 2.99 ± 0.32 3.97 ± 0.05 5.20 ± 0.38 Table 3. L 2 energy errors and orders of accuracy for the Couette flow problem on three extruded hexahedral meshes. On account of the extrusion h ∼ N −1/2 E where N E is the total number of elements in the mesh. Order 2.06 ± 0.08 2.87 ± 0.24 3.99 ± 0.03 Figure 11. Cutaway of the unstructured hexahedral mesh with 1004 elements.
a series of hexahedral meshes. Experimental L 2 errors and orders of accuracy for these meshes can be seen in Table 3.
Three dimensional unstructured hexahedral mesh. As a further test a domain of dimension [0, 1] 3 was considered. This domain was meshed using completely unstructured hexahedra. Three levels of refinement were used resulting in meshes with 96, 536 and 1004 elements. A cutaway of the most refined mesh can be seen in Figure 11. Experimental L 2 errors and the resulting orders of accuracy are presented in Table 4. Despite the fully unstructured nature of the mesh the expected order of accuracy was again obtained in all cases. We do, however, note the higher standard errors associated with these results. Table 4. L 2 energy errors and orders of accuracy for the Couette flow problem on three unstructured hexahedral meshes. Mesh spacing was taken as h ∼ N −1/3 E where N E is the total number of elements in the mesh. . This domain was then meshed in the x-y plane with 4661 quadratically curved quadrilateral elements. Next, this grid was extruded along the z-axis to yield a total of 46610 hexahedra. The grid, which can be seen in Figure 12, was partitioned into four pieces. Along surfaces of y = ±10 and x = −18 the inflow boundary condition of subsection C.2 in Appendix C was imposed. Along the surface of x = 30 the outflow condition of subsection C.3 in Appendix C was used. Periodic conditions were imposed in the z direction. On the surface of the cylinder the no-slip isothermal wall condition of subsection C.4 in Appendix C was imposed. The free-stream conditions were taken to be ρ = 1, v =x, and p = 1/γM 2 . These were also used as the initial conditions for the simulation. DG correction functions were used with the LDG parameters being β = 1/2 and τ = 0.1. The ratio of specific heats was taken as γ = 1.4 and the Prandtl number as P r = 0.72.
The simulation was run with ℘ = 4 with four NVIDIA K20c GPUs. It contained some 29×10 6 degrees of freedom. Isosurfaces of density captured after the turbulent wake had fully developed can be seen in Figure 13.

Single Node Performance
The single node performance of PyFR has been evaluated on an NVIDIA M2090 GPU. This accelerator has a theoretical peak double precision floating point performance of 665 GFLOP/s, and when ECC is disabled the theoretical peak memory bandwidth is 177 GB/s. As points of reference we observe that cuBLAS (CUDA 5.5) is able to obtain 407 GFLOP/s when multiplying a pair of 4096 × 4096 matrices on this hardware, and the maximum device bandwidth obtainable by the bandwidth test application included with the CUDA SDK is 138.9 GiB/s when ECC is disabled.   We shall refer to these values as realisable peaks.
To conduct the evaluation a fully periodic cuboidal domain was meshed with 50 176 hexahedral elements. The double precision Navier-Stokes solver of PyFR was then run on this mesh at orders ℘ = 2, 3, 4 with β = 1/2. In conducting the analysis kernels were grouped into one of three categories: matrix multiplications (DGEMM), point-wise kernels with direct memory access patterns (PD) and pointwise kernels with some level of indirect memory access (PI). Indirection arises in the computation of C α in Equation 7 and F α in Equation 15 and occurs as a consequence of the unstructured nature of PyFR. The resulting breakdowns of wall-clock time, memory bandwidth and floating point operations can be seen in Table 5. It is clear that he majority of floating point operations are concentrated inside the calls to DGEMM with the point-wise operations are heavily memory bandwidth bound. Of this bandwidth some ∼15% was ascribed to register spillage above and beyond that which can be absorbed by the L1 cache.
The high fraction of peak bandwidth obtained by the indirect kernels can be attributed to three factors. Firstly, the constant data required for calculations at ????, such asn eσn , is ordered to ensure direct (coalesced) access. Secondly, at start-up PyFR attempts to determine an iteration ordering over the various flux-point pairs that will minimise the number of cache misses.
Many of the memory accesses are therefore are near-coalesced. Thirdly and finally we highlight the impressive latency-hiding capabilities of the CUDA programming model. In line with expectations the proportion of time spent performing matrix-matrix multiplications increases as a function of order. When going from ℘ = 2 to ℘ = 3 a significant portion of the additional compute is offset by the improved performance of cuBLAS. However, when ℘ = 4 the performance of these kernels in absolute terms can be seen to regress slightly. This contributes to the greatly increased fraction of wall-clock time spent inside of these kernels. Nevertheless, the achieved rate of 305.4GFLOP/s is still over 75% of the realisable peak. Also in line with expectations is the invariance of the arithmetic performance of the point-wise kernels with respect to order. As the order is varied all that changes is the number of points to be processed with the operation itself remaining identical.

Scalability
The scalability of PyFR has been evaluated on the Emerald GPU cluster. It is housed at the STFC Rutherford Appleton Laboratory and based around 60 HP SL390 nodes with three NVIDIA M2090 GPUs and 24 HP SL390 nodes with eight NVIDIA M2090 GPUs. Nodes are connected by QDR InfiniBand.
For simplicity all runs herein were performed on the eight GPU nodes. As a starting point a domain of dimension [−16, 16] × [−16, 16] × [0, 1.75] was meshed isotropically with N E = 114 688 structured hexahedral elements. The mesh was configured with completely periodic boundary conditions. When run with the Navier-Stokes solver in PyFR with ℘ = 3 the mesh gives a working set of ∼4720 MiB. This is sufficient to 90% load an M2090 which when ECC is enabled has ∼5250 MiB memory available to the user. When examining the scalability of a code there are two commonly used metrics. The first of these is weak scalability in which the size of the target problem is increased in proportion to the number of ranks N with N E ∝ N. For a code with perfect weak scalability the runtime should remain unchanged as more ranks are added. The second metric is strong scalability wherein the problem size is fixed and the speedup compared to a single rank is assessed. Perfect strong scalability implies that the runtime scales as 1/N.
For the domain outlined above weak scalability was evaluated by increasing the dimensions of the domain according to [−16, 16] × N[−16, 16] × [0, 1.75]. This extension permitted the domain to be trivially decomposed along the y-axis. The resulting runtimes for 1 ≤ N ≤ 104 can be seen in Table 6. We note that in the N = 104 case that the simulation consisted of some 3.8 × 10 9 degrees of freedom with a working set of ∼485 GiB. To study the strong scalability the initial domain was partitioned along the xand y-axes. Each partition contained exactly N E /Ns. The resulting speedups for 1 ≤ N ≤ 32 can be seen in Table 7. Up to eight GPUs scalability can be seen to be near perfect. Beyond this the relationship begins to break down. When N = 32 an improvement of 26 can be observed. However, in this case each GPU is loaded to less than 3% and so the result is to be expected.

Conclusions
In this paper we have described PyFR, an open source Python based framework for solving advection-diffusion type problems on streaming architectures. The structure and ethos of PyFR has been explained including our methodology for targeting multiple hardware platforms. We have shown that PyFR exhibits spatial super accuracy when solving the 2D Euler equations and the expected order of accuracy when solving Couette flow problem on a range of grids in 2D and 3D. Qualitative results for unsteady 3D viscous flow problems on curved grids have also been presented. Performance of PyFR has been validated on an NVIDIA M2090 GPU in three dimensions. It has been shown that the compute bound kernels are able to obtain between 50% and 90% of realisable peak FLOP/s whereas the bandwidth bound point-wise kernels are, across the board, able to obtain in excess of 89% realisable peak bandwidth. The scalability of PyFR has been demonstrated in the strong sense up to 32 NVIDIA M2090s and in the weak sense up to 104 NVIDIA M2090s when solving the 3D Navier-Stokes equations.

A Matrix Representation
It is possible to cast the majority of operations in an FR step as matrix-matrix multiplications of the form where c 1,2 ∈ R are constants, A is a constant operator matrix, and B and C are state matrices. To accomplish this we start by introducing the following constant operator matrix e , and the following state matrices In specifying the state matrices there is a degree of freedom associated with how the N V field variables for each element are packed along a row of the matrix, with the possible packing choices being discussed in subsection 3.3. Using these matrices we are able to reformulate Equation 6 as In order to apply a similar procedure to Equation 9 we let where we note the block diagonal structure of M 5 e . This is a direct consequence of the above choices for ∆i. Finally, to rewrite Equation 18 we write In the following section we take u L and u R to be the two discontinuous solution states at an interface andn L to be the normal vector associated with the first state.

B.2 Rusanov
Also known as the local Lax-Friedrichs method a Rusanov type Riemann solver imposes inviscid numerical interface fluxes according to where s is an estimate of the maximum wave speed C Boundary Conditions

C.1 Overview
To incorporate boundary conditions into the FR approach we introduce a set of boundary interface types b ∈ B. At a boundary interface there is only a single flux point: that which belongs to the element whose edge/face is on the boundary. Associated with each boundary type are a pair of functions C (b) α (u L ) and F (b) α (u L , q L ,n L ) where u L , q L , andn L are the solution, solution gradient and unit normals at the relevant flux point. These functions prescribe the common solutions and normal fluxes, respectively.
Instead of directly imposing solutions and normal fluxes it is oftentimes more convenient for a boundary to instead provide ghost states. In its simplest formulation where B (b) u L is the ghost solution state and B (b) q L is the ghost solution gradient. It is straightforward to extend this prescription to allow for the provisioning of different ghost solution states for C α and F α and to permit B (b) q L to be a function of u L in addition to q L .

C.2 Supersonic Inflow
The supersonic inflow condition is parameterised by a free-stream density ρ f , velocity v f , and pressure p f .

C.3 Subsonic Outflow
Subsonic outflow boundaries are parameterised by a free-stream pressure p f .

C.4 No-slip Isothermal Wall
The no-slip isothermal wall condition depends on the wall temperature C p T w and the wall velocity v w . Usually v w = 0.