Optimizing the performance of the spectral/hp element method with collective linear algebra operations



Introduction
The spectral/hp element method is now a well-established high-order finite element approach that is gaining popularity within both academia and industry. Traditionally, this method has been confined to the field of computational fluid dynamics [13], with recent advances demonstrating the applicability of the method to increasingly challenging problems within the aeronautics and automotive communities [16]. This has now led to increased interest in the spectral/hp method across a broad range of application areas, including cardiac electrophysiology [6], solid mechanics [17,18], porous media [7] and oceanographic modelling [10]. This can be attributed to the attractive features of the method: it combines the convenient geometric flexibility found in linear finite element methods with spectral-like exponential convergence in the polynomial order p. Software such as Nektar++ [3], an open-source framework designed to provide an easy-to-use, efficient and modern environment for the development of high-order finite element methods, has also helped to reduce the difficulty of using these methods and has exposed them to a wider audience.
From a computational perspective, recent evidence suggests that high-order methods are well-placed to take advantage of modern multi-core [28] and many-core [26,15] computing hardware. For large-scale simulations, which are becoming increasingly commonplace in scientific studies, nodes containing many-core processors form the backbone of current-day high-performance computing (HPC) infrastructure. The use of spectral/hp element methods is now routine on between O(10^4) and O(10^5) cores in tackling challenging aeronautical flow simulations on large unstructured grids [16], with other codes such as Nek5000 scaling to O(10^6) cores through the use of tensor-product hexahedra [14]. Techniques for optimising the performance of these methods on a single compute node are therefore highly important in the context of reducing execution times.
A significant proportion of the runtime, both in the spectral/hp element method and in lower-order methods, is usually spent in the evaluation of a discretised mathematical operator across the mesh of the domain, which takes the form of a matrix-vector multiplication. In the classic linear finite element method, this matrix is usually extremely sparse, leading to the use of an appropriate sparse-matrix format. On the other hand, matrices that represent discretised high-order operators on a single element possess far more structure, particularly at higher polynomial orders. This leads to a wider range of implementation strategies [24,5,4], in which the elemental matrices need not be assembled to construct the global matrix.
In the context of modern hardware, this distinction of approaches is very important. On a typical node, memory bandwidth is rapidly becoming the deciding factor in attaining peak CPU performance. With core counts increasing and memory hierarchies becoming more complex, data must be managed efficiently, through the use of cache-friendly algorithms, to attain the peak performance of the hardware. Methods that minimize indirection and cache misses can take advantage of this environment, leading to increased efficiency and shorter runtimes. High-order methods are therefore well-suited to this new landscape since they can utilise dense operations that are typically far more memory-compact than sparse operations. Whilst some efforts have been made to develop auto-tuning and automatic compilation to maximize floating point operation counts [22] on different hardware types, thus far the effects of compacting data and minimizing data transfer have not been thoroughly investigated in the context of high-order methods.
The aim of this paper is therefore to bridge this gap in current understanding, by utilising the mathematical formulation of the spectral/hp element method in order to improve memory efficiency. We present a methodology based on the amalgamation of the action of multiple elemental operators into larger matrix-vector and matrix-matrix multiplications. In particular, we show how the mathematical construction of the method admits several different amalgamation schemes, some (but not all) of which are similar to those described in [24,5,4], but formulated in a multi-element setting. We demonstrate how the use of these schemes can significantly increase the performance of the method, thereby decreasing the runtime of solvers that are based on this technology. We consider the effects of these schemes on distributed memory hardware architectures, focusing on the performance effects on a single node versus the communication costs between nodes, which remain the same regardless of the scheme chosen. The choice of the optimal scheme depends not only on the polynomial order, but on many other factors such as the underlying hardware, the choice of BLAS implementation and the problem to be considered. We therefore present an auto-tuning method designed to rapidly and automatically select the most efficient scheme at runtime on a per-operator basis, overcoming a significant barrier to the usage of these techniques for users. Overall, we show that whilst amalgamating these operators inevitably leads to a limited increase in the memory footprint in certain cases, significant performance gains can be achieved due to the increase in data transfer efficiency.

Existing work, novelty and applicability
In the context of finite elements, the idea of grouping together elements of identical structure in order to act on them 'collectively' has been examined previously under the nomenclature of 'worksets', as seen for example in [23] and [20]. With the advent of GPU-based methods, similar techniques have been adopted to align with the kernel-based nature of the resulting implementations [26]. However, we emphasise that the purpose and novelty of this paper is not in the presentation of the amalgamation approach itself, but rather in how the mathematical formulation of the spectral/hp element method admits schemes of different implementation (but the same mathematical operation) under the action of amalgamation. We will demonstrate that, across a range of polynomial orders and element types, the performance of the method can differ drastically. Depending on the amalgamation scheme chosen, this difference can be as much as an order of magnitude in terms of the evaluation time. This work therefore notably extends the body of knowledge relating to multi-element operations and their implementation in the high-order context, and is relevant to a number of widely used high-order variants, including element-based continuous Galerkin, discontinuous Galerkin and flux reconstruction. However, we do not encompass the global operators which were the focus of previous work.
It is also important to note that the underlying concept of amalgamation of elemental operators has performance benefits for the finite element formulation, be that in the context of either low- or high-order methods. Since the operators we consider lie at the very heart of the finite element formulation, such as taking derivatives and inner products, the methodology we present is applicable to a wide variety of finite element software packages. In terms of high-order methods, which are becoming increasingly prevalent throughout the scientific community, amalgamation schemes are certainly not limited to Nektar++. Many other high-order codes are based around the same types of operator evaluations, and so can benefit from the results we present here. However, depending on the precise nature of the formulation being used, not all of the schemes we present may be applicable. We discuss this further in the conclusions section.

Summary
The paper is structured as follows. In the next section, we give a brief overview of the mathematical formulation of the spectral/hp element method and describe the amalgamation schemes. In Section 3, we present initial results demonstrating how these collections can be used to improve performance when applied to a complex three-dimensional mesh. We additionally show the effects of some parameter changes, such as the BLAS implementation, highlighting the need for the auto-tuning strategy that is presented in Section 4. We investigate further detailed hardware metrics, such as FLOPS and memory bandwidth attained by the schemes, in Section 5. Finally, in Section 6 we apply all of these techniques to a real-world large-scale simulation of compressible flow over an ONERA M6 wing, demonstrating the practical reduction in runtime that can be achieved in using the amalgamation techniques. We give some brief conclusions in Section 7.

Collective operations
In this section we outline the operations that will be amalgamated and highlight their role in the underlying method. The starting point for this is a brief overview of the formulation of the spectral/hp element method, concentrating on the discretisation of a single element in three dimensions. A more detailed derivation can be found in [13]. We then outline our strategies for amalgamation.

Spectral/hp formulation
As with any other finite element method, the spectral/hp element method begins with the tessellation of a computational domain into a mesh $\Omega$ consisting of $N$ non-overlapping elements $\Omega^e$ such that $\Omega = \bigcup_{e=1}^{N} \Omega^e$. The first step of the formulation is to consider the expansion of a function $u$ on a standard element $\Omega_{\mathrm{st}}$, on which the basis functions and basic operations such as integration and differentiation are defined. In three dimensions we consider either hexahedral, prismatic or tetrahedral elements, whose standard regions are defined on the reference coordinates $\xi = (\xi_1, \xi_2, \xi_3) \in [-1,1]^3$ (with the additional constraints $\xi_1 + \xi_3 \le 0$ for the prism and $\xi_1 + \xi_2 + \xi_3 \le -1$ for the tetrahedron). A scalar function $u$ is represented on a standard region through a summation of polynomial basis functions $\phi_i$ as

$$u(\xi) = \sum_{i \in I} \hat{u}_i\, \phi_i(\xi), \qquad \xi \in \Omega_{\mathrm{st}}, \qquad (1)$$

where $I$ is an indexing set that depends on element type and possibly heterogeneous polynomial orders $P_i$ in each coordinate direction. Each element therefore has a number of elemental degrees of freedom $N_{\mathrm{dof}} = |I|$. The key part of the spectral/hp element method is that, unlike other high-order finite element methods, the basis on each element is expressed as a tensorial expansion of one-dimensional functions, even for elements such as the prism and tetrahedron. In the hexahedral element this is a straightforward tensor product, leading to a decomposition of the form $\phi_i(\xi) = \phi_{pqr}(\xi_1,\xi_2,\xi_3) = \phi_p(\xi_1)\,\phi_q(\xi_2)\,\phi_r(\xi_3)$. A function on the standard hexahedron is therefore expressed through the summation

$$u(\xi_1,\xi_2,\xi_3) = \sum_{p=0}^{P_1} \sum_{q=0}^{P_2} \sum_{r=0}^{P_3} \hat{u}_{pqr}\, \phi_p(\xi_1)\, \phi_q(\xi_2)\, \phi_r(\xi_3). \qquad (2)$$

[Figure 1: The Duffy transformation from the standard quadrilateral region with collapsed coordinates $(\eta_1, \eta_2)$ to the triangular element $\Omega_{\mathrm{st}}^{\mathrm{tri}}$ with coordinates $(\xi_1, \xi_2)$. We also show how $\Omega_{\mathrm{st}}^{\mathrm{tri}}$ is taken to a general curvilinear element through the isoparametric mapping $\chi^e$. Interior lines demonstrate how quadrature points, taken to be equally spaced in this example, are transformed under each mapping.]
In tetrahedral and prismatic elements, we use a collapsed coordinate system $(\eta_1, \eta_2, \eta_3) \in [-1,1]^3$, obtained through one or more applications of the square-to-triangle Duffy transformation [9], to obtain a similar tensorial decomposition. Figure 1 demonstrates the Duffy transformation in a two-dimensional setting through the mapping

$$\xi_1 = \frac{(1+\eta_1)(1-\eta_2)}{2} - 1, \qquad \xi_2 = \eta_2.$$

Consequently, the summation over each mode $\phi_i$, as a function of $p$, $q$ and $r$, then becomes more complex, as indicated by the indexing sets above. For example, a prismatic element has the general expansion

$$u(\xi) = \sum_{p=0}^{P_1} \sum_{q=0}^{P_2} \sum_{r=0}^{P_3 - p} \hat{u}_{pqr}\, \phi^a_p(\eta_1)\, \phi^a_q(\eta_2)\, \phi^b_{pr}(\eta_3),$$

where $\eta_i \in [-1,1]$ is the collapsed coordinate and, additionally, $\phi^b$ depends on both $p$ and $r$. We note that although the number of modes has been reduced, we still maintain a complete polynomial space on $\Omega_{\mathrm{st}}$. The basis functions $\phi^a$ and $\phi^b$ are the tensor-product modified hierarchical basis functions defined in [13], which combine a set of orthogonal Jacobi polynomials with linear finite element basis functions to achieve a separation of boundary and interior modes.
We must additionally discuss how a given PDE will be discretised. For the problems we consider here, the continuous or discontinuous Galerkin methods will be used, where the PDE is transformed into a weak form involving the $L^2$ inner product of functions lying in polynomial test and trial spaces. We therefore equip each element type with a distribution of $N_Q$ quadrature points $(\xi_{1j}, \xi_{2j}, \xi_{3j})$ and weights $w_j$ to numerically calculate the resulting integrals. In the hexahedron, the tensor-product expansion means our choice of quadrature is straightforward: we select $Q_i$ Gauss-Lobatto-Legendre points, with $Q_i = P_i + 2$ being large enough to exactly evaluate the integral of polynomial products in the $i$-th coordinate direction for straight-sided elements. Owing to the use of the Duffy transformation in the other elements, a singularity arises at the collapsed vertex of triangles. We therefore select $Q_i = P_i + 1$ Gauss-Radau points in these directions to avoid this issue. However, we note that for all element types, a tensor product of quadrature points is used, so that $N_Q = Q_1 Q_2 Q_3$, as can be seen in Figure 1.
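To make these counts concrete, the following worked example uses a straight-sided hexahedron with a uniform polynomial order $P_i = 4$; this particular choice of order is purely illustrative and not taken from the paper.

```latex
% Worked example (illustrative choice of uniform order P_i = 4 on a hexahedron):
% Q_i = P_i + 2 Gauss-Lobatto-Legendre points are taken in each direction.
\[
  Q_i = P_i + 2 = 6, \qquad N_Q = Q_1 Q_2 Q_3 = 6^3 = 216,
\]
\[
  N_{\mathrm{dof}} = (P_1 + 1)(P_2 + 1)(P_3 + 1) = 5^3 = 125 .
\]
```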
Finally, the standard region must be mapped to a general element $\Omega^e \subset \Omega$. The reference element coordinates $\xi \in \Omega_{\mathrm{st}}$ are mapped to the Cartesian coordinates $x = (x_1, x_2, x_3)$ through a mapping $x = \chi^e(\xi)$. Each component $\chi^e_i$ is isoparametric, so that it is written as an expansion (1) in terms of the modified basis functions $\phi_i$. We may then determine geometric factors such as the Jacobian $J_\xi$ and its determinant, which are used to calculate, for example, integrals and derivatives on $\Omega^e$ through an application of the chain rule. In general, elements may be straight-sided, in which case $J_\xi$ is constant, or curvilinear, so that $J_\xi$ is a function of $\xi$. Figure 1 demonstrates a mapping from $\Omega_{\mathrm{st}}$ into a curvilinear element.

Operator evaluation strategies
In previous work in two [24] and three dimensions [5,4], various evaluation strategies were devised for key discretised mathematical operators. To illustrate this, we examine the most straightforward operation: the backward transformation as shown in Equation (2), which recovers the physical solution $u$ given the coefficients $\hat{u}_{pqr}$. In the case of a nodal basis, where $\phi_p$ is equipped with a set of (possibly non-tensorial) quadrature points $\{\zeta_q\}$ and obeys the Lagrange interpolant property that $\phi_p(\zeta_q) = \delta_{pq}$, this is equivalent to an interpolation operator. For simplicity we consider the hexahedral element type.
A naive implementation of this triple summation at each of the quadrature points leads to an evaluation involving $O(P^6)$ floating point operations, where we assume that $P_i$ and $Q_i$ are all of the same order $P$. In the local evaluation strategy, we reformulate the summation as a matrix-vector product $B\hat{u}$, where $B = [\phi_j(\xi_i)]_{ij}$ is an $N_Q \times N_{\mathrm{dof}}$ matrix that stores the evaluation of the basis functions at each quadrature point and $\hat{u}$ is a vector of the coefficients $\hat{u}_{pqr}$, ordered so that $p$ runs fastest, followed by $q$ and then $r$. This permits the use of an efficient linear algebra library to evaluate the matrix-vector product, potentially increasing performance over a naive implementation, as these packages tend to be highly optimized over a range of computational hardware.
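As a minimal sketch of this local evaluation strategy, the product $B\hat{u}$ for a single element can be written as a pair of loops; the function name, row-major layout and container choices below are illustrative assumptions and do not correspond to the Nektar++ implementation.

```cpp
#include <cstddef>
#include <vector>

// Local (per-element) evaluation strategy: u = B * uhat, where B is the
// (nQuad x nDof) matrix of basis functions evaluated at the quadrature points.
std::vector<double> BwdTransLocal(const std::vector<double>& Bmat,  // nQuad*nDof, row-major
                                  const std::vector<double>& uhat,  // nDof coefficients
                                  std::size_t nQuad, std::size_t nDof)
{
    std::vector<double> u(nQuad, 0.0);
    for (std::size_t i = 0; i < nQuad; ++i)        // loop over quadrature points
        for (std::size_t j = 0; j < nDof; ++j)     // loop over expansion modes
            u[i] += Bmat[i * nDof + j] * uhat[j];
    return u;
}
```

In practice the inner loops would be delegated to an optimised BLAS routine such as dgemv or dgemm rather than hand-written loops.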
An alternative approach is to apply the technique of sum-factorisation [19]. In this evaluation strategy, the tensor-product expansion of the basis functions $\phi_i$ is leveraged to decrease the number of operations and thus the order of complexity of the evaluation. In the hexahedron, this is achieved by rewriting (2) as

$$u(\xi_{1i}, \xi_{2j}, \xi_{3k}) = \sum_{p=0}^{P_1} \phi_p(\xi_{1i}) \left[ \sum_{q=0}^{P_2} \phi_q(\xi_{2j}) \left[ \sum_{r=0}^{P_3} \phi_r(\xi_{3k})\, \hat{u}_{pqr} \right] \right]. \qquad (4)$$

We see that if the bracketed summations are stored in memory, the application of sum-factorisation leads to a reduction to $O(P^4)$ floating point operations. A similar technique can be used for the tetrahedron and prism, although it is typically less efficient than the hexahedron due to the inter-dependency of the $p$, $q$ and $r$ indices. With a little more work, we can again use linear algebra packages by rewriting the summation as a series of matrix-matrix operations, in which the one-dimensional basis matrices $B_1$, $B_2$ and $B_3$ are applied in turn to reshaped views of the data, where $\hat{U}_{[P_1]}$, for example, denotes the reinterpretation of the vector $\hat{u}$ as a $P_1 \times P_2 P_3$ matrix stored in column-major format. Temporary storage is required to hold the intermediate result of each step. Whilst intuition may point towards sum-factorisation being the quickest way to evaluate these operators due to the reduction in operation count, our previous work demonstrates that the fastest technique depends heavily on polynomial order, element type and the type of operator under consideration. This points towards there being the need for a number of different amalgamation schemes in order to attain optimal performance.

[Figure 2: Diagrammatic representation of each amalgamation scheme, including the StdMat and SumFac schemes. Four quadrilateral elements with $P_1 = P_2 = 1$ and $Q_1 = Q_2 = 3$ are considered for the backward transform operator.]
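The following sketch illustrates the sum-factorised evaluation of Equation (4) for a hexahedron as three successive contractions. The function name, argument layout and loop structure are illustrative assumptions rather than the Nektar++ code, and here the parameters P1, P2, P3 denote the number of modes per direction rather than the polynomial order itself.

```cpp
#include <cstddef>
#include <vector>

// Sum-factorised backward transform (4) on a hexahedron. B1, B2, B3 hold the
// one-dimensional basis functions evaluated at the quadrature points, stored
// row-major as (Q_i x P_i) matrices. Coefficients uhat are ordered with p
// fastest, then q, then r.
std::vector<double> BwdTransSumFacHex(
    const std::vector<double>& B1, const std::vector<double>& B2,
    const std::vector<double>& B3, const std::vector<double>& uhat,
    std::size_t P1, std::size_t P2, std::size_t P3,   // modes per direction
    std::size_t Q1, std::size_t Q2, std::size_t Q3)   // quadrature points per direction
{
    // Stage 1: contract over r -> t1(p, q, k)
    std::vector<double> t1(P1 * P2 * Q3, 0.0);
    for (std::size_t k = 0; k < Q3; ++k)
        for (std::size_t q = 0; q < P2; ++q)
            for (std::size_t p = 0; p < P1; ++p)
                for (std::size_t r = 0; r < P3; ++r)
                    t1[p + P1 * (q + P2 * k)] +=
                        B3[k * P3 + r] * uhat[p + P1 * (q + P2 * r)];

    // Stage 2: contract over q -> t2(p, j, k)
    std::vector<double> t2(P1 * Q2 * Q3, 0.0);
    for (std::size_t k = 0; k < Q3; ++k)
        for (std::size_t j = 0; j < Q2; ++j)
            for (std::size_t p = 0; p < P1; ++p)
                for (std::size_t q = 0; q < P2; ++q)
                    t2[p + P1 * (j + Q2 * k)] +=
                        B2[j * P2 + q] * t1[p + P1 * (q + P2 * k)];

    // Stage 3: contract over p -> u(i, j, k)
    std::vector<double> u(Q1 * Q2 * Q3, 0.0);
    for (std::size_t k = 0; k < Q3; ++k)
        for (std::size_t j = 0; j < Q2; ++j)
            for (std::size_t i = 0; i < Q1; ++i)
                for (std::size_t p = 0; p < P1; ++p)
                    u[i + Q1 * (j + Q2 * k)] +=
                        B1[i * P1 + p] * t2[p + P1 * (j + Q2 * k)];
    return u;
}
```

Each stage is a small dense contraction over a single direction, which is why the overall operation count scales as $O(P^4)$ rather than the $O(P^6)$ of the naive evaluation.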

Amalgamation schemes
Our earlier studies applied the strategies of the previous section by iterating over each element, evaluating the operator and measuring the total execution time for the entire mesh. However, in the context of memory efficiency and using the CPU cache effectively, this approach may not prove optimal if matrices are not stored contiguously in memory. Additionally, since local matrices must be generated for each element, there is a large cost incurred in moving them from main memory to the processor.
The purpose of this work is therefore to reformulate these strategies in the context of multiple elements. We aim to remove local matrices wherever possible, thereby reducing data movement and increasing data locality. We will leverage both the tensor-product decomposition of the spectral/hp element method and the use of a standard region, on which we can define an operator for many elements simultaneously. Then, through grouping local elemental storage of the coefficient and physical spaces, we aim to apply standard level-3 BLAS operations such as dgemm for matrix multiplication wherever possible. These routines, owing to their widespread use, are highly optimised. Indeed, many HPC facilities and vendors supply custom implementations with processor-specific code that can achieve a significant percentage of peak floating-point performance.
The first step is to consider the mathematical principles behind the amalgamation procedure and decide precisely how to group elemental operations together. The defining characteristics of the action of an operator on an element $\Omega^e$ are the element type, the triple $(P_1, P_2, P_3)$ defining its polynomial order, and the geometric mapping $\chi^e$ with its derivatives and Jacobian determinant $J^e$. These properties should then dictate the grouping of elements within a given mesh.
However, for the purposes of amalgamating elements, we note that the geometric information can be ignored. Since each element is mapped to a standard element $\Omega_{\mathrm{st}}$, we may choose to perform our computationally expensive operations on $\Omega_{\mathrm{st}}$ and then apply the geometric information as a secondary step. To demonstrate how this works, we must investigate an alternative to the backwards transformation (1), which does not use the geometric terms. We therefore consider another fundamental operation: calculating the $L^2$ inner product of a function $u$ with respect to the basis functions $\phi^e_i(x) = \phi_i(\chi^e(\xi))$ on each element $\Omega^e$. On a single element, this is defined as

$$(\phi^e_i, u)_{\Omega^e} = \int_{\Omega^e} \phi^e_i(x)\, u(x)\, \mathrm{d}x \approx \sum_{j=1}^{N_Q} w_j\, J^e(\xi_j)\, \phi_i(\xi_j)\, u(\chi^e(\xi_j)). \qquad (5)$$

If we evaluate $u$ at the quadrature points, and do the same for $J^e$, we may construct a vector $w^e$ with components $w^e_j = w_j\, J^e(\xi_j)\, u(\chi^e(\xi_j))$, where $w_j$ are the quadrature weights on the standard element. Then (5) can be expressed as the matrix-vector product $A = B^\top w^e$. We may therefore naturally extend this into a multi-element setting by combining the geometric terms and quadrature weights into a single vector $W = [w^1, \ldots, w^N]$ that spans a group of elements $1 \le e \le N$.
Our strategy for amalgamation therefore begins by grouping elements which are of the same type and have the same polynomial order. For $N_{\mathrm{el}}$ elements in any given group, the elemental coefficients $\hat{u}^e$ and the physical solution at the quadrature points $u^e_j = u(\xi_{1j}, \xi_{2j}, \xi_{3j})|_{\Omega^e}$ are stored in contiguous vectors, which we denote by $\hat{U} = [\hat{u}^1, \ldots, \hat{u}^{N_{\mathrm{el}}}]$ and $U = [u^1, \ldots, u^{N_{\mathrm{el}}}]$ respectively. We then consider four different schemes, which are depicted graphically in Figure 2.
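A hypothetical data layout for one such group is sketched below; the struct and field names are illustrative only, intended to show the contiguous per-group storage of $\hat{U}$, $U$ and the combined weight vector $W$, rather than the actual Nektar++ classes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for one amalgamation group: all elements share the
// same shape and polynomial order, so their coefficient and quadrature-point
// data can be stored back-to-back in contiguous arrays.
struct CollectionGroup
{
    std::size_t nElmt;                // number of elements in the group
    std::size_t nDof;                 // coefficients per element
    std::size_t nQuad;                // quadrature points per element
    std::vector<double> coeffs;       // size nElmt * nDof  (the vector \hat{U})
    std::vector<double> phys;         // size nElmt * nQuad (the vector U)
    std::vector<double> jacWeights;   // size nElmt * nQuad (Jacobian x weights, W)

    // Views into the data of element e within the group.
    double* ElmtCoeffs(std::size_t e) { return coeffs.data() + e * nDof;  }
    double* ElmtPhys  (std::size_t e) { return phys.data()   + e * nQuad; }
};
```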

Local sum factorisation
LocalSumFac represents the baseline existing implementation that we will compare against. We iterate over each expansion, applying the sum-factorised version of a given operator. All operations are considered locally and without any amalgamation. Therefore the matrix representation of an operator, as well as the Jacobian determinant $J^e$ and related geometric factors, are not guaranteed to be contiguous in memory, depending on the implementation choices that have been adopted within the code. However, the use of per-element geometric information allows us to benefit from a reduction in storage, since planar-sided elements have a constant Jacobian determinant.

Iteration over elements with sum-factorisation
In the IterPerExp scheme, we use the local sum-factorisation technique of Equation (4) in a similar fashion to LocalSumFac. However, we now combine this with the geometry amalgamation technique outlined above, so that the geometric information is stored contiguously in memory. The standard basis matrices $B_i$ are shared between each element; however, no particular attention is paid to keeping them contiguous in memory. In regards to the geometric terms, we note that on planar-sided elements, $J^e$ is constant as a function of $\xi$. However, to ensure memory is streamed efficiently and to avoid unnecessary branching, we store the $J^e$ terms as a contiguous vector of size $N_{\mathrm{el}} \times N_Q$.

Standard matrix approach
The aim of the StdMat scheme is to utilise the standard element, by first forming a matrix $M$ on $\Omega_{\mathrm{st}}$ that represents our operator. Considering, for example, the inner product, $M$ has dimensions $N_Q \times N_{\mathrm{dof}}$. To calculate the inner product, we first perform a pointwise multiplication (i.e. Hadamard product) of the function $u$ at the quadrature points, $U$, with the vector $W$ that contains the combined Jacobian and quadrature weights, to obtain a vector $V$. We then calculate the action of $M$ on multiple elements through a single matrix-matrix multiplication

$$\hat{U}_{[N_{\mathrm{dof}}]} = M^\top V_{[N_Q]},$$

where $\hat{U}_{[N_{\mathrm{dof}}]}$ is the interpretation of $\hat{U}$ as an $N_{\mathrm{dof}} \times N_{\mathrm{el}}$ matrix stored in column-major format, so that each column is the vector of elemental coefficients $\hat{u}^e$, and $V_{[N_Q]}$ is interpreted similarly as an $N_Q \times N_{\mathrm{el}}$ matrix. Each column of the result contains the elemental inner products of $u$ with respect to each of the basis functions. In this case, we can evaluate the inner product with a single call to the BLAS routine dgemm (matrix-matrix multiplication), taking full advantage of the optimised BLAS functionality. We note that one advantage of this method is that its implementation is precisely the same for each element type, making the scheme straightforward to implement. We also eliminate the need for local elemental matrices. However, we do not explicitly use the tensor-product composition property of the elements. This means that, in comparison with approaches that use sum-factorisation, the effective operation count is larger and thus the StdMat scheme may prove more expensive.
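A sketch of this scheme using the standard CBLAS interface is given below; cblas_dgemm is the conventional C interface to dgemm, while the surrounding function and variable names are illustrative assumptions rather than the Nektar++ API.

```cpp
#include <cblas.h>
#include <cstddef>
#include <vector>

// StdMat inner product over a group of elements. M is the standard-element
// basis matrix (nQuad x nDof, column-major), U holds the physical values of
// all elements (nQuad x nElmt), W the combined Jacobian determinants and
// quadrature weights, and Uhat (pre-sized to nDof*nElmt) receives the
// elemental inner products (nDof x nElmt).
void InnerProductStdMat(const std::vector<double>& M,
                        const std::vector<double>& U,
                        const std::vector<double>& W,
                        std::vector<double>&       Uhat,
                        std::size_t nQuad, std::size_t nDof, std::size_t nElmt)
{
    // Pointwise (Hadamard) product V = W .* U over all elements at once.
    std::vector<double> V(nQuad * nElmt);
    for (std::size_t i = 0; i < V.size(); ++i)
        V[i] = W[i] * U[i];

    // Single dgemm: Uhat = M^T * V, acting on every element in the group.
    cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                (int)nDof, (int)nElmt, (int)nQuad,
                1.0, M.data(), (int)nQuad,
                V.data(), (int)nQuad,
                0.0, Uhat.data(), (int)nDof);
}
```

Because the whole group is handled by one $N_{\mathrm{dof}} \times N_Q$ by $N_Q \times N_{\mathrm{el}}$ product, the matrix sizes grow with the group size, which is what allows the BLAS implementation to approach its peak throughput.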

Collective sum-factorisation
In the SumFac scheme, we extend the sum-factorisation technique of Equation (4) to multiple elements. As in the previous scheme, the goal is to extend the formulation of Equation (4) through the use of dgemm. However, due to the way that elemental data is laid out in memory, this is not always possible without memory-intensive transposition of the data. We therefore opt to iterate over the elements for certain substeps. For example, the backward transformation on the hexahedron is calculated in four steps:

• preallocate a vector $U_1$ of size $P_1 P_2 Q_3 N_{\mathrm{el}}$;
• for each element, calculate $u^e_1 = B_3 (\hat{u}^e_{[P_3]})^\top$, placing the result in the correct position inside $U_1$, which amalgamates the contributions from each element;
• apply $B_2$ to the amalgamated vector $U_1$ through a single matrix-matrix multiplication, storing the result in a second temporary vector $U_2$;
• apply $B_1$ to $U_2$ in the same manner to obtain the physical values $U$.

Whilst the iteration over elements in this first computational step cannot be avoided, since the $B_i$ matrices are re-used, data locality can be ensured, giving the best possible chance for caching to be used effectively. Similar formulations can be made for all of the operators and element types under consideration.

Implementation of amalgamations
To evaluate the action of an operator on the entire mesh, upon setup we ensure that the memory for each collection is arranged contiguously. Then, when an operator evaluation is required as part of a solver, we iterate over each collection group and apply the selected amalgamation scheme. The implementation inside Nektar++ allows for the selection of different schemes for different collection groups. As we shall see in the following section, this is important since each group may have vastly different performance metrics. As well as the backward transform and inner product operators already introduced, we will consider a further two key operators:

• the $L^2$ inner product with respect to the derivatives in each direction of the basis functions: this is used to construct elemental Laplacian (stiffness) matrices and in other discretisations such as the discontinuous Galerkin method;
• the derivative $\partial u / \partial x_i$ of a function $u$ in each of the coordinate directions $x_i$, calculated at each of the quadrature points.
Together, these four operators form the building blocks of a large variety of solvers, one of which we shall investigate in Section 6. In the following section, we present results from timing each of the schemes when applied to a three-dimensional example, demonstrating how amalgamation can be used to improve runtime performance.

Results
In this section, we apply the amalgamation schemes described previously to an example mesh containing tetrahedral and prismatic elements. We demonstrate how each scheme exhibits different performance characteristics which, when combined appropriately, lead to an overall improvement in execution times across a range of polynomial orders. We examine whether the number of elements in an amalgamation group significantly impacts the observed performance. Finally, we highlight how the choice of BLAS implementation can affect the observed performance, which motivates the introduction of an auto-tuning strategy in the following section.

General performance characteristics
In our first series of tests, we aim to determine a general picture of the performance characteristics of each amalgamation scheme. We therefore examine a mesh of a rabbit aorta section, which contains two sets of intercostal artery pairs, as shown in Figure 3. This is a representative example of a typical biological flow configuration. As the inset shows, the mesh comprises a boundary layer of curvilinear prismatic elements, with tetrahedral elements being used to fill the interior of the volume. The mesh contains 21,564 prismatic elements and 40,877 tetrahedral elements in total.
Our initial tests are performed across a range of polynomial orders $1 \le P \le 8$, with two collection groups being created: one for the prismatic elements and another for the tetrahedral elements. For each group, we record the average execution time, measured across a number of samples, for the evaluation of each operator and amalgamation scheme outlined in the previous section.
The test hardware used is a single core of a 2.7GHz Intel Xeon E5-2697v2 CPU with 30MB of L3 cache and 192GB of RAM. A standard Debian Linux installation is used, containing gcc v4.7.2 with default -O3 optimization flags and, for our initial tests, the standard Netlib BLAS implementation [1]. The system was kept idle at the time of testing, to the best of our ability, to limit the effects of other processes on timings.

Figures 4 and 5 show the relative timings of each amalgamation scheme, normalised by the IterPerExp scheme. We see that, in general, significant performance improvements can be observed, particularly at lower polynomial orders, where large speedups of up to a factor of 8 are seen. Depending on the element and operator type, the StdMat scheme generally performs best at linear and quadratic orders. However, as the polynomial order is increased and the standard matrix becomes larger, this scheme rapidly becomes prohibitively expensive. As the polynomial order increases further, it is clear that the sum-factorised variants of either SumFac, IterPerExp or the baseline LocalSumFac are generally the correct scheme to select. We note that, broadly speaking, the IterPerExp and LocalSumFac methods appear to perform at similar levels. This indicates that the additional expense that is incurred by storing elemental Jacobian determinants for planar-sided elements is not substantial in comparison to the cost of the operator. Here, we posit that the local sum-factorised matrix sizes are large enough that, regardless of whether matrices are allocated contiguously in memory or not, a local approach to sum-factorisation still provides a highly efficient implementation.

Effect of group size
An additional factor to consider is the number of elements in the amalgamation group. Since BLAS is known to achieve a smaller fraction of peak CPU performance for smaller matrix sizes, increasing the number of elements in an amalgamation group will lead to larger matrix sizes and thus may increase performance. To this end, we repeat the previous procedure, examining the inner product operator with the IterPerExp scheme for tetrahedral elements, as they have the smallest local matrix sizes. The number of elements in the amalgamation is limited to powers of 10 until the entire mesh of tetrahedral elements is recovered. The measured execution time is then scaled to determine the equivalent time required to evaluate the action of the operator on the whole mesh. The results, depicted in Figure 6, show that, with some minor exceptions, the runtime is broadly equal across the range of group sizes. Other tests performed with different BLAS implementations also did not show a significant difference in performance. However, we note that this factor may be more important when considering two-dimensional elements, which will generally have smaller matrix sizes. For the purposes of the rest of this paper, this parameter is not considered further.

Auto-tuning strategy
Whilst from the previous section we see that the performance improvements that can be obtained with amalgamation schemes are significant, it is clear that choosing a scheme in an a priori fashion is highly nontrivial. As we have seen in both this and previous work, operation count calculations alone are not sufficient to give an accurate estimation of the most efficient scheme. In reality, the number of variables involved in analysing the efficiency of each scheme is highly problem- and hardware-dependent.
For example, let us consider the effects of switching from the Netlib BLAS implementation to the OpenBLAS library [2], widely regarded to be highly efficient on modern hardware. To obtain a general idea of the changes in performance characteristics, we repeat the experiment of Figure 5 on tetrahedral elements, but this time using OpenBLAS, which is configured to only use a single thread. The results, which can be seen in Figure 7, demonstrate remarkably different properties. For example, the StdMat scheme is now far more efficient across the entire range of polynomial orders, with even greater speedups observed at low polynomial orders. The SumFac scheme also exhibits greater performance in the mid- to high-polynomial-order range. This may be due to the relatively poor performance of OpenBLAS on small matrices, such as those used in the other methods against which we compare StdMat and SumFac. However, we see that the baseline LocalSumFac method is now widely outperformed. Changing a single parameter in our computational setup has therefore drastically altered the amalgamation schemes that would yield an optimal simulation.
The choice of BLAS implementation is far from the only parameter that can affect performance. From the perspective of hardware, the choice of CPU, its cache hierarchy, bus speed and peak memory bandwidth are all important in each scheme's performance. Different software choices, such as the choice of compiler, operating system and optimization settings, also impact runtime behaviour. Although we may be able to observe general trends in the relative performance of each scheme, important factors such as the polynomial order at which the crossover occurs, whether this extends more generally to other CPU models, and the further effects of multi-core simulations, all point towards the need for a simple method to automatically select the fastest scheme at runtime for a given problem.
To mitigate these factors, we have developed a simple and effective auto-tuning strategy which can be used at runtime to automatically select the fastest amalgamation scheme. At the start of a simulation, Nektar++ creates a list of elements that lie in the three-dimensional problem mesh, as well as lists of quadrilateral and triangular elements that represent the two-dimensional boundary conditions. These are then used to construct the collection groups, as defined in Section 2. We then iterate over each collection group and apply the following procedure (a simplified sketch of the timing loop is given after this list):

1. Query a lookup table for existing performance metrics, based on the group's polynomial order, element type and number of elements in the group. All groups with 100 elements or more are identified to have the same metrics, based on the results of Figure 6.
2. If no existing metrics are found, then:
   • execute each amalgamation scheme once, recording the execution time in order to estimate the number of operator evaluations that can be performed in one second;
   • run this number of tests against random input data;
   • select the scheme with the fastest average time;
   • record this metric in the lookup table.
3. Otherwise, use the obtained metric to set the correct scheme.
4. Go to the next group and apply step 1 until all groups are processed.
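The following is a simplified sketch of the per-group timing loop described in step 2; the Candidate wrapper, the function names and the one-second budget heuristic are illustrative assumptions rather than the actual Nektar++ implementation.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// One candidate amalgamation scheme for a given collection group.
struct Candidate
{
    std::string name;              // e.g. "StdMat", "SumFac", "IterPerExp"
    std::function<void()> apply;   // evaluates the operator on the group once
};

std::string SelectFastestScheme(const std::vector<Candidate>& schemes)
{
    std::string best;
    double bestTime = std::numeric_limits<double>::max();

    for (const auto& s : schemes)
    {
        using clock = std::chrono::steady_clock;

        // One warm-up evaluation estimates how many runs fit in ~1 second.
        auto t0 = clock::now();
        s.apply();
        double once  = std::chrono::duration<double>(clock::now() - t0).count();
        int    nRuns = std::max(1, (int)(1.0 / std::max(once, 1e-9)));

        // Average over the estimated number of runs against the same data.
        t0 = clock::now();
        for (int i = 0; i < nRuns; ++i) s.apply();
        double avg = std::chrono::duration<double>(clock::now() - t0).count() / nRuns;

        if (avg < bestTime) { bestTime = avg; best = s.name; }
    }
    return best; // stored in the lookup table for groups with the same characteristics
}
```

The result for each group is keyed on element type, polynomial order and group size, so that subsequent groups with the same characteristics reuse the stored decision instead of re-running the tests.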
The use of a lookup table, as well as an equivalence relation between groups having greater than 100 elements, means that in a typical simulation, this auto-tuning method will require less than one minute to run. This is true even for large problems that run in parallel on a HPC cluster. In the worst case, the mesh will be sufficiently large that the initial operator evaluation will take longer than one second, meaning that no averaging procedure will be performed. We additionally note that occasionally, operating system kernel task management will abruptly skew observed execution times, and so a lack of averaging could lead to an incorrect indication of the optimal performance metric. However, we posit that this is unlikely in all but the most extreme cases. For scheme selection to be significantly influenced by kernel operations requires a large kernel interruption of the order of one second. In general, we regard such cases as 'extreme' as we would expect our hardware to be HPC-based: that is, composed of empty, dedicated compute nodes with very few processes running at the time of execution. As a representative example, in Figure 6, the recorded runtime for each amalgamation scheme is less than 0.7s for a mesh of 40,000 elements. However, for a realistic flow simulation, in order to obtain results in a timely fashion, we may use around O(10^2) processors depending on the polynomial order. In this case, assuming optimal scaling of this communication-free routine, the execution time for each scheme will be less than 0.007s, allowing for around 140 operator executions to be measured.

Hardware performance analysis
In order to better understand the performance observations of the different amalgamation schemes with respect to the underlying low-level hardware, we give a brief overview of metrics obtained through examination of CPU hardware counters, which give a more detailed picture of the operations being performed on the CPU. To reduce the parameter space, we now consider 1,000 prismatic elements under the actions of the physical derivative operator, which admits a simple tensor product structure, and the inner product operator, in which a more complex procedure must be adopted to account for the dependency in the tensor product indices. These tests are performed using OpenBLAS and a single core of an Intel i7-5960K 6-core Haswell-architecture CPU, on a machine with 32GB RAM and the Ubuntu 15.10 Linux distribution with kernel version 4.1.12, using standard -O3 compiler optimisations with the gcc 5.2.1 C++ compiler.
In addition to the runtime, we examine some of the key system characteristics, including the memory bandwidth attained in streaming data from main memory and the cache hit ratio for the highest-level L3 cache. These metrics are obtained using the Intel Performance Hardware Counter library at appropriate points within the test program, which is executed in the same manner as in the previous sections. We additionally calculate the number of floating point operations per second (FLOPS) achieved for the SumFac and StdMat schemes by examining the matrix-matrix multiplications in each scheme and taking the standard approach of estimating the floating-point operation count of multiplying an $M \times K$ matrix by a $K \times N$ matrix as $2MNK$ [11]. The IterPerExp and LocalSumFac schemes are omitted as they share nearly identical operation counts with SumFac.

Figure 8 shows the results of this study. We see that this architecture again yields a different set of runtime performance characteristics, which further emphasises the need for auto-tuning as part of the practical use of these schemes in real-world applications. For the most part, the fastest schemes for the inner product operator follow a somewhat similar trend, with StdMat being fastest at lower polynomial orders and SumFac fastest towards higher polynomial orders. We see that the three main factors in this performance are a combination of the operation count of the algorithm, the memory bandwidth attained and the FLOPS achieved. At lower polynomial orders, StdMat and SumFac share a reasonably similar order of complexity, but StdMat is able to achieve a higher FLOPS count, peaking at around 20 GFLOPS, which represents around 30% of the peak performance of one core of this processor. As the polynomial order increases, even though StdMat achieves consistently higher FLOPS counts, the lower operation count of SumFac and the higher memory bandwidth attained in the smaller-sized matrices that SumFac uses become more important, leading to reduced runtimes. We also observe that the IterPerExp scheme, which is designed to optimise memory throughput, shows a similar improvement in memory bandwidth over the LocalSumFac scheme, allowing it to achieve better runtime performance. L3 cache performance is reasonably good across all of the schemes, with improvements over LocalSumFac at a number of polynomial orders, but in this case does not prove to be a limiting factor. We see generally similar trends in the physical derivative operator, although this time the SumFac scheme shows less of a performance increase over the other two sum-factorisation-based schemes.
These results align with the findings of other multi-core performance models, such as the roofline performance model [25], as seen in other high-order performance analyses [27]. In this model, the arithmetic intensity (i.e. the number of FLOPs per byte of data moved) dictates whether performance is ultimately limited by the maximum FLOPS attainable by the processor or by the memory bandwidth of the system. Our amalgamation schemes all use matrix-matrix or matrix-vector multiplications that are of a reasonably high arithmetic intensity. However, the smaller matrices arising from sum-factorisation will result in a lower arithmetic intensity. The development of multiple amalgamation schemes is therefore important, as this allows us to investigate a broader extent of the arithmetic intensity parameter space. This may therefore lead to better memory bandwidth utilisation and lower runtimes, even though the number of FLOPS does not reach as high a level as in other schemes, which we broadly see in the results of Figure 8.
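To illustrate how the schemes move along the roofline, the following worked estimate computes the arithmetic intensity of a double-precision matrix-matrix product under the idealised assumption that each matrix is read or written exactly once; this simplification is ours and is not a measured property of the schemes above.

```latex
% Idealised arithmetic intensity of C = A B with A (M x K), B (K x N), C (M x N),
% in double precision (8 bytes per entry), assuming each matrix is touched once.
\[
  \text{FLOPs} = 2MNK, \qquad \text{bytes} \approx 8\,(MK + KN + MN),
\]
\[
  I = \frac{2MNK}{8\,(MK + KN + MN)} = \frac{n}{12}\ \text{FLOP/byte}
  \quad \text{when } M = N = K = n .
\]
```

Under this model, the large amalgamated matrices used by StdMat sit towards the compute-bound side of the roofline, whereas the small per-direction matrices of the sum-factorised schemes have lower intensity and are more sensitive to the attainable memory bandwidth.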

Application to a real example
To demonstrate the benefits of the amalgamation schemes in combination with our auto-tuning method, we now examine a real-world example in a HPC environment. We consider the compressible Euler equations

$$\frac{\partial \mathbf{u}}{\partial t} + \nabla \cdot \mathbf{F}(\mathbf{u}) = 0,$$

where $\mathbf{u} = [\rho, \rho u, \rho v, \rho w, E]^\top$ is the vector of conserved variables, $\mathbf{F}(\mathbf{u})$ is the flux tensor, $\rho$ is the density, $(u, v, w)$ are the velocity components, $E$ is the total energy and $p$ is the pressure. An equation of state is needed to close the system. In this case, we use the ideal gas law $p = \rho R T$, where $T$ is the temperature and $R$ is the gas constant.
To discretise this hyperbolic system, we adopt a discontinuous Galerkin method, which is well-suited in this setting due to its local conservation properties. Specific details on the discretisation are available in [12] and on its implementation in Nektar++ in [3,8]. However, we note that the discretisation makes heavy use of the inner product operator with respect to the derivatives of the basis functions, since the weak form involves the calculation of the volume flux term

$$\int_{\Omega^e} \nabla \tilde{v} \cdot \mathbf{F}(\tilde{u})\, \mathrm{d}x$$

for a test function $\tilde{v}$ composed of polynomial functions. We consider the flow over an ONERA M6 swept wing, a common aeronautics test case [21], at a freestream Mach number $M_\infty = 0.4$. The distribution of the Mach number across the wing surface and symmetry plane, as the flow is evolved to a steady state, is visualised in Figure 9. A curvilinear tetrahedral mesh of 147,805 elements represents the underlying geometry, generated through the use of an elastic analogy described in [17]. The flow is evaluated using a fully-explicit four-stage fourth-order Runge-Kutta scheme with a timestep of $\Delta t = 10^{-7}$ at a uniform polynomial order of $P = 2$. An HLLC Riemann solver is used to solve the one-dimensional Riemann problem arising at elemental interfaces.
To run this simulation, containing around 1.5 million degrees of freedom per conserved variable, requires the use of larger-scale computing resources. We consider two machines of varying capabilities. The first, cx2, is located within the Imperial College HPC facility. We use 16 nodes, each containing two 2.93GHz 6-core Intel Xeon X5670 CPUs with 12MB of L3 cache and 24GB of RAM, connected over an Infiniband interconnect, for a total of 192 cores. The second is ARCHER, the UK national supercomputer located at EPCC at the University of Edinburgh. This is a Cray XC30 MPP machine, comprising nodes that contain two 2.7GHz 12-core Intel Xeon E5-2697 v2 (Ivy Bridge) processors with 64GB of RAM, connected via a Cray Aries interconnect. We use 20 nodes for our experiment, giving a total of 960 computing cores. We note that the same mesh and polynomial order is used for each system.

Table 1 shows the timings obtained from the auto-tuning method on the root process, which executes at solver setup and takes around 20 seconds. We note that each processor runs its own set of timings, and this should therefore only be seen as a representative example. The auto-tuning method is run across all nodes simultaneously, with each processor selecting its own fastest observed scheme. We deliberately choose this approach, as opposed to, for example, the root process dictating the scheme, since mixed-element meshes will lead to partitions requiring different scheme selections. The table shows the timings for each method, with the fastest highlighted in bold. We see that typically the StdMat operator is preferred, which corresponds to the general trends seen in Section 3 for this polynomial order. Indeed, over the LocalSumFac baseline scheme, we see that generally the amalgamation schemes perform extremely well, being almost an order of magnitude faster in the case of cx2.
To see the effect of each scheme on the overall runtime, we now measure the average wall-time needed per timestep over 5,000 samples. This timing therefore incorporates all other aspects of the compressible Euler solver, such as the Riemann solver and communication costs. Table 2 shows the measured wall-time for each of the machines. The use of our amalgamation schemes, in combination with the auto-tuning method, results in a reduction in overall runtime of around 40% for cx2 and 60% for ARCHER. This is a considerable improvement in execution time and shows the potential advantages that can be achieved when using amalgamation.

Conclusions
In this work, we have presented a methodology for amalgamating the action of various key finite element operators across a range of elements. The resulting amalgamation schemes demonstrate improved performance due to their more efficient use of data locality and reduction in data transfer across the memory bus, enabling increased performance through exploiting optimised BLAS routines and the CPU cache structure. An auto-tuning method was presented, enabling the automatic selection of the most efficient scheme at runtime. We have shown how these schemes can be leveraged to improve runtimes, both by examining the schemes individually and by applying them to a large-scale simulation of the compressible Euler equations. The results clearly demonstrate the importance and benefits of streaming data from memory efficiently.
As alluded to in the introduction, we stress that the results shown here are generally not specific to the spectral/hp element method, due to the fundamental nature of the operators being used. Other high-order schemes, such as the popular nodal discontinuous Galerkin method [12], rely on the evaluation of the same types of operators, which in turn have similar matrix formulations. However, we note that the SumFac scheme may not be applicable, depending on the choice of basis functions used in the local expansion of each element. As we describe in Section 2, sum-factorisation relies on the ability to write local expansion modes as the tensor product of one-dimensional functions. In the nodal DG scheme, hybrid elements such as prisms and tetrahedra typically use Lagrange interpolants together with a set of suitable solution points, such as Fekete or electrostatic point distributions. These basis functions are inherently not tensor-product based, and so the SumFac schemes we consider here cannot be utilised for these element types. However, the IterPerExp and StdMat schemes are both equally applicable in this setting.
Possible routes for further work could focus on improvements to the auto-tuning method, which may not be well-suited to a fully hp-adaptive simulation. Other kernel implementations that are specifically designed to optimise performance for particular operators, such as the approach investigated in [27] that generates matrix-specific code based on the sparsity of the standard matrix, could also prove useful in furthering the performance of this work. We also note that, in general, the performance gains presented in Section 6 may not be achievable in other solvers. In particular, where the problem is either fully implicit or semi-implicit, a large proportion of the overall runtime is usually spent inverting the global matrix system. In parallel this is done iteratively, with the action of the global matrix represented through a block-diagonal elemental matrix and an appropriate gather-scatter operation. Future studies on this topic should therefore focus on making this process more memory-efficient in combination with our amalgamation schemes.