Fast native-MATLAB stiﬀness assembly for SIPG linear elasticity

When written in MATLAB the ﬁnite element method (FEM) can be implemented quickly and with signiﬁcantly fewer lines, when compared to compiled code. MATLAB is an attractive environment for generating bespoke routines for scientiﬁc computation as it contains a library of easily accessible inbuilt functions, eﬀective debugging tools and a simple syntax for generating scripts. However, there is a general view that MATLAB is too ineﬃcient for the analysis of large problems. Here this preconception is challenged by detailing a vectorised and blocked algorithm for the global stiﬀness matrix computation of the symmetric interior penally discontinuous Galerkin (SIPG) FEM. The major diﬀerence between the computation of the global stiﬀness matrix for SIPG and conventional continuous Galerkin approximations is the requirement to evaluate inter-element face terms, this signiﬁcantly increases the computational eﬀort. This paper focuses on the face integrals as they dominate the computation time and have not been addressed in the existing literature. Unlike existing optimised ﬁnite element algorithms available in the literature the paper makes use of only native MATLAB functionality and is compatible with Octave GNU. The algorithm is primarily described for 2D analysis for meshes with homogeneous element type and polynomial order. The same structure is also applied to, and results presented for, a 3D analysis. For problem sizes of 10 6 degrees of freedom


Introduction
Finite element analysis (FEA) is commonly used as a technique for solving partial differential equations by engineers, mathematicians and scientists.The MATLAB environment, with its library of functions and debugging procedures, allows bespoke FEA routines to be generated quickly with few lines.Examples include Coombs et al. [1], Sigmund [2] and others.However, an unoptimised MATLAB script will often run significantly slower than unoptimised compiled code [3].This paper demonstrates how the advantage of using only native MATLAB to generate FEA routines is not necessarily penalised with slow run times when written in an optimised form for the symmetric interior penalty Galerkin (SIPG) method.
Significant progress on optimising FEA routines in MATLAB was achieved by Dabrowksi et al. [3] in 2008.The authors presented MILAMIN, an open source optimised non-native MATLAB implementation of continuous Galerkin (CG) FEA code that is capable of setting up, solving, and post processing 2D unstructured mesh problems with 10 6 degrees of freedom (DOF) in under a minute.One common method to compute the global stiffness matrix is to compute each local element matrix in turn through a series of small matrix multiplications.When creating the MILAMIN algorithm the authors recognised that there were two significant bottlenecks with this method.Firstly, two nested for loops are required to generate all the element stiffness matrices in a mesh.
The outer loop, to loop through all the elements and the inner loop, to loop through all the Gauss points.As MATLAB loops are inherently slow and the iteration number of the element loop is big when calculating the stiffness matrix for a large mesh (excess of 10 6 DOF) this was recognised as the first bottleneck.
The second bottleneck was recognised as the time required to transfer data between the RAM and the CPU cache; this time was significantly larger than the calculation in the CPU itself -even for large calculations [3].Matrix calculations in MATLAB are performed by the Linear Algebra Package (LAPACK) which calls the Basic Linear Algebra Subprogramme (BLAS) package.In conventional FEA codes the BLAS package is called for every Gauss point for each element individually making the total transfer time significant.
Dabrowski et al. [3] removed both bottlenecks by designing an algorithm where an entry in a local stiffness matrix could be computed for all elements simultaneously.Their size was consequently reduced.As an entry is calculated for all elements simultaneously the number of BLAS calls is proportional to the number of entries in the local element stiffness matrix, rather than the number of elements in the mesh.The number of BLAS calls is therefore in general smaller and no longer dependent on the size of the problem, the data transfer time is subsequently minimised removing the second bottle neck.The MILAMIN routine was further improved by maximising cache reuse, a technique known as blocking.This work has since been extended by introducing parallel vectorised stiffness matrix calculations in [4].
More recently Rahman and Valdman [5] produced a fast MATLAB script for a volumetric integral of elements with linear nodal shape functions.The focus was to start with a non-vectorised code with a standard finite element structure and then improve its computational speed through vectorisation.One of the key characteristics was to preserve the code's original structure, this ensured that the readability was not lost which is often the case in code optimisation.Lack of readability in optimised codes was also highlighted by Dabrowksi et al. [3].Additionally, Anjam and Valdman [6] produced a vectorised MATLAB script for Raviart-Thomas elements used in discretizations of H(div) spaces and Nédélec elements in discretizations of H(curl) space.Andreassen et al. [7] provided a comparison and discussion of computational performance between different vector computational languages to assemble a FE global stiffness matrix.Cuvelier el al. [8,9] presented a more general approach to vectorise routines for multiple vector languages.
In a FEA code once all the local element stiffness matrices have been calculated they are assembled together to form a sparse global stiffness matrix.In native MATLAB this is achieved using the command sparse which generates a sparse matrix from triplets of data: row position, column position and the associated value.The native performance is slow, Dabrowski et al. [3] used sparse2 a non-native MATLAB command.Other sparse matrix commands for MATLAB have also been created, investigated, improved and discussed in [10], who also provide their own improvement and GPU implementation.
In this paper the SIPG method for linear elasticity is implemented.Discontinuous Galerkin (DG) methods were first introduced by Reed et al. [11] for solving the neutron transport equation.Richter [12] prompted an extension of the original DG method to elliptical problems including linear convective-diffusion terms.However, the discontinuous approximation was only applied for the convective terms, with mixed methods for the second-order elliptic operators.Bassi and Rebay [13] introduced the complete discontinuous approximations for both the convective and second-order elliptical operators.
One arising characteristic of DG methods is that the degrees of freedom are element specific, allowing simple communication at the element interfaces.Specifically, hp-refinement is simplified due to its capability to incorporate hanging nodes at the element interfaces.These qualities make the DG method very suitable for efficient adaptive refinement to achieve high fidelity simulations [14].
The penalty for allowing this flexibility is that the number of terms to be integrated in the week form and degrees of freedom is higher for the same number and type of elements when compared to the CG method.The additional integrals are face connectivity stiffness terms which couple the unshared degrees of freedom between elements.This increases the number of calculations required to produce the global stiffness matrix, K [15], the need for efficient production of the K matrix is therefore necessary even for relatively small problems.This paper extends the algorithm presented by Dabrowski et al. [3] to include optimised integration of the face terms for SIPG, [16], for linear elastic problems in a vectorised blocked form.In this paper all the algorithms are designed for native MATLAB functionality only, a clean departure from the majority of the optimised MATLAB algorithms available in literature [3,5,4].The only other known vectorised, non-blocked, MATLAB code on DG methods is by Frank et al. [17].The authors in [17] consider the time dependent diffusion equation as their model problem, cast within a local DG formulation in 2D.Here we design a block vectorised code in native MATLAB, which exploits the symmetry in SIPG, to model linearly elastic problem in 2D and 3D.
The paper begins with a brief overview of the SIPG formulation for linear elasticity followed by a reformulation into a matrix form that can be computed in a vectorised algorithm in Section 2. The vectorised algorithm for computing SIPG face stiffness terms is presented and discussed in Section 3, the volume integral is omitted as it is thoroughly covered in [3].This is followed by a discussion on: generating the local face stiffness matrices, efficiently generating global variables, Gauss quadrature on faces and sparse storage of the local stiffness matrices into the global stiffness matrix.In Section 3 the Linear2D DG.m script is explained, with the full code available at [18].Timing results, validation, and discussions are presented in Section 4, followed by a conclusion in Section 5.

Optimising the DG method
g D and g N are data in [L 2 (Ω)] d , they are respectively the applied Dirichlet and Neumann boundary conditions.The Cauchy stress tensor is defined as , where ψ is the free energy function for hyperelasticity, ε is small strain, u is displacement and n is the normal unit vector to the boundary.
The Cauchy stress tensor can also be described σ = Dε(u) where D is a material stiffness tensor relating stress and strain.
This paper provides only a description of the 2D optimised code, therefore a description of the 3D element spaces and respective mesh is omitted.
The polygonal finite element mesh T is homogeneous in element type and is in general unstructured.Two element types are defined here, the triangle and quadrilateral, however since only one element type is present in a mesh both types are referred to as K.The polygonal mesh T is comprised of elements K which are either the image of the reference triangle or quadrilateral under an affine elemental mapping F K : K → K.The homogeneous discontinuous Galerkin finite element space for triangle elements is defined as the space of polynomials on K of degree less than or equal to 1 and Q p (K) is the space of polynomials on K less or equal to p in each dimension.
We denote by F(K) the set of the three elemental faces for the triangle, or as the set of the four elemental faces for the quadrilateral, of an element K.If the intersection F = ∂K + ∩ ∂K − of two elements K + , K − ∈ T is a segment, we call F an interior face of T .The set of all interior faces is denoted by F I (T ).
Analogously, if the intersection F = ∂K ∩ ∂Ω of an element K ∈ T and ∂Ω is a segment, we call F a boundary face of T .
The SIPG method for the approximation of the model problem ( 1) is now introduced in the bilinear form where the homogeneous Dirichlet boundary conditions on ∂Ω D are applied strongly.Find the displacement u h ∈ W p (T ) such that a(u h , w) = l(w) for all w ∈ W p (T ), where and β is a penalty term for linear elastic SIPG defined in [19], h f is this size of an element face, and where the element faces of K + and K − on an intersection F ∈ F I (T ) are respectively referred to as F + and F − .Additionally for convenience (•, •) and •, • are used, where (a, b) Ω = Ω ab and a, b ∂Ω = ∂Ω ab.

Matrix form of the SIPG method
Now that the weak form of the problem has been described it is possible to express the stress, strain and displacements in (2) as function of nodal displacements, shape functions and their derivatives, and material stiffness.Once expressed, each term in the bilinear form can be reformulated as a set of matrix multiplications which can be used to compute the stiffness matrix for SIPG.
The first step to achieving the matrix formulation is decomposing the element displacements u h into a matrix of element shape functions N n and their corresponding coefficients u n such that u h = N n u n where u n = [u 1 , v 1 , . . ., u nen , v nen ] T , nen is the number of element nodes, and, u and v are respectively the displacements in the x and y directions of the Cartesian coordinate system.Similarly the test function can be represented as w = N n w n .
As the small strain tensor is a function of u h , the strain can also be expressed as a set of matrix multiplications ε = LN n u n with the additional term as the small strain partial differential matrix operator [15].From hyperelasticity the Cauchy stress is simply expressed as σ = Dε = DLN n u n , where D is the plane stress or strain stiffness matrix.Substituting the matrix forms of the stress, strain and displacement into (2), and setting gives, The test function term in the Neumann boundary condition (3) is also expressed as a matrix multiplication Each term in the bilinear form can now be reformulated into a set of matrix multiplications by setting (10) equal to (9), multiplying out the brackets and dividing by w n to give where (K K +K F ) is the global stiffness matrix, K, comprised of the global element stiffness matrix and the face stiffness matrix respectively.U n is vector containing all the nodal displacements for all K ∈ T such that with respect to the mesh topology U n = K∈T u n (K).The remaining terms in (11) in their full form are The superscripts + and − in equations (12) to (23) correspond to variables existing respectively in K + and K − .The variable n + is a matrix of normal components to F + , its form for the SIPG linear elastic 2D problem is Last the set M is defined as which is the set of partial stiffness matrices which when integrated over a single face F , and assembled together with respect to the element topology of K + and K − , produce the local SIPG face stiffness matrix K LF for the face F .

Vectorising the SIPG face integration
In Section 2.2 the matrix formulation of the SIPG method was expressed in (11).Traditionally from here the SIPG local face stiffness matrices are produced by computing the local stiffness for each element, K, and face, F , individually.
However the size of the loops in such an algorithm is proportional to the size of the problem, i.e. the number of elements and faces.As loops in MATLAB are significantly slower than compiled code an algorithm with this structure is unacceptable to use for large problems [3].The approach used in this paper to speed up the computation of the global stiffness is to reformulate each partial stiffness matrix in (11) so that each matrix can be computed in a vectorised blocked algorithm.
A vectorised calculation is where multiple results for a scalar equation are calculated simultaneously.This is achieved by providing the inputs to a scalar equation as vectors and only performing entry-wise operations in the code.The current form of the partial stiffness matrices in (11) can not integrated in a vectorised algorithm since the result of each partial stiffness matrix can only be found through matrix operations.To calculate the integral of a partial stiffness matrix in a vectorised algorithm the matrices in M need to multiplied out to give a resultant single matrix with entries comprising of only scalar equations.
This allows an entry in a matrix to be integrated for all faces simultaneously.This removes the necessity to have a loop, that loops over all faces.The result is the size of the for loops in the algorithm are no longer dependent on the size of the problem.However, the Gauss point integration loop still exists, additionally two more loops are added to loop over the local nodal element combinations.These three loops are not dependent on the size of the problem and in general, expect for small problems, smaller in comparison to the number of elements in the mesh.Therefore the speed up provided by having these loops to allow vectorisation for large problems is significantly more than the loss of speed inherent with MATLAB loops.
The method for reformulating each partial stiffness matrix in equations (12) to ( 23) is the same, here the integral matrix term M C2 from ( 13) is used as an example The vector u − is omitted in the integral as it is the solution to the linear elastic problem and so is unknown.
To reformulate M C2 the shape functions and the derivatives, N − and B + , are expanded into their full form so M C2 becomes, The form of B + and N − are repeated down the rows and along the columns respectively so M C2 can therefore be rewritten in the condensed form where i and j are respectively the local finite element nodes numbers for elements K + and K − who's shape functions pre-and-post multiplied M C2 .The material stiffness matrix D is either acting in plane strain or stress and so is represented as When multiplied out (27) becomes .
M r C2 ≡ M C2 , however for the sake of clarity the reduced 2-by-2 matrix form of M C2 is redefined.An equivalent matrix exists for all the partial stiffness matrices in M .The new set M r is now defined and contains the equivalent 2-by-2 matrix forms of matrices in the set M , denoted with the superscript r, All the entries in M C2 are now represented by 4 scalar equations which are looped over the indices (i,j).

Code Assembly
The complete code layout is summarised by Algorithm 1, and correlates to lines of Linear2D DG.m, the optimised SIPG .m script provided by [18].The algorithm contains three stages: 1. Area integral: lines 1-5.MATLAB code: lines 22-64.
In stage 2 the SIPG face integral computes the local face stiffness matrix K LF for all faces in the mesh.Stage 2 dominates the number of lines in the code due to having 12 terms to evaluate rather than just one like in stage 1 (11).In stage 3 the local face stiffness matrices are assembled into a sparse global stiffness matrix completing the algorithm.The optimised SIPG area integral is not discussed as it is identical to the optimised CG area integral in [3] expect that the elements do share degrees of freedom.This paper focuses on the novel implementation of the blocked vectorised integration of the partial stiffness matrices in M to produce the global face stiffness matrix K F .When considering the fast vectorised computation of the SIPG face terms in (11) there are six main aspects to address: Optimising the CPU cache reuse (blocking), the structure of the vectorised algorithm, memory allocation, reducing the number of BLAS operations, generating variables for the blocked algorithm, and the reference Gauss point locations.In the following sections each of these points will be addressed in turn.Section 2.3 demonstrated that the matrices in M could be represented by a repeated 2-by-2 matrix of scalar equations looped over the nodal indices (i,j).for Area Gauss blocks do for Gauss point loop do 8: SIPG face integral set-up: Figure 3 9: SIPG face integral: Figure 1 10: end for 11: end for 12: Sparse storage: Figure 5 Arranging the partial stiffness matrices in M into their equivalent form in M r means that each partial stiffness matrix is constructed from entries, each of which are scalar equations that are applicable to all finite elements in the mesh.
Reformulating the matrices in M into the form in M r allows the integral of each entry in the partial stiffness to be computed for all faces simultaneously in a vectorised algorithm; the schematic for such an algorithm in MATLAB is represented in Figure 1.The following subsections use the matrix M r C2 as an example.

Blocking
When the CPU performs a BLAS operation the best performance is achieved when all the data required for the operation resides in the lowest level of cache, as this is fastest accessed.However when the data size is too large to reside entirely in the cache, sections are stored on higher levels of CPU cache or the RAM, both of which are slower to access.The technique for maximising the vector size, with the condition that the data for a BLAS operation resides in the cache memory, is called blocking.A vector integral calculation for an entry in a partial stiffness matrix can exceed the CPU cache size.In the case where the cache memory is exceeded the set of faces F(T ) is split into blocks of faces, defined as SIPG face blocks.The vector calculation for a matrix entry is now performed for each block in turn so the cache reuse is maximised, reducing the overall run time.This process is dictated by the for loop on line 2, Figure 1, which runs through all the SIPG face blocks in the mesh.

Structure
The structure of the algorithm which generates the SIPG local face stiffness matrix is described in Figure 1.It is characterised by four for loops appearing on lines 2, 3, 6, and 11.The first loop, loops through all the SIPG face blocks.The second loop is the Gauss point loop which numerically integrates the partial stiffness matrices in M r to generate the local face stiffness matrix (11) for all faces in the SIPG face block.The final two loops go through all nodal combinations (i, j) in the matrices of the set M r which when integrated form the local face stiffness matrices in (11).

Reducing the number of BLAS operations
It is possible to take advantage of the structure of the matrices in the set M r to reduce the number of BLAS operations.As an example, the entry (1,1) of M r C2 can be split into two components; a component which varies with row number i represented in Figure 1 as C2t 11 on line 9, and one with column number j.All entries of M r C1 , M r C2 , M r C3 , and M r C4 can be split into two components in the same way.The component which is a function of i requires more BLAS operations and is calculated outside the inner node loop (line 9), Figure 1.The i component is then multiplied with the j component and added to glob pn on line 14.This reduces the number of BLAS calls, the computation time, and time associated with calling the routine.
An equivalent M r D2 matrix can be constructed for M D2 , (11).Unlike M r C2 , the components of entries in M r D2 which are a function of column number require more BLAS operations than those which are a function of row number.As an example, the entry (1,2) of M r D2 can be split into two components; a component which varies with column number j2 represented in Figure 1 as D2t 12 on line 10, and one with row number i2.The column index j2 is defined on (line 7) and the row index i2 is defined on (line 12), Figure 1.The multiplication of the column and row dependent variables, as with M r C2 , still occurs in the inner loop (line 14), Figure 1.Equivalently all entries of M r D1 , M r D2 , M r D3 , and M r D4 can be split into two components.The number of BLAS calls can be reduced further by utilising the symmetry of the global stiffness matrix so that only the upper triangular components of the local stiffness matrices need to be calculated.The for loop nodal indices (lines 6 and 11 of Figure 1) are therefore restricted to this part of the matrix.
However, in order to keep the size of the node index loops the same between M r C2 and M r D2 , j2 and i2 of M r D2 are looped through in reverse order (lines 7 and 12 of Figure 1).Lastly, the MATLAB indices j2 and i2 refer to variables in M r D1 , M r D2 , M r D3 and M r D4 , and the, i and j, indices refer to variables in and M r E4 .The loop indices, i and j, correspond to the node number of elements in the local matrix.The assembly of all matrices in M for a face F results in a symmetric matrix, therefore only the upper triangular entries of each matrix in M need to be computed, reducing the number of BLAS calls.For a nodal combination (i,j) the partial stiffness matrices in M r will provide a two-by-two matrix for the degrees of freedom that exist at these nodes.When considering nodes on the leading diagonal, i.e. when i==j, only the upper triangular components of the matrices in M r are required.An if statement is present (line 15 of Figure 1) so all the entries in a matrix of M r are computed only if i<j, and the lower triangular components are omitted if i==j.
To complete the global stiffness matrix formulation, the transpose of the global matrix is added to itself.To avoid doubling values on the leading diagonals of the local matrix, diagonal terms of the M r matrix are divided by 2 when i==j by half(i,j).half(i,j) is a simple script added which returns a value 0.5 if i==j and 1 otherwise.

Memory allocation
Memory for variables which increase in size during the nodal loops are preallocated, this prevents reallocation of the variables on the RAM which reduces the run time.The partial stiffness matrices in M r can be split into four sets.
Each set can be summed together to form For the faces F all the components of a set, for example s = 1 , The variable glob pn is a three dimensional array; the first dimension corresponds to the face numbers in the SIPG block, the second and third dimensions respectively correspond to the degrees of freedom of the finite element that pre-and-post-multiplied the partial stiffness matrix.As an example the local degrees of freedom of the entry (1,1) of M r C2 are a function of node numbers i and j.The degrees of freedom are provided by the variables Ai and Aj, they are used to steer the entry (1,1) into the appropriate second and third dimension of glob pn (line 14 of Figure 1).Equivalently entry (1,2) of M r D2 is a function of the node numbers i2 and j2.Here the degree of freedom numbers Bi and Bj+1 store entry (1,2) into the appropriate position in glob pn (line 16 of Figure 1).

Generating variables for blocked algorithm
When integrating an entry of a partial stiffness matrix simultaneously for multiple SIPG faces, the shape function and their derivatives for the scalar equation for that entry must be in vector form.To compute a local SIPG face stiffness matrix, K LF , information is required from both the K + and K − elements sharing a face.During mesh generation the face connectivity for all faces in the mesh F(T ) are stored in the face connectivity matrix etpl face where a column correlates to face number in F(T ), Figure 2b.
The script represented in Figure 3 runs on line 5 of Figure 1.The true representation of first two for loops, the block and Gauss point integration loop (lines 1 and 4 of Figure 3), is in Figure 1, however these loops are shown in Figure 3  The rows of dNx p, and dNy p, correspond to the same ordering of elements in etpl fac block(:,1).The columns correspond to a shape function number which is selected by i, j, i2 or j2 in Figure 1.Their computation occurs in several large matrix operations.First the Jacobian components Jxp and Jyp are computed on lines 10-11, through dNr(indx dNr(fn,gp),:) where the subscript nf b corresponds to the number of DG faces in the block.
The Jacobian determinant and its inverse are computed in an explicit manner on lines 12-14.The global shape function derivatives for the current face, fn, are calculated on lines 15-16, using The result is stored into dNx p and dNy p with index p, this ensures the element ordering remains consistent with etpl face block.
The shape functions are only dependent on their local position and therefore local value Nr.The values are stored into the matrix Np, again with index p to ensure consistent element ordering with etpl face block.
The algorithm in Figure 3 is only applicable to K + elements but with a few simple changes can be for K − elements: line 6 change etpl face block(:,3) to etpl face block(:,4), line 8 change etpl face block fn(:,1) to etpl face block fn(:,2), line 15 and 16 change dNx p and dNy p to dNx n and dNy n and lastly line 17 change Np to Nn. Gauss points along a face for K − elements are considered in reverse order so that they align with K + Gauss points in the global domain; MATLAB code: lines 153-154.

Reference Gauss point locations
and

Sparse Storage
The summation of all local face stiffness matrices forms a global stiffness matrix, K F in (11).The global numbering for the degrees of freedom along the rows and columns of the local face matrices correspond to their row and column position in the global face stiffness matrix K F .
In Figure 1 glob 2 stores all components of G 2 from (31).Equivalent storage variables exist for the remaining subscripts see Table 3.To store glob 2 into a

Partial stiffness matrices
Face term row column neg i neg j   Bi-linear quadrilateral 4 2 Bi-quadratic quadrilateral 9 3 Table 5: The number of Gauss points required for the area and face integral for different element types.
of finite elements that post-multiplied.
After all the stiffness matrices are stored into the global sparse matrix, the sparse matrix is transposed and summated (line 14), completing the global stiffness matrix.

Blocking and numerical analysis
This section demonstrates the efficiency gain obtained when using vectorised blocked scripts to generate all the SIPG local face stiffness matrices (11).All computations were performed in a native MATLAB environment using double precision float accuracy, the backward slash operator '\' is used to solve any linear system of equations.The .m file was run from a terminal using MATLAB rather than from the MATLAB GUI.All meshes were structured and homogeneous in element type, they were constructed from either, four noded bi-linear quadrilateral elements with linear basis functions in each direction, eight noded bi-quadratic quadrilateral elements with quadratic basis functions in each direction, or three noded constant strain triangular elements.For all elements the degrees of freedom existed on the nodes.The number of Gauss points for the area and face integral is displayed in Table 5.
Timing experiments on computers are susceptible to a lack of precision and accuracy, this is caused by both the computer performing background tasks and components fluctuating in temperature.When testing a range of SIPG face block sizes, the order of the block sizes was randomised and tested, this process was repeated.
faster.All calculations, if possible, were therefore made to occur in this format.

Optimum block size
There is an optimum size of vector for an element-wise vector calculation which achieves the most floating point operations per second (flops).Managing element-wise vector operations of lengths larger than the optimum size into smaller sizes, of optimum length, is a technique known as blocking.The vectorised SIPG code described in this paper is designed to blocked.If the blocking algorithm for the SIPG code is effective a peak in performance corresponding to the optimum vector calculation length is expected.This section investigates whether the code has an optimum vector length, what the length is, and the number of flops achieved for this length.
To determine the performance of the optimised SIPG code, the time to shown in Table 6.
Computer 1 and 2 have a respective theoretical peak performance of 3.04 × 10 9 and 5.7 × 10 9 double precision floating point operations per second (flops).
These peaks correlate to the fastest computation times achieved, shown in Table 7, and thus as their time is smallest are the optimum block sizes to perform the 2D SIPG area and face integrals.
Figure 8 shows computer 1's fastest performance to generate the area and face integral was ≈ 500 Mflops for a block size of ≈ 3×10  tal efficiency of the theoretical peak performance.Whereas computer 2 achieved a higher ≈ 1000 Mflops for both the area and face integral corresponding to an efficiency of 17.5%.As an optimum block size was achieved for the both the area and face integral Figure 8 demonstrates a correct implementation of a vectorised blocked algorithm to compute the SIPG global stiffness matrix with comparison to [3].The algorithm worked correctly on both computer architectures.
In comparison to an unoptimised code.Computer 1 took respectively 10.2 s and 21.2 s to compute the area and face integrals, whereas Computer 2 took 5.9 s and 17.91 s.Comparing the speed of the optimised code in Table 7 to the unoptimised code, computer 1 achieved a speed increase of 51 times for the area integral and 20 times for the face integral with a total speed increase of 24 times.The total speed increase from pure vectorisation is 13.7 times with blocking being 1.8 times faster than pure vectorisation.Computer 2 achieved a speed increase of 54 times for the area integral and 39 times for the face integral with a total speed increase of 41 times.The total speed increase from pure vectorisation is 23 times blocking being 1.8 times faster than pure vectorisation.
It is noted that for both computer architectures in Figure 8 that the block Mflop performance is still decreasing when the block size exceeds that of the number of area integrals and face integrals.This suggest that for larger problems the performance gain from block is going to be more substantial.
At the peak performance the cache reuse is maximised.After the peak the proportion of data lying outside the lowest level of CPU becomes larger and so the performance decreases.The performance is still decreasing as the block size exceeds the number of area and face integrals, marked by the vertical lines on Figure 8.This indicates that for larger problems the advantage of blocking over purely vectorised code is going to become larger but also indicates the cache reuse is being maximised.
Computer 1's cache is larger than computer 2, as larger variables can reside in the lowest level cache the optimum block size for peak performance is therefore also larger.This can also be seen in Figure 7.
A speed run on Computer 2 using Octave version 3.8.2 was also performed to verify that the code was effective in both MATLAB and Octave.Figure 8 demonstrated that a peak in performance was achieved for both the area and face integral.Similar to the tests performed in MATLAB, once the peak was reached the performance continued to decrease and only stabilised once the block size exceed the number of elements and faces.The optimal performance, in comparison to MATLAB, was also slower with the area and face integrals corresponding to a loss in performance of ≈ 2.1 and ≈ 3.82 times.Despite being slower, the speed up for the vectorised blocked when compared to an unoptimised code for the area and face integral was ≈ 113 and ≈ 41 times, much greater than that achieved with MATLAB.
The sparse formulation time was ≈ 15s for computer 1 and ≈ 9 s for computer 2. A small investigation into whether blocking arrays whilst using native MATLAB function sparse had any influence on the storage time, but it was found that only with computers with limited ram, (4Gb), yielded any performance improvement.As highlight in [3,10] using non-native MATLAB variations of the sparse command can significantly increase performance.

Algorithm validation
To validate the correct implementation of FEA code for a linear set of equations, an eigenvalue convergence test can be performed.The test involves a 2D domain, Ω, where x, y ∈ [0, 1].The mesh distribution within Ω is structured with each element having the same area.Homogeneous Dirichlet boundary conditions are applied on ∂Ω ≡ ∂Ω D .The stiffness material matrix (28) is unusually defined as A = 1, and, B and C = 0.This uncouples the degrees of freedom so that the stiffness matrix contains two 2D Poisson problems added together with a doubled mass coefficient.The smallest eigenvalue of the problem is π 2 .
An undamped dynamic system of equations for linear elasticity modelled using SIPG is where K is the global stiffness matrix, A is the mass matrix, [15], and λ 2 1 is the first natural frequency squared.The computeted convergence rates were close to the analytical convergence rates for all elements, as shown in Figure 9, [14].

Hole in plate verification
To demonstrate that the optimised 2D DG algorithm converges to correct solutions for linearly elastic problems, as well as to demonstate the perfomance gains using an optimised SIPG code, a plane stress analysis of an infinite plate with a hole at its centre subjected to a uniaxial tensile stress is now considered, [22].Here the performance of an optimised area integral as present by [3] is also analysed in conjunction with the optimised SIPG face integral presented here.
The solution to the infinite problem is provided by [23].The infinite problem is truncated at the boundary by using the stress solution as Neumann boundary conditions on ∂Ω.The reduced problem setup is provided by Figure 10.The analytical stress solution is and, where θ and r are polar coordinates and a is the hole radius see Figure 10.
A schematic of the problem is shown in Figure 10, with sides of length l = 10 m and hole radius a = 1 m.The material is modelled in plane stress with a Young's modulus of 10 3 Pa, Poisson's ratio of 0.2, with an applied far field stress of σ ∞ = 10 2 .
For this problem all three element types are used: The constant strain triangle, the bi-linear linear quadrilateral, and the bi-quadratic quadrilateral.An example of their respective meshes is shown in Figures 11a and 11b.For the constant strain triangle element and bi-linear quadrilateral element the meshes have the same number of nodes along the radius and the circumference.The biquadratic quadrilateral element has the same number of element vertex nodes along the circumference and radius.Along the circumference the nodes are equally spaced in terms of θ.Along the radius, r, a scaling factor, sf , is applied to prevent distorted elements.The ||u − u h || L 2 (Ω) error between the analytical solution for the displacement u, [23], and the computed displacement u h is calculated from This error is used in Figures 12a and 12b to validate the convergence rates for different element types, [14], as well as to compare performance gains between the optimised and non-optimised codes.
Convergences rates of 2.1, 2.0 and 4.1 were achieved for the constant strain triangle, and for the linear and quadratic quadrilaterals using the optimised SIPG code which are very similar to their analytical counterparts 2, 2 and 4, [14].This demonstrates correct implementation of the optimised code for multiple elements types for a linear elastic problem.
For the speed investigation computer 1 was used, the block size of 3 × 10 4 was used for all computations.For all elements the performance gain of the optimised code against the non-optimised code improves with problem size, Figure 12a.The initial poor performance improvement was because at a low element number the number of BLAS calls was similar for both codes.The highest performance gain of ≈ 90 was achieved by the order 1 triangle elements, the lowest performance gain of ≈ 9 was achieved by the order bi-quadratic quadrilateral elements.For both quadrilateral elements the rate of performance gain with problem decreases.For Triangular elements the rate of performance gain remains constant.
The ratio between the number of matrix calculation BLAS calls between the optimised and non optimised codes is similar to that of the performance gain for large problems.However for the triangular element this correlation breaks down.For a ||u − u h || L 2 (Ω) ≈ 3 × 10 −2 the BLAS call ratios for the area and surface are 24.3,and, 12 respectively.This is far below the performance gain in Figure 12a.This is likely due to the optimised triangle BLAS call number of approximately 500 where as for the quadrilateral optimised, and all unoptimised, codes BLAS calls exceed 5000.There is no correlation between the computational time and BLAS call number, this would indicate that the speed of the optimised for the triangle codes is longer dictated by the BLAS overhead, whereas the optimised quadrilateral codes are.

3D verification
Here a unit sided cube exist in a reference 3D Cartesian coordinate system where [x, y, z] ∈ R 3 .The cube is modelled using the SIPG method described in Section 2.1.Roller boundary conditions exist on all faces except where z = 1; here a displacement of dw = −0.25 m is applied in the z direction.The material has a Young's modulus of 1 Pa and Poisson's ratio of 0.2.
3.9 s.The total runtime to generate the global stiffness matrix was 29.3 s and 20.2 s for computer 1 and 2. Profiling reviled the MATLAB function sparse, necessary to generate the global stiffness matrix, took the majority of the time 18 s and 13 s for computer 1 and 2.

Conclusion
This paper for the first time has presented an efficient blocked vectorised algorithm for producing SIPG face stiffness terms in a native MATLAB environment for linear elasticity for a range of elements in both 2D and 3D.Optimisation was achieved by: (i) maximising the CPU cache reuse by changing the vector size for the BLAS operations; (ii) storing vectors in a column-major form; (iii) ensuring all matrix calculations were as large as possible and (iv) reducing the number of calculations by only considering symmetric terms.
The block length optimisation results demonstrate a clear optimal block length, which is consistent between all integral types and problem types.The peaks coincide with a maximisation of the cache reuse.Additionally a number of different verification techniques have been used to demonstrate correct implementation of both linear systems in both 2D and 3D.
The optimal block length for the hardware used in the study was found to be at ≈ 3 × 10 4 corresponding to a total CPU usage of ≈ 16%, similar to results found in literature.All codes were able to compute the global stiffness matrix for a 10 6 degrees of freedom system in under 30 s.
In the linear elastic 2D performance gain study, in Section 4.4, it was shown that the gain continues to increase with problem size.It was also shown that the performance gains were dependent on element type, with triangular elements achieving gains excess of 50 times.There was also a correlation between gains and the ratio of BLAS calls for the quadrilateral code.This suggests that optimised quadrilateral code still is still subject to a bottle neck from the BLAS call overhead.This was not the case for the triangular code which had significantly fewer BLAS calls.
The script could be optimised further by using the MATLAB's parallel func- M C1 , M C2 , M C3 , and M C4 % are also computed here.

2. 1 .
SIPG weak form for linearly elastic problems Here we consider the following model problem on a bounded Lipschitz polygonal/polyhedral domain Ω in R d , d ∈ {2, 3}, with the boundary ∂Ω N ∪ ∂Ω D = ∂Ω, where ∂Ω D and ∂Ω N are the portions of the boundary where homogeneous Dirichlet and Neumann boundary conditions respectively applied.The strong form of the problem, for small strain hyperelasticity, is defined as

Algorithm 1
Complete Code layout.1: for Area block do 2:

6 :
for SIPG face block do 7: 31) reside in the same location in the global stiffness matrix, therefore only one storage variable needs to be preallocated for M C1 M D1 and M E1 .In Figure 1 the storage variable glob 2 is defined for G 2 on line 1, Figure 1, for all F ∈ F(T ).Performance improvements where found when a second temporary storage variable was used during the local matrix calculation, glob pn defined on line 4, which corresponded to all faces in the current SIPG face block for the set G 2 .Once integration is completed for the current SIPG face block, glob pn is stored into glob 2 on line 21 of Figure 1.
for clarity.The third loop corresponds to the local element face number fn.This is required as local shape function values, Nr, and their local derivatives, dNr, are unique to a local element face.To compute global shape function derivative terms, dNx p and dNy p, for multiple elements simultaneously, only one local face can be considered at time.Therefore manipulation of etpl face is required to only consider SIPG faces in the current block and current local face number.etpl face contains the face information for all faces in F(T ).However as discussed in Section 3.1 the faces are split into SIPG face blocks which are considered one at a time during the vectorised computation.The information for the faces in the current SIPG face block is selected from etpl face and stored in etpl face block by the index block index.Additionally the shape functions and their derivatives can only be computed for one local face number, fn, at time governed by the face loop on line 5 of Figure 3. Therefore the face information for elements K + with the current face number fn is stored in etpl face block fn on line 7.

For
Figure 4b.The Gauss points on F + are numbered clockwise whilst on F − they are anticlockwise.This ensures that the Gauss points for two connected elements align in the global domain.The face integrals are performed with respect to the reference local face coordinate ζ ∈ [−1, 1].To determine the shape functions and shape function derivative values from ζ, it is necessary to convert from the reference line domain to reference element domain, coordinates ξ ∈ [0, 1] and η ∈ [0, 1], with

Here, a
refers to the most clockwise vertex existing at the end of the face, and b the previous vertex.As an example on the triangular element, Figure 4a, face 2, a = A and b = B but for face 1 would be, a = B and b = C.Using the values of ξ and η, mapped from ζ, the shape functions and the reference shape functions derivatives can be determined for each Gauss point location on each face.The face calculations use standard Gauss quadrature weights and locations.
face matrix K F (line 13) the row numbers correspond to the finite elements' degrees of freedom, that pre-multiplied the partial stiffness matrices, of glob 1, glob 2, glob 3 or glob 4. The column numbers represent the degrees of freedom 20 Element type # area Gauss points # face Gauss points Constant strain triangle 1 2 calculate the linear elastic SIPG stiffness matrix K was tested for different block sizes.The block size was logarithmically distributed, 10 → 10 6 , and K consisted of 10 6 degrees of freedom.The performance in terms of Mflops is presented in Figure 8 for lines 1-5 and 6-10 of Algorithm 1, the area and face integrals.The test was preformed on a 2D domain, Ω, where x, y ∈ [0, 1].The mesh distribution within Ω is structured with each element having the same area.The Young's Modulus was set to 10 Pa and Poisson's ratio had a value of 0.2.The tested computer architectures, OS, and MATLAB version used, is For the quadrilateral element, an error of ||u − u h || L 2 (Ω) ≈ 4 × 10 −2 has a BLAS call ratio ≈ 10 and a performance gain of ≈ 10 for both the area and face integral.The same can be said for the linear quadrilateral when considering an error of ||u − u h || L 2 (Ω) ≈ 4 × 10 −2 .The SIPG algorithm has a performance gain for the area and face integral of ≈ 36 and ≈ 27 with a corresponding BLAS call ratio of ≈ 36 and ≈ 27.

Figure 2 :
Figure 2: (a) Example of 3 elements in a 2D DG mesh.Arrows indicate the outward normal direction, values in a box the element number and values on the element edge the local face number.(b) The transpose of the matrix etpl face for DG faces a and b on Figure 2a.nx and ny are the outward normal components, and h is the length of the face, (K + ) is the positive element number with face (f + e ) and (K − ) is the negative element number with face (f − e ).

Figure 4 :
Figure 4: K + gauss point ordering and face ordering for both the constant strain triangular element (a) and bi-linear quadrilateral element (b), and bi-quadratic quadrilateral element (c).ξ and η are the coordinates in the reference element domain, ζ is the coordinate in the reference line domain and g# is the a gauss point number specific to a face number.Note for a face on a negative element the positions of g1 and g2 will be reversed.

Figure 5 : 6 % 6 %
Figure 5: Segment of Matlab script for storing glob 2 into a sparse matrix, where ed is a matrix describing the global degree of freedom numbering for each element.MATLAB code: lines 311-350

Figure 6 :Figure 10 :
Figure 6: A MATLAB script to investigate how the orientation of vector entry-wise multiplications is affected by array orientation.

Figure 11 :
Figure 11: The element mesh distribution for the problem in Figure 10 with triangle elements (a) and quadrilateral elements (b).

Table 3 :
Storage variables and their associated row and column degree of freedom numbers.The row and column degrees of freedom, i and j, have a prefix pos and neg.pos and neg correspond to pre-or-post multiplication of N + , or B + , and N − , or B − .
global stiffness matrix it is first rearranged into a vector form with the MAT-LAB function reshape, line 4 of Figure5.The new data structure of glob 2 is described in Table4.When steering glob 1, glob 2, glob 3 or glob 4 into global

Table 4 :
Reshaping of glob 2 into a vector form glob 2 rs for MATLAB function sparse.

Table 7 :
Fastest computation times of the area and face integral for computer 1 and 2, corresponding to the peak values in Figure8.
parfor and incorporating GPUs into the calculation.The final scripts are designed to be a black box, taking in element topology and outputting the global stiffness matrix for a SIPG problem.Column degrees of freedom for all local stiffness matrices post-multiplied by a K + element, corresponding to global storage vectors glob 1 rs and x, for all F + faces in the current loop.K[max(ed(:)),max(ed(:))] Global stiffness matrix.* dNx p(:,i).* nx)+(C.* dNy p(:,i).* ny)).* int W; % Components of the remaining entries which vary in i of: