Computers and Mathematics with Applications

MATLAB is adept at the development of concise Finite Element (FE) routines, however it is commonly perceived to be too inefficient for high fidelity analysis. This paper aims to challenge this preconception by presenting two optimised FE codes for both continuous Galerkin (CG) and discontinuous Galerkin (DG) methods. Although this has previously been achieved for linear-elastic problems, no such optimised MATLAB script currently exists


Introduction
Finite Element Analysis (FEA) has become a frequently used computational technique in modern industry. The method has had a substantial precedent within the engineering community since the initial concept was first established in the 1940s, originating in the field of structural mechanics. Its development has been greatly aided over the years by the exponential increase of computational power and memory storage, allowing high fidelity problems to be solved rapidly.
Finite element codes can be developed in a number of computing languages including C, C++ and Fortran, however these compiled codes can require thousands of lines of code to be written. Conversely, high-level computational environments, such as MATLAB, allow equivalent routines to be written in a fraction of the time and with far fewer lines. Nonetheless, MATLAB is often overlooked for finite element analysis (FEA) due to its computational inefficiencies, particularly when executing for-loops. To overcome this, a number of optimisation techniques can be implemented in order to reduce run-times. Without these optimisation routines, MATLAB simply cannot compete, in terms of computational speed, with alternative languages for FEA.
MATLAB utilises the Linear Algebra PACKage (LAPACK), written in Fortran, which performs mathematical operations by calling Basic Linear Algebra Subroutines (BLAS) [1]. The poor performance associated with MATLAB for-loops is primarily due to the accumulation of RAM-to-cache overheads, 1 which arise each time data is sent to and from cache. 2 To negate this, data can be sent in the form of a large vector, the BLAS calculations performed in cache before the data is transferred back to RAM. This significantly reduces the overheads associated with the calculation compared to non-vectorised BLAS calls. Optimum performance is achieved when the size of the vectors, referred to as 'blocks', sent to BLAS match the CPU cache memory size such that high speed cache reuse is maximised. Should the size of the block exceed the cache memory, LAPACK executes data-management algorithms and sub-optimal performance is realised [1].
In 2008, Dabrowski et al. [2] published an open source MATLAB code capable of solving a 2D linear elastic problem with 10 6 degrees of freedom in under one minute (MILAMIN). The algorithms, however, include a number of nonnative MATLAB functions, which are written in C. Whilst the use of these non-native functions can improve the overall performance, the code becomes less transportable as the MATLAB software must have the SuiteSparse [3] package installed. Consequently, Bird et al. [4] conducted an investigation into extending the work of Dabrowski et al. to present a highly optimised linear elastic using only native MATLAB functions. Other work in this area includes the extension of the Dabrowski et al. [2] techniques to include parallelisation [5]. More recently, vectorisation techniques have been applied to different element formulations [6,7], comparisons made between different algorithms [8] and general routines devised which are applicable to multiple vector languages [9]. Although the focus of this paper is on non-linear solid mechanic, it is worth noting that the MILAMIN framework [2] has recently been extended to simulate the non-linear deformation of incompressible Newtonian and non-Newtonian fluids [10].
Bird et al. [4] also extended the algorithm to discontinuous Galerkin methods (DG) for linear elastic problems. The DG method introduces an alternative way to assemble and solve the system in question, whereby discontinuities between elements are permitted by averaging the jumps of variables at element interfaces [4]. In contrast to the standard continuous Galerkin (CG) methods, the degrees of freedom at the nodes are not shared by neighbouring elements. To maintain stability and ensure a realistic system response, the local stiffness calculations include a penalising term, which prohibits complete separation of elements faces. DG methods aid in localised p-and h-type refinement of a problem as no intermediate mesh is required to negate hanging nodes [11]. However the calculation of the element stiffness matrices requires additional integral terms and, therefore, is more computationally expensive than CG methods. This further highlights the need for an efficient calculation of the stiffness terms. Moreover, due to the element specific degrees of freedom, the global stiffness matrix is larger than that of a corresponding CG mesh, resulting in longer linear system solve times.
This paper extends the work of [2] and [4] to present two optimised elasto-plastic codes using CG and DG methods. Elasto-plasticity, also referred to as material non-linearity, describes a situation whereby the stress is no longer linearly dependent on the strain. The introduction of elasto-plasticity further improves the FEA approximations, which otherwise do not take into account the non-linear material properties. The first code formulated in this paper builds upon the work of Coombs et al. [12], who present a concise 70-line non-optimised MATLAB script to solve material and geometrically non-linear problems. This code acts as a framework upon which several optimisation techniques are implemented. By employing similar techniques to those used in the optimised CG code and extending the code formulated by Bird et al. [4], a second optimised code is presented for the incomplete interior penalty discontinuous Galerkin (IIPG) method for elasto-plastic problems.
This paper first provides the underlying theory behind DG finite element methods and elasto-plasticity in finite element analysis, before providing a breakdown of the key sections of each algorithm where these concepts are applied. The results obtained from the code are verified against a problem with an analytical solution, before providing performance testing to determine the overall speed gains achieved through optimisation. As the focus of this paper is on the efficient implementation of elasto-plasticity for CG and DG finite element methods, the equations are presented in matrix-vector format to facilitate implementation. We restrict the paper to two-dimensional analysis, although the algorithms can equally be applied to one and three dimensional non-linear problems.

Formulation of the weak form for linear-elasticity
The strong form equation is the governing partial differential equation which requires equilibrium to be satisfied at every point within the domain. For the linear-elastic problem, the strong form can be written as where Ω ⊂ R n is the domain, with boundary ∂Ω= ∂Ω D ∪ ∂Ω N , [L] is the standard differential operator, {σ } and [σ ] are the Cauchy stress in vector and matrix form, respectively, {f } is vector of the body forces, {n} is the unit outward normal to the boundary and {g N } and {g D } are the prescribed Neumann (traction) and Dirichlet (displacement) boundary conditions. To implement the governing equations in finite elements, the weak form of the equation must be derived, such that equilibrium is satisfied in a weighted average sense across all elements in the domain. Also, meshes and finite element spaces have to be defined. The finite element mesh is denoted by T h and it is assumed to be shape-regular and to be constituted by quadrilateral elements. Single elements are denoted by E and faces by F . The set of all interior faces of the mesh T h is denoted by E h . The CG and DG finite element spaces defined on T h are denoted by V CG p (T h ) and V DG p (T h ), respectively, where p is the order of the elements. In the CG case we assume that V CG As the focus of this paper is on algorithmic and performance issues associated with the MATLAB implementation of non-linear solids mechanics, details of the adopted CG and DG formulations are not given in this section. Instead the interested reader is referred to the Appendix for details of the adopted CG and DG methods.

Extension to elasto-plasticity
In contrast to the linear elastic problem, for elasto-plasticity the stress in the domain is no longer linearly dependent on the nodal displacements and the stiffness of the material varies through the physical domain. The fundamental concept of the finite element method is that the determined solution must leave the system in a state of equilibrium. That is, the sum of the applied external forces is equal and opposite to the sum of the element internal forces. This underlying principal is referred to as the non-linear incremental equation of equilibrium where {f oobf } is the residual out-of-balance force vector and {f ext } is the sum of externally applied loads, including body forces and tractions. {f int } denotes the internal forces at each node summed from the element internal force vectors {f e }.
Consequently, it is essential to formulate a set of equations to calculate the internal forces in each element.

Elasto-plasticity
This section will outline the underlying concepts of elasto-plasticity which are implemented in the finite element codes.

Loadsteps
Material non-linearity provides a situation whereby the relationship between the stress and strain varies as the material yields.

Newton-Raphson convergence of out-of-balance forces
By combining the equilibrium concept with the Newton-Raphson method, it is possible to iterate to a solution for each loadstep in a number of steps (where n + 1 denotes the current loadstep number and k the Newton-Raphson iteration number): 1. The process will begin at a previously converged state, where {f oobf } = {0} (point (i) in Fig. 1

5.
Repeat steps (2)-(4) until the normalised residual out-of-balance force converges to a given tolerance, that is in this paper the tolerance is set to 1 × 10 −9 .

Associated flow perfect plasticity
Here we restrict the material to linear elastic, perfectly plastic behaviour with associated plastic flow and a von Mises yield surface (Prandtl-Reuss constitutive model). This is the simplest, general, form of elasto-plasticity, however the algorithms presented in this paper are applicable to more sophisticated forms of non-linear material behaviour.
In 1913, R.E. von Mises [13] postulated an analytical yield surface f , used to define the admissibility of a given stress state. The yield function can be expressed as where ρ is the deviatoric stress, and ρ y is a limiting yield stress. The deviatoric stress ρ is defined as where J 2 is the second stress invariant determined by where [s] is the traceless deviatoric stress matrix, [I] is a 3 × 3 identity matrix, and tr(·) is the trace operator. The von Mises yield surface can most easily be visualised in the principal stress space as an open-ended cylinder with its major axis aligned with the hydrostatic axis (where σ 1 = σ 2 = σ 3 ) with a radius of ρ y . It should be noted that, in this paper, the yield surface remains at a fixed radius, ρ y , due to the perfect plasticity assumption.
For associated flow plasticity theory we assumed that the outward normal to the yield surface provides the flow direction controlling the evolution of plastic strains, that is whereγ is the scalar plastic multiplier rate (or consistency parameter). This plastic multiplier must satisfy the Kuhn-Tucker-Karush consistency conditions These conditions enforce that the material must either be on the yield surface undergoing elasto-plastic deformation (f = 0 andγ ≥ 0) or inside the yield surface with purely elastic behaviour (f ≤ 0 andγ = 0). The above plasticity equations are given in rate form, however in order for them to be implemented within a finite element algorithm, or even used as an stress-strain relationship to predict material behaviour at a single point, they must be integrated into an incremental relationship between stress and strain. This stress integration is the focus of the next section.

Stress integration
The key question that stress integration algorithms are required to answer is: given a current stress state, {σ n }, which is subjected to a strain increment, {∆ε}, what is the updated stress state, {σ n+1 }? Algorithms to answer this question for elasto-plastic constitutive models are normally split into three categories, namely: (i) explicit, (ii) implicit and (iii) exact stress integration. Several papers explore and contrast different stress integration approaches and an interested reader is refereed to the works of [14][15][16], amongst others. Explicit stress integration methods (see for example the initial work of Ilyushin [17] which was later applied to Prandtl-Reuss by Mendelson [18] and generalised by Nayak and Zienkewicz [19]) have advantages in terms of their simplicity but they do not enforce the consistency conditions at the updated stress state [20]. Implicit stress integration algorithms are normally formulated in terms of a prediction step followed by a correction for stress stated that induce elasto-plastic behaviour. These approaches were initially formulated by Wilkins [21] and have been developed over the last 50 years [22][23][24][25][26], including some analytical (closed-form) techniques [27][28][29] and general approaches that can be applied to arbitrary yield envelopes [30][31][32]. For textbook accounts of implicit methods see [33,34], amongst others. The key advantage of implicit methods over the earlier explicit approaches, despite their additional computational complexity, is that they rigorously enforce the consistency conditions at the new stress state. They also allow for larger stress increments to the applied to the constitutive model. However, in this paper we adopt an exact stress integration approach for von Mises elasto-plasticity of Wei et al. [35] for reasons outlined below.
The primary advantage of exact stress integration methods [15,[35][36][37][38][39] is that they remove any errors associated with the stress integration process. However, exact stress integration methods are generally considered to be too expensive for routine engineering stress analysis [39]. In this paper we challenge this criticism as, due to their closed-form nature, exact stress integration routines are far more amenable to vectorisation as compared to iterative approaches (such as implicit stress updating). See Wei et al. [35] for details of the stress integration algorithm and consistent tangent, and Coombs et al. [12] for an example conventional (non vectorised) implementation. Section 4.7 provides details on the efficient implementation of the exact stress update algorithm.
An additional advantage of adopting an exact (or implicit) stress update algorithm is that it allows determination of the algorithmic consistent tangent [40,41] which facilitates optimum convergence of the global non-linear problem (3). The benefits of using the consistent tangent operator have recently been demonstrated by Duretz et al. [42]. This tangent is the linearisation of the stress updated algorithm with respect to changes in the trial elastic strain state and replaces [D] in the global stiffness matrix, [K ], assembled using (A.2). The algorithmic consistent tangent derived by Wei et al. [35] can be expressed as where η i are variables that depend on the stress state and the deviatoric component of the applied strain increment, {∆ϵ}. {s n } is the vector form of the deviatoric stress, [s] see (7), evaluated at the previously converged (or initial) stress state. As η 2 ̸ = η 3 , the algorithmic consistent tangent is non-symmetric and the numerical implications of this will be explored in Sections 4.7 and 4.8.

Elasto-plastic optimised CG algorithm
Algorithm 1 outlines the code structure used for the optimised elasto-plastic MATLAB algorithm, highlighting the key sections of code, which are further explained in the subsections that follow. However, before looking at specific aspects of the code it is helpful to explain the main speed-up mechanism -blocking.
Standard finite element implementations loop over each Gauss point for each finite element and determine the stiffness contribution of each Gauss point to their parent element and then assemble the element stiffness matrix into a global stiffness matrix. The critical speed up mechanism for the algorithm shown in Algorithm 1 is blocking. Elements are grouped into blocks, where the optimum block size is dependent on the physical architecture of the machine used to perform the assembly calculations, and then vectorised stiffness calculations are performed for all the elements in the block for a specific Gauss point location. The stiffness contributions are then assembled into a global stiffness matrix.

Mesh set-up and block size determination
Line 1 of Algorithm 1 denotes where the problem set-up and script initialisation occur. The mesh set-up is achieved by calling a MATLAB function, which returns the element coordinates coord, the element topology etpl and boundary conditions bc. Furthermore, the material properties of Young's modulus and Poisson's ratio are returned, which are used to calculate the elastic stiffness and compliance matrix. These matrices are used several times later in the code, and hence it is favourable only to declare these variables once in the code. Additionally, it is here that the material yield stress is defined.
The number of elements per block nelblo is determined from a user input, referring to the size of the vectors sent to BLAS, which is dependent on CPU cache size. Finally, memory allocation is declared for two vectors used to store all of the element stiffness matrices and element internal force vectors, K_all and felem_all respectively, until the global stiffness matrix is assembled. The allocations are defined as K_all=zeros(nels,nDoF,nDoF) and felem_all = zeros(nels,nDoF), where nels is the number of elements in the domain and nDoF is the total number of degrees of freedom for an element.

Out-of-balance residual calculation
Lines 3 and 4 of Algorithm 1 are responsible for updating the external force vector fext and calculating the residual out-of-balance force oobf, respectively. Fig. 2 shows the corresponding code fragment. It can be observed that the external force vector is updated upon each load step lstp, as an appropriate proportion of the total external force applied fext0, where no_lstps is the total number loadsteps.
The residual out-of-balance force is calculated as the difference between the internal force vector fint and the sum of the external and reaction forces (fext and react respectively). It should be noted that the L2 norm variable oobfnorm is assigned a value of twice the Newton-Raphson tolerance NRtol, such that the while() loop is entered on the first iteration of each loadstep.
Furthermore, an iteration counter NRit is included to ensure the number of iterations does not exceed the user defined maximum number iterations NRitmax, should the solution fail to converge.

Linear solution for displacement increment
Line 6 of Algorithm 1 is where the system solver function is called. Using the out-of-balance force array oobf and the global stiffness matrix from the previous iteration K, the solver returns the incremental displacements and reaction forces, subject to the system boundary conditions. The solver uses the native MATLAB function mldivide(), which is capable of solving a large system of linear equations.

Element block memory allocation
Line 8 of Algorithm 1 denotes where the memory allocations for the current block take place. Memory is assigned to the following variables: the Jacobians Jx, Jy, inverse Jacobians invJx, invJy, the block storage local stiffness matrices K_block, the block storage element internal force vector felem and the updated stress vector sig_new. The variables K_block and felem act as temporary storage for all elements considered within the block before they are stored in K_all and felem_all, respectively. Table 1 presents the size of each variable to which memory is assigned in this section, where nD is the number of degrees of freedom at each node. At the start of each loop, K_block and felem are required to be set to zero, so as to eliminate the possibility of incorrect addition during the Gauss point loop.

Gauss point initialisation
Line 10 of Algorithm 1 denotes where the derivatives of the shape functions, with respect to the global coordinates, are determined dNx, dNy. The method used is consistent with that of [2] and [4] for the volume and surface terms and, therefore, will not be discussed.

Strain increment calculation
Line 11 of Algorithm 1 denotes where the strain increment is calculated at the Gauss point locations. To achieve maximum speed gains, a set of expressions must be derived such that the strain increment {δε} can be simultaneously calculated for the current Gauss point of every element in the block. For plane-strain/stress problems, the strain increment expressions can be written as where δu and δv are the incremental displacements at the nodes as returned from the solver. ∂N ∂x and ∂N ∂y are the shape function derivatives with respect to the global coordinates x and y respectively, and n en is the number of element nodes. By multiplying out (11), expressions are obtained for each strain direction as outlined in the code fragment shown in Fig. 3, where deps is the strain increment and du, dv are the incremental displacements in x and y respectively.

Constitutive model
Line 12 of Algorithm 1 is where the constitutive model is called. This function is used to assess the stress state of each Gauss point, returning an updated [D] matrix. It should be noted that this constitutive model is consistent with the function implemented by Coombs et al. [12], however it has been rewritten in a vectorised block form to reduce the number of calls to BLAS. This requires the separation of all the Gauss points in the current block into those acting elastically and those undergoing elasto-plastic deformation. This is done by implementing the von Mises yield surface introduced in Section 3.3, shown in the code fragment in Fig. 4. 3 Here, epsEtr is the trial strain state of the Gauss point, calculated by adding the strain increment deps to the previously converged strain state epsEn. By multiplying this trial strain state by the elastic stiffness matrix De, the trial stress state sig_tr is obtained for every element in the current block.
An expression for the yield function can be derived by substituting (7) into (6), and subsequently substituting the resultant equation into (5). This gives Within the code, this is implemented by first calculating the trace of the deviatoric stress matrix in vector form, which is assigned to the variable s_tr. The squaring of s_tr is calculated using sum(s_tr. * s_tr,1), such that a single scalar quantity of ρ is obtained for each Gauss point in the block. Consequently, the vector of the yield function f is calculated. Two logical statements are made using this vector: elast_indx = find(f<0) and plast_indx = find(f>=0). These statements return vectors containing the locations of the elastic and plastic Gauss points within the f vector where f<0 and f>=0 respectively. That is, elast_indx will contain an index for the elements that contain an admissible stress state for the Gauss point in question, which are therefore elastic. Conversely, plast_indx denotes the elements for which the stress state of the Gauss point lies on or outside the yield surface, which will undergo elasto-plastic deformation.
The variable plast_indx is also used as a logical indicator, as shown in Fig. 4. The script section bounded by 'if plast_indx' will only be entered if there exists one or more plastic Gauss point in the current block. Should this be the case, this section of code will be executed in order to calculate the [D] matrix for all elasto-plastic Gauss points. See [35] or [12]  . Additionally, the updated stress and strain states are calculated and outputted from the function, epsE and sig_new respectively. Again, these variables are plastic dependent and, therefore, also require assembling using elast_indx and plast_indx.

Local stiffness calculation
where a, e and i retain their original values. Although this results in sub-optimal convergence through a small error in the global stiffness tangent, the speed gained through the assembly of the local stiffness matrices results in an overall performance improvement. It should also be noted that the performance of the solver is enhanced when solving a symmetric matrix. From here, a similar technique is used to those evident in [4] and [2], whereby the upper triangle of the stiffness matrices is calculated explicitly by looping over the element nodes. Each term within the stiffness matrix is a function of the values from the [D] matrix, the shape function derivatives dNx and an integration weighting w. This Jacobian determinant detJ is included within w. The key difference is that [D], in general, can be different at each Gauss point in each element.

Local internal force calculation
Line 14 of Algorithm 1 denotes where the local element internal force vector is calculated. The expression for this is obtained by summarising (A.6) as By multiplying out this equation, expressions are derived for each element internal force vector entry. The implementation of this is achieved by looping over the element nodes, as shown in Fig. 5.

Assignment to temporary block storage
Line 16 of Algorithm 1 denotes where the local stiffness and internal force vectors for the current block are assigned to temporary variables, K_all and felem_all respectively. These variables remain unused until the calculations for all blocks are complete. The variables are subsequently used to assemble the global stiffness matrix and internal force vector.

Global stiffness matrix and internal force vector assembly
Line 18 of Algorithm 1 refers to where the global stiffness matrix K and global internal force vector fint are assembled. The global stiffness matrix is assembled using the native MATLAB function sparse(), a memory efficient way to store large matrices by only storing non-zero terms and their corresponding matrix locations.
The assembly of the global stiffness matrix is therefore achieved by executing K=sparse(ed_i,ed_j,K_all,tndof,tndof), where ed_i, ed_j are the steering vectors for rows and columns respectively. 4 The overall size of the matrix is defined by tndof, the total number of degrees of freedom for the system. The assembly of fint is achieved by executing where ed is a vector denoting the degrees of freedom at each node. The MATLAB function accumarray() is used to construct an array with accumulation. This is necessary here, where the internal forces at nodes can contain contributions from more than one element.

Update out-of-balance residual force
Line 19 of Algorithm 1 is where the out-of-balance residual force is recalculated using the newly calculated internal force vector. The L2 norm of the subsequent vector is obtained using the native MATLAB norm() function. This value is then compared to the Newton-Raphson tolerance and, if smaller, the loadstep has converged and the while-loop is exited.

Set reference variables equal to converged values
Line 21 of Algorithm 1 denotes where the reference variables are set to the converged values from the completed loadstep. The reference variables required for the next loadstep are the total elastic strain vector epsE and the total displacement uvw, which are assigned to epsEn and uvwold respectively.

Elasto-plastic optimised algorithm in DG
The elasto-plastic optimised DG code takes the same form as the CG code, however it requires one extra loop to determine the stiffness and internal force contributions from the element faces. The DG algorithm can be expressed by simply including the additional pseudo code presented in Algorithm 2 between lines 17 and 18 of Algorithm 1. The stiffness and internal force calculations discussed in the CG code can be directly implemented into the DG code, as this is identical to the area integral required for term (Q ) in (A.2). It should be noted that lines 1 to 6 are unchanged from the DG linear-elastic code presented by Bird et al. [4] and, therefore, will not be discussed.

Strain increment calculation
The DG strain increment calculation for the DG face terms is performed on Line 7 of Algorithm 2. As in the CG code, the strain increment due to increased externally applied load must be calculated. However, the DG code differs such that the strain increments must be obtained at the current Gauss point for both the + and − elements. Fig. 6 shows the implementation of this within the code for the first component of strain δε xx only. Here, deps_p and deps_n denote the positive and negative strain increments respectively, while dNx_p, dNy_p and dNx_n, dNy_n are the positive and negative shape function derivatives respectively. Additionally, the incremental displacements of the positive and negative elements are denoted by du_p and du_n respectively. That is, for the positive element face ∂E + , the strain state from the previously converged loadstep epsEn_surfp and the Algorithm 3 Local internal force calculation for DG. 1: for Loop over element nodes do 2:

Constitutive model for positive and negative Gauss points
Calculate penalty contribution terms 3: end for 4: for Loop over element nodes do 5: Calculate the degrees of freedom for current node 6: Calculate the surface contribution terms 7: Assemble local internal force using penalty and surface contribution terms 8: end for positive strain increment deps_p are sent to the constitutive model. This is also the case for the negative element face ∂E − , using epsEn_surfn and deps_n instead.
Each Gauss point is determined to be elastic or plastic using the same von Mises criteria discussed in Section 4.7 and the appropriate [D] matrices are returned. Again, the values of these matrices are returned in MATLAB structure form, where Dp.a. . . i and Dn.a. . . i denote the positive and negative [D] matrices respectively. It should be noted that the values a . . . i take the same asymmetrical form as shown in (13). Also returned are the updated non-linear stress states sig_p and sig_n and the strain states epsE_surfp and epsE_surfn, for the positive and negative surfaces respectively. By executing two nested for-loops over the element nodes, the stiffness terms for each node's degrees of freedom can be calculated. The repeated terms that arise from multiplying out the R 1...4 expressions vary only in columns and, therefore, lie outside the second nested for-loop. Consequently, these terms are only calculated nen number of times, reducing the number of calls to BLAS and, thus, reducing the overhead time. However, due to the omission of the symmetric DG term, the entirety of each stiffness matrix must be determined. This differs from the optimised linear elastic code, whereby only the upper triangle is calculated and subsequently transposed.

Local stiffness calculation
Within the nested for-loop, the R 1...4 stiffness terms are calculated and placed into the correct location within the matrix, using steering terms mult_i and mult_j. The stabilising terms S 1...4 are also calculated here, which are derived from the derivatives of the shape functions, nodal displacements and integration weightings. These stabilising terms are subtracted from the R 1...4 terms, in accordance with (A.2), to obtain the overall stiffness matrix contribution for the Gauss point in question. On each Gauss point loop, the stiffness matrix contributions from the current Gauss point are summed with those from the previous loops.

Local internal force calculation
Algorithm 3 describes the code structure used to calculate the internal force contributions, line 10 of Algorithm 2, from terms R 1...4 and S 1...4 , utilising two sequential for-loops. The initial loop calculates the internal force components, which arise from the penalty terms S 1...4 and are functions of the shape functions and the node displacements for both the positive and negative elements. The second loop is used to calculate the internal force components arising from the surface integral terms R 1...4 and are functions of the outward facing normal vectors and the non-linear stress components of the positive and negative elements. The overall element internal force components at each element degree of freedom can subsequently be determined by combining the contributions from the area integral (calculated at line 14 of Algorithm 1), the surface integral and the penalty terms.
It should be noted that the first for-loop cannot be nested within the second, as the penalty term is a cumulative variable, which requires information from every element node before it can be used to calculate the internal forces at each degree of freedom.

Assignment to temporary storage block
This occurs on Line 12 of Algorithm 2 and is equivalent to Line 16 of Algorithm 1 in CG, with the inclusion of two temporary variables for the positive and negative element internal forces, felem_p_all and felem_n_all respectively.

Numerical results and discussion
This section presents the analysis of a two-dimensional elasto-plasticity problem with a known analytical solution. The section initially focuses on determination of the optimum block size for the hardware/software used and then investigates the speed-up of the optimised algorithm.

Determining the optimum block size
Optimum performance is reached by maximising the cache reuse and, therefore, is CPU dependent. Performance testing was carried out to determine the optimum block size for the computer in use. All simulations were run within the Windows 7 OS on version MATLAB R2015b. The CPU used an Intel Xeon E5-1620 V2@3.70 GHz processor with 64GB of RAM.
To determine the optimum block size, a for-loop was set up around the vectorised constitutive model (the bottleneck in the vectorisation of the non-linear finite element algorithm), which increased the block size by 200 on each loop. All variables were cleared at the beginning of each loop and the stress, strain and incremental strain vectors were subsequently declared. These variables were consistent for each block size, to ensure the tested trial stress state was deemed inadmissible, such that the plastic section of code was executed. The variables were repeated to match the block size, using the MATLAB repmat() function.
Each loop was timed and subsequently divided by the block size to determine the average calculation time per Gauss point in the block. By plotting the inverse of these times, the peak speed can be observed, as shown in Fig. 7. It should be noted that each loop was executed 40 times, from which the mean time was calculated, to reduce the effect of speed variation due to CPU background tasks. Here, the optimum block size was observed to be ≈ 13000 and, henceforth, this was the block size used for all simulations 5 . The grey region in Fig. 7 is the region where cache-reuse is maximise and the overhead associated with data transfer is relatively insignificant. For block sizes > 15000, the routine requires more memory than available in the CPU cache and this inhibits cache reuse (see Dabrowski et al. [2] for a discussion on this point). Note that the block size is not influenced by the number of Gauss points used to integrate the finite element as the Gauss point loop contains the blocking algorithm.

Result validation against an analytical solution
The results obtained for both CG and DG were validated using a problem with an analytical solution. The problem analysed was the stretching of a double-notched plate, initially presented by Nagtegaal et al. [43] for small strain plasticity.
The plate assumes a Young's modulus E = 206.9 GPa, a Poisson's ratio ν = 0.2 and a yield stress ρ y = 0.45 GPa.
The height and width of the specimen were defined to be 30 mm and 15 mm respectively, with a 2 mm unit linking ligament at mid height, as shown in Fig. 8(ii). For this implementation, a plane-strain assumption was made, such that the total out-of-plane strains were assumed to be zero. Due to the symmetrical nature of the problem, only one quarter of the specimen was discretised using 75 bi-linear four-noded quad elements integrated using 2 × 2 Gauss quadrature (four points per element). For this geometry, Nagtegaal et al. [43] postulates a small strain analytical load due of F lim ≈ 2.673kN. Fig. 8(i) shows the force-displacement graph for the first 6 refinements of the mesh, run using the optimised CG script. Each refinement halves the element size in both directions, hence quadrupling the number of elements. In each case, a displacement of 2 mm was applied to the top edge of the plate over 20 equal loadsteps.
It is clear from the responses that the numerical solution from the CG code converges towards the correct analytical solution. It is also evident that the same response is achieved in DG, as shown in Fig. 9, which compares the error after each refinement. The degrees of freedom for each of the meshes is given in Table 2.

Average-symmetric vs non-symmetric [D] matrix
This section assesses the effect of applying the average-symmetric [D] matrix in the CG optimised code. The effect of the matrix symmetricity is analysed for both the Newton-Raphson convergence rate and the system solver performance. Fig. 10 presents two convergence graphs for loadstep 5 of the fourth refinement. It is evident from both graphs that using a non-symmetric [D] matrix results in a quicker convergence rate, such that it requires two fewer iterations. It can also be seen from Fig. 10(ii) that the logarithmic gradient of convergence of the non-symmetric matrix is ≈2 or greater for each iteration. This suggests correct implementation, as the Newton-Raphson method should asymptotically approach quadratic convergence if the correct tangent is used. Conversely, the average symmetric matrix converges sub-optimally. This is clearly apparent in Fig. 10(ii) whereby the logarithmic gradient of convergence is < 2 for all iterations. This is due to the error in the global stiffness tangent acquired when forcing the [D] matrix to be symmetric. Although this results in the need for additional iterations per loadstep, dramatic speed gains are achieved in both the assembly of the local stiffness matrix and the solver time. This is shown in Fig. 11, which shows the time to assemble the local stiffness matrices, the solver time and the overall time for the first six refinements. The speed gains are clearly evident for all three measured sections of code, with maximum  speed gains obtained for the sixth refinement. Here, applying the average-symmetric matrix method yielded speed gains of ×1.58, ×3.05 and ×1.94 for the local stiffness assembly, solver time and overall simulation respectively.

Identification of performance bottlenecks
It is possible to identify performance bottlenecks by analysing the time spent in various sections of the code. Fig. 12 shows a time breakdown of the DG optimised code, where the solver time, surface integral time, volume integral time and sparse assembly are plotted as a percentage of the overall simulation time. The graph clearly highlights the dominance of the solver time for large problems. For the sixth refinement, the solver time accounts for 76% of the total simulation time, acting as a performance bottleneck. This outcome was also observed within the optimised CG code, however the asymmetrical nature of the DG global stiffness matrix results in greater performance degradation.

Overall performance improvements
The overall performance improvements were analysed by comparing the elasto-plastic CG and DG optimised codes to their corresponding non-optimised MATLAB scripts for the notched plate problem. The simulation times for the first six refinements can be observed in Fig. 13 and Table 3 for both CG and DG optimised and non-optimised scripts.
For the CG code, the maximum speed gain of ×25.7 is achieved for refinement 4. Beyond this, the solver time begins to dominate and the speed enhancement is marginally reduced, with a speed gain of ×22.0 for refinement 6. Maximum speed gains of ×10.1 are obtained for refinements 3 and 4 of the DG code, beyond which the solver time dominates heavily. By excluding the solver time, set-up time and sparse assembly time, a true comparison can be made between   the optimised sections of code and the corresponding non-optimised sections. Table 4 provides a summary of this data, showing maximum speed gains of ×52 and ×47 for CG and DG respectively.
The total number of iterations and the run time per Newton-Raphson iteration are given in Table 5. The run time per iteration is based on the total time minus the solver time, set-up time and sparse assembly time, that is the run times given Table 4. The maximum speed gain per iteration for the CG method is ×55 which is greater than the speed gain based on Table 4. This is due to more iterations being required in the optimised symmetric CG algorithm which has sub-optimal convergence, as shown in Section 6.3. However, in the authors' opinion quantifying the speed gains in terms of the total simulation time is more appropriate for non-linear algorithms if they require different numbers of iterations to find convergence. It is worth noting that modern BLAS libraries support multithreading and in order to investigate the impact of multithreading, MATLAB was initiated with a single thread 6 and the run times with a single thread compared with the default four threads for the optimised CG code. The runtime change moving from four threads to a single thread was insignificant apart from the 6th refinement where the runtime increased by 6.7%. However, the majority of this increase in runtime was in the solver, with an increase of 14.1%, and in the constitutive model, with an increase in runtime of 5.8%. The runtime change of the rest of the algorithm combined was negligible confirming that the majority of the stiffness matrix calculations are memory bandwidth bounded rather than compute bounded.

Conclusions
This paper presents two optimised codes capable of setting up and solving elasto-plastic problems in 2D, for CG and DG. Firstly, the optimised CG code was developed, building upon the work of Coombs et al. [12] who present a 70-line MATLAB script to solve for material and geometrically non-linear problems. In this code, the major performance bottleneck is observed from the large number of calls made to BLAS, resulting in the build up of overheads. The accumulation of these overheads predominantly occur in the constitutive model, where a high number of mathematical operations are performed on every Gauss point within the domain. By implementing blocking algorithms, initially proposed by Dabrowski et al. [2], the number of calls to BLAS are significantly reduced and, therefore, so are the overheads. This results in speed gains of ≈100 times in the constitutive model alone.
It was also found that the [D] matrix is often non-symmetric for plastic Gauss points. This has a negative consequence on the formulation of the local stiffness matrices [k e ], such that all the matrix terms must be calculated. To negate this, the code adopts a technique which forces the plastic [D] matrices to be symmetric by averaging the off-diagonal terms.
Consequently, only the upper triangle of the local stiffness matrices is required to be calculated, significantly reducing the number of calculations required. Once the global stiffness matrix [K ] is assembled, the complete matrix is determined by summing this matrix with the transpose of itself. Moreover, the symmetricity of global stiffness matrix results in improved solver performance. The penalty for using average-symmetric [D] matrices comes through sub-optimal Newton-Raphson convergence at each loadstep, due to the small error in the global stiffness tangent. However the effect of this is outweighed by the speed gains achieved elsewhere in the code, resulting in a maximum overall simulation speed gain of ×1.94 when compared to the non-symmetric optimised code. This further breaks down to speed gains of ×1.58 and ×3.05 for the local stiffness formulation and solver time respectively.
Comparison of the optimised CG code to the equivalent non-optimised code by Coombs et al. [12] yields a maximum overall simulation speed gain of ×25.7 (and ×52 if the solver, setup and sparse time are disregarded). To ensure the correct implementation of the methods, the results were validated against a problem with an analytical solution. 6 This can be done via the command line using matlab -singleCompThread.
The optimised DG code takes an identical structure to the CG code, with the inclusion of a surface integral loop. The surface loop used is similar to that developed by Bird et al. [4], who presents an optimised SIPG code for linear elasticity. However, due to difficulty in deriving the interior forces for the SIPG method, this paper adopts a IIPG approach. Furthermore, this loop requires two additional calls to the constitutive model to assess the stress state of surface Gauss points on the positive and negative elements. From this, appropriate [D] matrices are returned. Unlike the CG code, the DG script does not force the [D] matrices to be symmetric as the IIPG method will result in a non-symmetric stiffness matrix regardless. For this reason, the speed gains achieved are not as substantial as for CG, as all terms in the stiffness matrices must be calculated. Nonetheless a maximum speed gain ×10.1 was observed in comparison to the equivalent non-optimised code.
It was also identified for both codes that with increasing refinement of the mesh, the solver time begins to dominate the overall simulation. This bottleneck has a greater disadvantage on the DG code due to the global stiffness matrix [K ] being non-symmetric alongside being much larger as a result of unshared degrees of freedom at the nodes. For example, for a DG analysis with ≈ 6 × 10 5 degrees of freedom, the solver time accounted for 76% of the overall simulation time.
Although the current speed gains achieved are significant, the bottleneck now resides in the linear solver rather than the stiffness assembly as for standard MATLAB finite element implementations. The presented computational efficiency codes could be used in solving more challenging problems. Recently Bird et al. presented a DG code for configurational force crack propagation for brittle materials [44]. The DG code presented in this paper could be used together with the configurational force method in [44] to simulate crack propagation in materials with plastic behaviour. The efficiency of the CG and DG codes could be further increased introducing adaptivity to the problem. An hp a posteriori error estimator for linear elasticity in the DG setting is presented in [45]. Due to the similarities between linear elasticity and the linear system arising from the Constitutive model in Algorithm 1, it plausible to foreseen the possibility to apply the a posteriori error estimator from [45] also in the elasto-plastic context.
In addition to the above points, it should be noted that the developed code is only applicable to the analysis of material where yielding is governed by a von Mises yield surface (for example metals or undrained soils). However, it is straightforward to replace the von Mises constitutive model with an alternative depending on the material analysed. The advantage of the adopted closed-form von Mises algorithm is that it allows straightforward vectorisation which may be problematic for other constitutive algorithms that require iterations to update the stress state and stiffness of the material. This is an interesting area for future research.
where A is the standard finite element assembly operator, {d} contains the displacements for all nodes in the finite element mesh, [K ] is the global stiffness matrix, (·) denotes an integral over E, ⟨·⟩ denotes an integral over ∂E ∩ Here, the + and − denote element properties corresponding to E + and E − respectively, which are defined as neighbouring elements, connected by a face F . The penalty term η σ = β F Ep 2 h −1 F , where E is the elastic Young's modulus of the material, β F is the stabilisation penalty parameter which may vary from face to face [49] (set equal to 10 in this paper), p is the polynomial order of the element and h F is the face length, and the positive outward facing normal matrix [n] is as defined by Bird et al. [4] for SIPG. It should be noted that for CG only terms (P) and (Q ) need to be evaluated. Numerical integration is carried out using Gauss-Legendre quadrature, which requires n Gp sampling Gauss points located throughout the element. For example Q is approximated using where [J] is the Jacobian mapping between the global and local coordinates and w i is the weight associated with the Gauss point.

A.2. Internal force vector: elasto-plasticity
The element internal force vector is simply obtained by multiplying the element stiffness matrix by the displacement at the nodes. For CG finite element analysis the internal force of an element can be expressed as  .4). This is because the penalty terms do not contain a non-linear component, as they simply penalise the displacement jump terms between adjacent elements. In this paper the non-linear equilibrium equation (3) is solved using a fully implicit Newton-Raphson algorithm. This requires linearisation of the equilibrium equation (3) in order to obtain the global stiffness matrix, [K ]. The form of the global stiffness matrix for elasto-plasticity is the same as that given in (A.2) but the material stiffness matrix, [D], is now non-linear in the nodal displacements. This is explained in more detail in Section 3.4.