On the fast assemblage of finite element matrices with application to nonlinear heat transfer problems

The finite element method is a well-established method for the numerical solution of partial differential equations (PDEs), both linear and nonlinear. However, the repeated reassemblage of finite element matrices for nonlinear PDEs is frequently pointed out as one of the bottlenecks in the computations. The second bottleneck being the large and numerous linear systems to be solved arising from the use of Newton's method to solve nonlinear systems of equations. In this paper, we will address the first issue. We will see how under mild assumptions the assemblage procedure may be rewritten using a completely loop-free algorithm. Our approach leads to a small matrix-matrix multiplication for which we may count on highly optimized algorithms.


Introduction
The mass and stiffness matrices are classical matrices arising when the finite element method (FEM) is used to discretize in space a PDE.Although the expression of the stiffness matrix differs from one application to another, it always depends on the gradients of the basis functions.Assembling these matrices may be computationally demanding, especially for nonlinear problems when they have to be reassembled several times.Hence, designing efficient matrix assemblage algorithms is critical.Several contributions have focused on designing algorithms of optimal or quasi-optimal complexity.An assemblage algorithm is optimal if O(1) operations are required per nonzero entry.In [1,2], the authors designed an optimal complexity algorithm by choosing Bernstein polynomials as basis functions and making good use of their properties.Other researchers have exploited the tensor product structure of basis functions leading to sum factorization techniques [3] and weighted quadrature [4,5].Low-rank approximation techniques have also been proposed for tensor product basis functions in the context of isogeometric discretizations, allowing to approximate global matrices by a sum of Kronecker products and only assembling the (much smaller) factor matrices [6,7,8].Finally, the geometry and physical nature of the problem may also be used to reduce the complexity in some circumstances [9].Besides complexity issues, another line of research has focused on designing algorithms that reconcile efficiency with the intrinsic ease of implementation of some programming languages.Indeed, classical assemblage algorithms require a loop over all finite elements.Although it may not be an issue for low level programming languages, it severely problem in strong form: find u : Ω × [0, T ] → R such that u(x, 0) = u 0 (x) in Ω (4) where ∂Ω D and ∂Ω N form a partition of the boundary ∂Ω such that ∂Ω D ∪ ∂Ω N = ∂Ω and ∂Ω D ∩ ∂Ω N = ∅.Equation ( 1) is the differential equation to be solved.Equations ( 2) and ( 3) prescribe Dirichlet and Neumann boundary conditions, respectively, and Equation ( 4) is an initial condition.We will assume the scalar coefficients m, c and a and the function g N may depend on the space and time variables as well as the unknown whereas the right-hand side function f and the function g D depend only on the space and time variables.This model problem encompasses many engineering applications such as heat transfer problems and groundwater flow.We will make all the necessary assumptions on the data such that the problem stated is well-posed.We refer to classical textbooks such as [14], Chapter 5 for a more detailed description.The first step is to derive the weak form of the differential problem.We begin by multiplying Equation (1) with a test function v.After using Green's identity and the divergence theorem, we obtain where we have defined V 0 = H 1 0 = {v ∈ H 1 : v| ∂Ω D = 0} as the space of test functions.Further defining the symmetric bilinear form a(u, v) = Ω c∇u • ∇v dΩ + Ω auv dΩ, the problem now reads: find u : Ω × [0, T ] → R with u(., t) ∈ V = {v ∈ H 1 : v| The Galerkin method consists in searching for an approximation u h in a suitable finite dimensional space V h ⊂ V .To ease the exposition, let us assume that g D = 0 such that V = V 0 .Hence, we search for In the finite element method, the approximation space V h is taken as the space of piecewise polynomials.Denoting n = dim(V h ) and {φ i } n i=1 the Lagrange basis for V h , the discrete solution u h can be expressed as u h (x, t) = n i=1 u i (t)φ i (x) where {u i (t)} n i=1 are unknown time-dependent coefficients.It is enough to enforce Equation ( 6) on all basis functions {φ i } n i=1 since any v h ∈ V h can be expressed as a linear combination of the basis functions.Hence, Equation ( 6) is equivalent to the set of equations  g N φ i dS ∀i = 1, . . ., n. (7) We now define the mass matrix M ij = Ω m φ j φ i dΩ, i, j = 1, . . ., n, the stiffness matrix K ij = a(φ j , φ i ), i, j = 1 . . ., n, the right-hand side vector f (t) such that f i (t) = Ω f φ i dΩ+ ∂Ω N g N φ i dS, i = 1, . . ., n and u(t) the vector of coefficients {u i (t)} n i=1 .Equation (7) can now be conveniently rewritten using vector notation as M u(t) + Ku(t) = f (t) (8) where M, K ∈ R n×n and f ∈ R n .Note that the stiffness matrix here accounts for two terms: Let us denote C ij = Ω c ∇φ j • ∇φ i dΩ and A ij = Ω a φ j φ i dΩ.From its definition, the matrix A has the same structure as the matrix M .Hence, we will focus on the assemblage of the mass matrix M and the conductivity matrix C. In the finite element literature, the matrix C is also commonly called a stiffness matrix.However, the more general setup we have considered here required an extended terminology.

The classical assemblage procedure
Part of the assemblage strategy is common to all matrices.Thus, let us consider the mass matrix for simplicity.
1.The assemblage procedure first takes advantage of the additivity of the integral such that where T e for e = 1, . . ., n e are the finite elements making up the mesh and n e is the number of elements.We will assume the domain is polyhedral such that it can be represented exactly by a union of triangles in 2D (tetrahedra in 3D).
2. The second step is to take advantage of the compact support of the basis functions.Indeed, the support of φ i is the union of all triangles T k to which node i belongs.Hence, Te m φ j φ i dΩ = 0 if i and j do not belong simultaneously to the finite element T e .Therefore, many of the terms (if not all) of the sum in Equation ( 9) are in fact zero for a given pair of indices (i, j).It is this property that leads to sparse matrices.Hence, it is convenient to consider only the nonzero contributions obtained by restricting the indices i and j to those that belong simultaneously to T e .Let N e denote that set of indices.Therefore, for each triangle T e , we compute the integrals where the vector φ e contains the basis functions {φ i } i∈Ne .
3. Unfortunately, the domain of integration in Equation ( 10) is different for each finite element, which is not convenient for a computer implementation.Thus, we perform a change of variables.There are different options available.Here, we will consider an affine mapping between a reference triangle T and the current triangle T e .This mapping is illustrated in Figure 1.We perform the change of variables x = F e (x) = a + B e x where B e ∈ R d×d is defined as . This mapping has the advantage that its Jacobian matrix is constant and trivially given by J Fe = B e regardless of which finite element order is used.
T < l a t e x i t s h a 1 _ b a s e 6 4 = " P 1 9 The advantage is twofold.First of all, the integration domain no longer depends on the finite element.Secondly, the basis functions only need to be defined on the reference triangle.We refer to classical textbooks such as [14] for their expression depending on the finite element order.
4. Finally, integration is most efficiently done on a computer using numerical quadrature.In that respect, Gaussian quadrature which features the minimum number of quadrature nodes n q for a given degree of exactness is particularly attractive because it minimizes the number of function evaluations.Let {w q } nq q=1 and {x q } nq q=1 be the set of quadrature weights and nodes, respectively.Sets of quadrature weights and nodes are listed for example in [15] for different degrees of exactness.When numerical quadrature is used, the integral is approximated as follows M e ≈ nq q=1 w q mq φq φT q | det(B e )| e = 1, 2, . . ., n e where we have denoted φq = φ(x q ) and mq = m(x q ).
The assemblage of the conductivity matrix C is slightly more technical.Up to step 2, the procedure is the same and we consider Te c ∇φ j • ∇φ i dΩ i, j ∈ N e e = 1, 2, . . ., n e .
We may rewrite it more compactly as where J φe denotes the Jacobian matrix of the basis functions {φ i } i∈Ne .At step 3, the change of variables leads to J φe (F e (x)).It must be related to the Jacobian matrix of the basis functions φ defined on the reference triangle.For this purpose, let us recall the relation φ(x) = φ e (F e (x)).Using the chain rule, we obtain Therefore, we obtain the expression J φe (F e (x)) = J φ(x)B −1 e and we note that B e is invertible provided the triangle is not degenerate.After substituting in Equation (12), we obtain where ĉ = c(F e (x)).Denoting A e = (B T e B e ) −1 , using numerical quadrature and denoting ĉq = ĉ(x q ), we end up with In the following, we shall drop the ˆand denote J q = J φ(x q ) for readability purposes.Thus, in a computer implementation, we set The number of quadrature nodes n q usually depends on the matrix.Indeed, provided the coefficients m and c are constant, the expression of the mass matrix involves a higher degree polynomial compared to the stiffness matrix and one may want to use more quadrature nodes in order to integrate it exactly.Also note that thanks to the change of variables, φ q and J q only depend on the quadrature nodes and may be precomputed once and for all.Algorithm 1 is the classical assemblage algorithm.In practice, it is not exactly this version that is implemented but a variant which consists in first computing all the local matrices, storing them and then initializing the global matrices.In the process, the local matrices may be stored in a matrix of size n p ×n p n e for instance.This avoids having to either repeatedly add nonzero entries to sparse matrices or initializing a very large matrix full of zeros which might not fit in memory.

Algorithm 1 Classical assemblage of FEM matrices
Finite element parameters: n e : number of finite elements n p : number of nodes per element n q : number of quadrature nodes per element Input: Quadrature weights {w q } nq q=1 and nodes {x q } nq q=1 potentially specific to each matrix Output: Global mass matrix M and conductivity matrix C 1: Compute φ q and J q for q = 1, 2, . . ., n q 2: Initialize M and C 3: for e = 1, 2, . . ., n e do

4:
Initialize M e and C e 5: for q = 1, 2, . . ., n q do 6: C e ← C e + w q c q J q A e J T q | det(B e )|

A reformulation of the assemblage procedure
As we will see, our approach is nothing more than a reformulation of the same assemblage procedure which allows to eliminate both for loops.The core strategy consists in not assembling the local matrices but their vectorization which allows to decouple the different dependencies and conveniently compute and store invariant quantities.

Assemblage of the mass matrix
Let us first deal with the local mass matrix.Recall that it is given by w q m q φ q φ T q | det(B e )| e = 1, 2, . . ., n e .
Let us denote Φ = [φ 1 , φ 2 , . . ., φ nq ] ∈ R np×nq and Λ e = (λ 1 , λ 2 , . . ., λ nq ) T ∈ R nq .Therefore, we obtain vec(M e ) = (Φ Φ)Λ e e = 1, 2, . . ., n e where denotes the Khatri-Rao product.For two matrices Assuming all the elements of the mesh are of the same type, the vectorizations may be computed simultaneously since the matrix (Φ Φ) does not depend on the element.Thus, we obtain Therefore, the entire assemblage procedure is compactly written as a matrix-matrix multiplication between a small matrix Q and a long matrix Λ with only a few rows but many columns.From a linear algebra point of view, the matrix Q can be seen as a basis for the vectorizations of the element matrices.Figure 2 provides an illustration of the situation.The small matrix Q can be easily computed and stored once and for all.Let us now see how Λ may be computed.We had denoted λ q = w q m q | det(B e )|.Note that if all elements are of the same type, then the quadrature weights do not depend on the element.Moreover, if an affine mapping is used, | det(B e )| does not depend on the quadrature nodes.Only the coefficients m q are both quadrature and element dependent.Making this dependency explicit, we have λ qe = m qe w q | det(B e )|.Defining the vectors w ∈ R nq and b ∈ R ne such that (w) q = w q and (b) e = | det(B e )| and the matrix S m ∈ R nq×ne containing all the coefficient evaluations, then Λ may be expressed as Λ = S m * (wb T ) (13) where * denotes the Hadamard product.The Hadamard product of two matrices A, B ∈ R n×m is their element-wise product, defined as (A * B) ij = a ij b ij .In practice, the computation of Λ does not require forming the matrix wb T explicitly and then computing the Hadamard product.Since Λ ij = w i (S m ) ij b j , the products w i (S m ) ij can be computed first by overwriting S m and then computing w i (S m ) ij b j .Although the number of finite elements n e may be extremely large in applications, the number of quadrature nodes n q is typically very small and therefore computing these matrices does not lead to storage issues.

Assemblage of the conductivity matrix
Let us now investigate how the procedure may be adapted to the assemblage of the conductivity matrix.Recall that the local conductivity matrix was expressed as Analogously to what we did previously, we first define λ q = w q c q | det(B e )|.Then, taking the vectorization, we obtain vec(C e ) = nq q=1 λ q vec(J q A e J T q ) = nq q=1 λ q (J q ⊗ J q ) vec(A e ) e = 1, 2, . . ., n e .Once again, we end up with a simple matrix-matrix multiplication.The procedure to assemble the matrix Λ is the same as before where the matrix S m must be replaced by the matrix S c containing the evaluations of the coefficient c.The matrix Q defined here is slightly larger than the one defined for the mass matrix.In fact, it contains d 2 times more columns.However, since d = 2, 3 is a small integer, the matrix Q remains very small and can be stored without any problem.We summarize this alternative assemblage procedure in Algorithm 2. Note that if the coefficients m and c are constant Line 5 can be skipped and Lines 6 and 7 reduce to Λ m = mwb T and Λ c = cwb T , respectively.Whereas the assemblage includes the formation of the global matrices (Line 12), the novelty of our approach only lies in the formation of the element matrices (Line 1 to 11).From a computational point of view, the distinction between formation of element matrices and matrix assembly may be critical and will be specified when needed.However, for the sake of the presentation, we will often simply use the word assemblage to describe both.

Algorithm 2 Alternative assemblage of FEM matrices
Finite element parameters: n e : number of finite elements n p : number of nodes per element n q : number of quadrature nodes per element Input: Vector of quadrature weights w (potentially specific to each matrix) w ∈ R nq Basis functions φ q = φ(x q ) and Jacobian matrices J q = J φ(x q ) evaluated at the quadrature nodes xq for q = 1, . . ., n q .
φ q ∈ R np and J q ∈ R np×d Output: Global mass matrix M and conductivity matrix C 12: Form the matrices M and C from V m and V c and the set of indices of nonzero entries 13: Return M and C

Consequences for nonlinear FEM
In case one of the coefficients m, c or a or the function g N depends on the unknown, the problem becomes nonlinear and different solvers must be used.To cover such an event, let us rewrite the semi-discrete approximation in Equation ( 8) making the dependencies explicit Different numerical integration schemes may be used to solve approximately the resulting nonlinear system of differential equations.For simplicity, let us consider the implicit Euler method.Denoting N the number of sub-intervals in time and ∆t = T N the step size (assumed constant), the method requires solving the following nonlinear system of equations for the unknown u s+1 for s = 0, 1, . . ., N − 1 starting from u 0 with u s ≈ u(t s ), t s = s∆t s = 0, 1, . . ., N .Defining F(u) = M (u)(u − u s ) + ∆tK(u)u − ∆tf (u, t s+1 ), we solve the nonlinear system using Newton's method for which the iterations are defined as until a prescribed tolerance is met.The sole question is how to compute the Jacobian matrix ∂F(u) ∂u .For this purpose, let us write the system component-wise We now compute ∂F i ∂u k for i, k = 1, . . ., n considering the different terms Similarly, we obtain and we defined the third order tensors Their dependency on u is understood.Before giving their explicit expressions, we note that the discrete solution u h can be expressed as u h (x, t) = φ(x) T u(t) where the vector φ(x) contains all the basis functions.Then, the tensors are expressed as Finally, Regrouping all the expressions, we obtain where • 2 denotes the multiplication of a tensor and a vector in the second mode.Note that M, K ∈ R n×n×n and M, K, B ∈ R n×n .The tensor M is symmetric in all modes, meaning that all indices can be swapped.However, the tensor K is symmetric only in the two first modes (K ijk = K jik ) due to the first term in its expression.Because of the degree of symmetry of these tensors, the multiplication can be done equivalently in the first mode.Unfortunately, when the multiplication of K and a vector is performed (in the first or second mode), the resulting matrix is nonsymmetric.Therefore, the Jacobian matrix ∂F(u) ∂u is in general nonsymmetric unless c = 0. Having to deal with third order tensors may seem very worrisome, knowing that n may reach over a million.However, the tensors are extremely sparse due to the compact support of the basis functions.Let us see how these tensors may be efficiently computed using the same arguments as for the mass and stiffness matrices.More specifically, we will be focusing on the assemblage of the mass tensor and conductivity tensor, which only considers the first term of the stiffness tensor.

Assemblage of the mass tensor
For readability purposes, let us denote m (φ T u) simply m , making the dependency on u implicit.Then, The exact same procedure described in Sections 3 and 4 may be applied to compute this quantity.The same arguments as above lead to the computation of the local mass tensor w q m q φ q • φ q • φ q | det(B e )| e = 1, 2, . . ., n e where • denotes the outer product.Denoting once more λ q = w q m q | det(B e )| and taking the vectorization, we obtain vec(M e ) = nq q=1 λ d (φ q ⊗ φ q ⊗ φ q ) = (Φ Φ Φ)Λ e e = 1, 2, . . ., n e .
Gathering all the equations for e = 1, 2, . . ., n e , and denoting We recover once more a matrix-matrix multiplication.The matrix Q is larger than the one defined for the assemblage of the mass matrix due to the fact that we are handling the vectorization of tensors.More specifically, it contains n p times more lines.Even if n 3 p might become quite large for high order finite elements in 3D, n q typically remains small and thus storing Q remains feasible.

Assemblage of the conductivity tensor
Similarly to before, we denote c (φ T u) simply c .Then, the conductivity tensor is expressed as Using the compact support of the basis functions followed by the same change of variables we obtain the local conductivity tensor After approximating it using Gaussian quadrature, dropping the ˆfor readability, and denoting A e = (B T e B e ) −1 , we set w q c q J q A e J T q • φ q | det(B e )| e = 1, 2, . . ., n e .
As usual, denoting λ q = w q c q | det(B e )| and taking the vectorization, we obtain the expression vec(C e ) = nq q=1 λ q (φ q ⊗ vec(J q A e J T q )) = nq q=1 λ q (φ q ⊗ J q ⊗ J q ) vec(A e ).
Denoting once more W = [vec(A 1 ), vec(A 2 ), . . ., vec(A ne )] ∈ R d 2 ×ne and gathering all the equations for e = 1, 2, . . ., n e we end up with the familiar equation This equation has the same structure as for the conductivity matrix.Only the definitions of the various matrices must be adjusted.
As we have seen, the assemblage of the mass and conductivity tensors does not induce any additional difficulty.Regardless of the object we are assembling, we end up with a matrix-matrix multiplication QX where Q is a small matrix that is computed once and for all and stored.This highly standardized assemblage procedure means Algorithm 2 can in fact be used for the assemblage of the mass and conductivity tensors provided the definitions of the matrices involved are adjusted.The full potential of our procedure clearly appears for nonlinear problems.Indeed, decoupling the various quantities leads to tremendous savings of floating point operations since invariant quantities are only computed once.Let us summarize our findings at this stage.
1.The computation of the nonzero entries of the finite element matrices and tensors always requires a matrix-matrix product QX. 2. For the mass matrix and tensor X = Λ. 3.For the conductivity matrix and tensor X = Λ W . 4. The matrix Λ is always expressed as Λ = S * (wb T ).
For the assemblage of a given object (matrix or tensor): 1.The matrix Q and the vector w only depend on the quadrature nodes and weights.2. The matrix W and the vector b only depend on the mesh.3. The matrix S is the only matrix to potentially depend on the unknown u.
Based on these findings, the matrix S is therefore the only matrix to be recomputed at each step of Newton's method.Consequently, only a few products have to be carried out at each iteration to compute the nonzero entries.Provided the mesh does not change at each iteration, the matrices Q and W and the vectors w and b are only computed once.The price to pay is to store a few matrices.However, these matrices do not grow prohibitively with the size of the problem.Even on our largest test cases, storage issues were never experienced.

Experiment 1
In this first experiment1 , we measure the performance of our assemblage algorithms by comparing them to FreeFEM++ [16], an open-source and highly optimized finite element computer software written in C++, a low-level programming language.FreeFEM++ is a well-established reference in the finite element community and was also used by the authors in [10,11] for performance comparison.In contrast, the implementation of Algorithms 1 and 2 is carried out in MATLAB, an interpreted high-level programming language.The unit square Ω = (0, 1) 2 is discretized by 6 increasingly fine meshes, generated using Gmsh [17], and whose mesh sizes are reported in Table 1 along with the sizes of the associated finite element matrices.The same meshes are used consistently in MATLAB and FreeFEM++ such that the matrix sizes coincide.
The computation times are compared for both P 1 and P 2 discretizations.The assemblage of finite element matrices for P 2 discretizations is more demanding both because of the greater number of nodes and number of quadrature nodes per element needed to compute the integrals.In our experiments, the number of quadrature nodes is always taken to be the minimum number such that the integrals can be computed exactly provided the coefficients are constant.All the experiments are done on MacOS using a 2.2 GHz Dual-Core Intel Core i7 processor with 8 GB of RAM.Since MATLAB does not provide a built-in function for computing the Khatri-Rao product of two matrices, we used the implementation from Laurent Sorber available on the Central File Exchange [18].

Mesh size
Matrix size for P 1 Matrix size for The average computation times over 5 consecutive launches are reported in Figures 3 to 6.For our implementation of Algorithms 1 and 2, it is now important to distinguish the assemblage of the global matrices from the formation of all element matrices.Indeed, during the course of the experiments, we noticed that the call to MATLAB's built-in sparse function at the end of the assemblage algorithm could be a major computational bottleneck.Thus, our results are reported separately with and without the function call.Unfortunately, this distinction cannot be done for FreeFEM++ unless the source code is modified, a risky meddling that requires a deep software knowledge.Thus, the computation times reported for FreeFEM++ are only for the assemblage of global matrices.Nevertheless, we believe element matrices are assembled into global matrices much more efficiently in FreeFEM++ than they are in MATLAB.Therefore, while comparing the assemblage of global matrices in FreeFEM++ to the formation of element matrices in MATLAB is certainly unfair, it does provide a very useful upper bound on the expected performance of our algorithms.On the other hand, comparing the assemblage of global matrices in both FreeFEM++ and MATLAB provides a safe lower bound on the expected performance.In other words, it gives an idea of the worst case performance.
The results reported in Figures 3 to 6 indicate that all algorithms have roughly the same complexity and scale linearly with the size of the matrices.This could be expected since Algorithm 2 is nothing more than a reformulation of Algorithm 1.As expected, FreeFEM++ is significantly faster than the standard assemblage algorithm (Algorithm 1), implemented in MATLAB.As a matter of fact, FreeFEM++ assembles the global matrices much faster than the element matrices in Algorithm 1, a clear testimony of the poor performance of embedded for loops in MATLAB.On the contrary, Algorithm 2 reserves this trend completely.It is far ahead of FreeFEM++ for the formation of element matrices and interestingly preserves a very good margin for the assemblage of global matrices of sufficiently large size.It means that Algorithm 2 in its current state, even with the inefficient sparse function, is significantly faster than FreeFEM++.We recall that it is essentially based on Khatri-Rao products and matrix-matrix multiplications for which there exists highly efficient implementations.As a performance indicator, Table 2 reports the speedup factors for the formation of element matrices and the assemblage of global matrices for P 1 discretizations.The results for P 2 discretizations are reported in Table 4.The speedup factor is defined as the ratio of computation times with respect to Algorithm 2. Since the computation time for the formation of element matrices cannot be easily retrieved in FreeFEM++, the "Formation" column actually divides the assemblage time of FreeFEM++ by the formation time of Algorithm 2 and provides an upper bound on the expected performance.The "Assemblage" column truly divides the assemblage times but since MATLAB's sparse function takes such a large toll it provides a rather pessimistic (but conservative) lower bound on performance.According to this worst case analysis, Algorithm 2 is faster than FreeFEM++ by at least a factor 2.5 to 3.5 for P 1 discretizations and 2.2 to 3.9 for P 2 discretizations (and sufficiently large matrices).In general, we expect the speedup ratio to attain its maximum over a range of medium sized matrices.On the one hand, sufficiently large matrices are required for the computation time not to be dominated anymore by the least relevant parts of the algorithm such as the computation of the position of the nonzero entries.Since these parts of the algorithm are common to Algorithms 1 and 2, our strategy cannot make a significant difference.On the other hand, when the matrices become extremely large, MATLAB is forced to switch between different levels of cache memory.Large problems require a larger but also slower cache memory.Therefore, the speedup ratio slightly decreases but remains overall very large.
For Algorithm 1, profiling reports indicated that most of the computation time was spent in the inner for loop over the quadrature nodes, as could be expected.For Algorithm 2, the bottleneck was found to be the call to MATLAB's built-in sparse function at the end of the algorithm.The sparse function's share in the overall computational expense in shown in Tables 3 and 5 for P 1 and P 2 discretizations, respectively.The toll taken by MATLAB's sparse function is noticeable qualitatively by comparing the widths of the shaded areas in Figures 3 to 6.The larger the width is, the greater the toll.It generally increases with the size of the matrix and may account for as much as 82% of the computation time.For instance, for our largest benchmark with matrices of size over 2 million, all element mass matrices are computed in 1 s whereas the global mass matrix is assembled in 5.3 s.

Experiment 2
In this second experiment, we propose to test our implementation on a nonlinear transient heat transfer problem.Computationally speaking, nonlinear problems are very demanding both because of the repeated reassemblage of finite element matrices and because of the numerous linear systems to be solved.The case study is taken from [19], where a series of validation examples for thermal finite element software are reported.The geometry is a square of side length 0.2 m, Ω = (0, 0.2) 2 .The source of non-linearity is coming from both a temperature dependent thermal conductivity and radiation boundary conditions.The material properties, geometry and boundary conditions are summarized in Table 6 and Figure 7.The thermal conductivity is a piecewise linear function such that with T 1 = 0 °C, T 2 = 200 °C, T 3 = 1000 °C and k 1 = 1.5, k 2 = 0.7, k 3 = 0.5.All thermal conductivities are expressed in W mK .The differential problem reads with coefficients h c = 10 W m 2 K and h r = σ with = 0.8 being the emissivity and σ = 5.670373 × 10 −8 W m 2 K 4 the Stefan-Boltzmann constant.The ambient temperature is set to T a = 1000 °C and the initial temperature is T 0 = 0 °C.Finally, the simulation is carried out over three hours (T = 10800 s).The time step was set to ∆t = 10 s.The geometry was discretized using P 2 finite elements and the implicit Euler method was used for the numerical integration of the nonlinear system of differential equations.The nonlinear systems were solved using the classical Newton method described in Section 5.The simulations were run independently on our in-house solver and MATLAB's PDE toolbox.A similar mesh size was used in both cases such that the meshes generated contained 7593 and 7465 nodes, respectively.The finite element solution at time t = 1990 s along with the heat flux vectors is reported in Figure 8 for illustration.To compare the results of our in-house solver with those of MATLAB's PDE toolbox, the temperature increase at the center of the domain as a function of time is reported in Figure 9.The curves overlap perfectly and validate our implementation.To compare the performance of the implementations, we have generated profiling reports.Finite element software typically involve multiple stages from pre-processing to post-processing.In order to make a fair comparison, we have restricted our attention to the computation times needed to compute the finite element solution.More specifically, we refer to it as the time spent in the solve function of the program.Although we do not know which scheme is being used in MATLAB both for the numerical integration and the solution to the nonlinear systems, we expect it to use stateof-the-art implementations.In comparison, our implementation uses a basic numerical integration scheme and the classical Newton method.The latter is known to be far too expensive and is not much used in practice.We intentionally decided to carry on with it because it provides a good test for our algorithm.On the other hand, we carefully tried to optimize the remaining part of our program.In particular, knowing that calls to MATLAB's sparse function were the bottlenecks in Algorithm 2, we tried to use it as little as possible.For this purpose, we designed linear operators to carry out matrix-vector or tensor-vector multiplications without explicitly assembling the matrix nor the tensor.Secondly, we designed an algorithm to detect the various sources of non-linearity and hence avoid unnecessary computations.The idea being to only recompute quantities which are changing.We believe MATLAB uses a similar strategy.The highly standardized assembly procedure we have described allows us to assemble all necessary matrices in a single file.Algorithm 2 is the workhorse behind the assembly procedure.We ran our simulation twice using two different strategies to solve the linear systems of equations.
1.The linear systems are solved using the backslash (\) command.
2. The linear systems are solved using an in-house implementation of GMRES [20].
The first strategy is obviously very bad since a matrix factorization will be recomputed at each step of the Newton iterations.Nevertheless, the Newton method converged very rapidly and required only 2 to 3 iterations for each nonlinear system to meet the prescribed tolerance on the norm of the increment at 10 −7 .Profiling reports indicated that in total 2757 linear systems were solved for 1080 time-steps.Despite the tremendous number of linear systems, our implementation outperformed MATLAB's PDE toolbox by a factor 2.07, the computation times being 186.1 s against 385.1 s.Due to the poor choice of solver for the nonlinear and linear systems, it is reasonable to believe that the performance of our implementation comes from the reassemblage procedure.Profiling reports indicated that about 74% of the computation time was spent solving the linear systems.The formation of element matrices only accounted for 7.6% of the computation time.
In a second experiment, we chose a slightly better strategy to solve the linear systems.We used the iterative method GMRES.This choice stems from the fact that the linear systems involve a nonsymmetric matrix in general.Despite the large matrix size, its good conditioning allowed to solve the linear systems to the prescribed tolerance of 10 −7 within 40 iterations without any preconditioning.We nevertheless kept the classical Newton method as nonlinear solver.For this simulation, our implementation outperformed MATLAB's PDE toolbox by a factor 4.94, bringing down our computation time to 78 s.In this case, the computation time was split more evenly.About 37.7% was used for solving the linear systems while 20.6% was spent in the formation of element matrices.
These experiments reveal the huge computational savings that could be achieved through the implementation of both efficient nonlinear and linear solvers combined with our reassemblage algorithm.

Limitations, potential improvements and future work
Let us briefly discuss some of the assumptions we have made.
• In Section 4, we had assumed all elements were of the same type.This allowed us to compute simultaneously the vectorization of all local matrices.Having a mesh with different types of elements is in fact not a limitation.Indeed, a mesh with different types of elements can always be split into several meshes each having a unique type of element.Therefore, the procedure we have described can be applied separately on each single mesh.The different contributions are then merged again before creating the sparse matrix.
• In Section 3, we had assumed an affine mapping was used between the reference and the current element.This meant the Jacobian matrix B e was constant.In practice, other mappings may be used such as isoparametric mappings.In this case, the Jacobian matrix not only depends on the element but also on the quadrature nodes.We can cover for this situation by only changing the definition of the matrix X.In particular, one must then compute a matrix of size n q × n e containing all the determinants and another matrix of size d 2 n q × n e containing all the vectorizations of the Jacobian matrices.
• From the very start, in Section 2, we assumed the coefficient c was a scalar.However, it may happen to be a small matrix of size d × d.This situation arises for instance in case of anisotropic thermal or hydraulic conductivities.Then, returning to point 4 in Section 3, we would have w q J q B −1 e c q B −T e J T q | det(B e )| e = 1, 2, . . ., n e .
Hence, the problem is still successfully decoupled.However, we have not yet worked out the implementation details as anisotropic conductivities are not frequently encountered in practice.
• For nonlinear problems where the mesh is changing from one iteration to the next, mesh dependent vectors and matrices will clearly have to be recomputed as well.Provided the algorithms for such tasks are efficient, it should only have a moderate impact on performance.
Despite the already tremendous savings, there is still room for improvements.Let us list a few ideas.
• The finite element matrices or tensors we have considered so far always have a certain degree of symmetry.The symmetry naturally carries over to the local matrices and tensors.As we are computing their vectorizations, we straightforwardly know that some of the elements of these vectors will be equal.Having expressed their vectorizations as V = QX allows to avoid unnecessary computations by simply eliminating a few rows of the matrix Q that will only generate duplicates when the product QX is computed.Not only does it induce savings in terms of floating point operations but also in terms of storage because the matrix Q we are storing is smaller than the original one.However, in our implementations, we did not bother with such considerations because neither the product QX nor the storage were the bottlenecks.
• The formation of the element mass and conductivity matrices is done sequentially in Algorithm 2 to highlight the similarities in the assembly procedure.However, the formation of all element matrices for each global matrix can naturally be done in parallel.Moreover, the formation of element matrices heavily relies on matrix-matrix operations (level 3 BLAS) that are also well-suited for parallel computations.
• As already noted, if one is not careful enough, much of the computation time for nonlinear problems is spent on repeated calls to MATLAB's built-in sparse function.In our implementation, we attempted to reduce the number of calls to a minimum by designing linear operators for carrying out matrix-vector and tensor-vector multiplications.However, we eventually had to form the tangent matrix explicitly in order to use sparse direct solvers (which are used when calling the backslash command).Using GMRES instead allowed to avoid entirely calls to the sparse function.However, it was not always advantageous to do so.The reason being that our linear operators could not perform a matrix-vector multiplication as fast as MATLAB could do a sparse matrix-vector multiplication.The use of linear operators was only advantageous when the matrices were very large and GMRES converged very rapidly.
Moreover, one might reasonably question whether it is worthwhile assembling the multiplication between tensors and vectors instead of first assembling the tensor and then computing the multiplication.There are two reasons behind this choice: 1.Although our procedure successfully applied to M • 2 (u − u n ), it did not for K • 2 u.We were unsuccessful in decoupling the terms as we had done for all other quantities.On the other hand, our procedure extended straightforwardly to the assemblage of M and C.
2. It may happen in applications that the coefficients, although dependent on the unknown, are not too complicated functions.In particular, if the function is linear, then the corresponding tensor is constant and therefore it will only be assembled once.Whereas the multiplication between the tensor and the vector will have to be repeatedly reassembled.
In future work, we will attempt to tackle the following points: • Several interpreted programming languages already offer technologies to speed up critical loops.Python's Numba and MATLAB's Just-In-Time compilation and MEX function generation are just some examples.It might be worthwhile assessing in future work how our strategy compares with existing technologies.Benchmarking would especially target nonlinear problems.Indeed, as we have seen, some of the matrices needed in the assembly process are only computed once, inducing savings in floating point operations.
• We have already successfully extended our algorithm to the vector PDE of linear elasticity.
For the mass matrix, the extension is straightforward due to its Kronecker product structure [21].The stiffness matrix, on the other hand, is much more challenging and we plan to present the algorithm is an upcoming publication.
• Although all the concepts we have seen apply to 3D problems, the implementation is surely more troublesome.Indeed, dealing with 2D problems allowed us to compute the invariant matrices needed in the algorithm very cheaply.For instance, the vector b could be computed by simply using the formula for the determinant of 2×2 matrices along with element-wise operations defined in MATLAB.For 3D problems, these computations become more burdensome.It is therefore worthwhile investigating if these matrices can still be computed efficiently.This is especially important for linear problems.For nonlinear problems, the computation of these vectors and matrices will not play a major role in the overall computational expense since they are usually only computed once.

Conclusion
In this paper, we have investigated an alternative procedure for assembling finite element matrices and tensors.Although entirely equivalent from a mathematical point of view, our approach allows to design a completely loop-free algorithm.It is clear that the benefits of this approach are especially visible for interpreted programming languages such as MATLAB and Python.But they might hold as well for other languages.Indeed, our approach allows to decouple the various quantities needed in the assemblage of the matrices and therefore avoids having to recompute unnecessary quantities.This strategy becomes particularly appealing for nonlinear problems, when finite element matrices must be reassembled numerous times.Much of the computational workload is concentrated in a few Khatri-Rao products and a small matrix-matrix multiplication.Thanks to its reliance on standard and highly optimized linear algebra operations, our method achieves a tremendous speedup in comparison to traditional assemblage strategies.Finally, at least from a conceptual perspective, our approach is not subjected to any major limitation.

Te m φ j
φ i dΩ i, j ∈ N e e = 1, 2, . . ., n e .The cardinality of N e is equal to the number of nodes of the finite element.Let us denote it n p = |N e |.Another way of viewing it is to define the local matrices M e ∈ R np×np such that M e = Te m φ e φ T e dΩ e = 1, 2, . . ., n e t e x i t s h a 1 _ b a s e 6 4 = " Y 4 o B s W k 6 D K U e n N d Z 6 4 o z n z m C P 3 D e f g A E S J F T < / l a t e x i t > T e < l a t e x i t s h a 1 _ b a s e 6 4 = " w 2 A b m n w 7 i b s H A h n K s e n T s r G q x w U = " > A A A B 7 n i c b Z C 7 S g N B F I b P e o 3 r L

Figure 1 :
Figure 1: Affine transformation between the reference and the current triangle

M
(N e , N e ) ← M (N e , N e ) + M e 10: C(N e , N e ) ← C(N e , N e ) + C e 11: end for 12: Return M and C

Figure 2 :
Figure 2: Illustration of the matrix sizes for the product QΛ

n 2 p ×d 2 nq 4 :
Compute the vector b and matrix W from mesh related data.b ∈ R ne , W ∈ R d 2 ×ne 5: Compute the matrices S m and S c containing coefficient evaluations S

Figure 3 :Figure 4 :
Figure 3: Computation times for the formation and assemblage of the mass matrix using P1 finite elements

Figure 7 :
Figure 7: Geometry and boundary conditions

Figure 8 :Figure 9 :
Figure 8: Finite element solution at time t = 1990 s

Table 1 :
Mesh and matrix sizes

Table 2 :
Speedup factors for Algorithm 1 (Algo 1) and FreeFEM++ (FF++) for the formation of element matrices and assemblage of global matrices using P1 finite elements.

Table 3 :
Time fraction spent in the sparse function for the assemblage of global matrices using P1 finite elements

Table 4 :
Speedup factors for Algorithm 1 (Algo 1) and FreeFEM++ (FF++) for the formation of element matrices and assemblage of global matrices using P2 finite elements.

Table 5 :
Time fraction spent in the sparse function for the assemblage of global matrices using P2 finite elements

Table 6
2, . . ., n e which would lead to expressing the local conductivity matrix as