Semi-analytic integration for a parallel space-time boundary element method modeling the heat equation

This paper concentrates on the boundary element method (BEM) for the heat equation in three spatial dimensions. In particular, we deal with tensor product space-time meshes allowing for quadrature schemes analytic in time and numerical in space. The spatial integrals can be treated by standard BEM techniques known from three-dimensional stationary problems. The contribution of the paper is twofold. First, we provide the temporal antiderivatives of the heat kernel necessary for the assembly of BEM matrices and the evaluation of the representation formula. Second, the presented approach has been implemented in the publicly available library besthea, allowing researchers to reuse the formulae and BEM routines straightaway. The results are validated by numerical experiments in an HPC environment.


Introduction
For a bounded Lipschitz domain Ω ⊂ R 3 we aim to solve the initial boundary value problem (1.1)-(1.2) for the heat equation. Such initial boundary value problems can be solved by boundary element methods. A survey on discretisation methods involving boundary integral equations is given in [1]. Here we consider a space-time formulation and a Galerkin method for discretisation. A comprehensive analysis of the involved integral equations is given in [2]. Error analysis for the Galerkin method has been provided in [2,3,4,5,6]. A space-time formulation has certain advantages with respect to adaptivity and parallelisation. It allows quite general adaptivity in space and time compared to time stepping and convolution quadrature methods. A common parallelisation in space can be enhanced by an additional parallelisation with respect to time, which is not possible for time stepping approaches.
We aim to provide a complete and (hopefully) error-free presentation of details on the implementation of a Galerkin boundary element method for the three-dimensional heat equation considering all boundary integral operators. Galerkin methods have been considered in, e.g., [2,3,6,7] for 2d and [8,9,10] for 3d. Typically, implementational aspects are discussed only briefly, and a lot of effort is necessary to transform the theoretical results into performant computer code. Our aim is to remove this obstacle by providing a detailed discussion and a publicly available C++ library.
In the case of space-time tensor product discretisations the integrals with respect to time can be carried out analytically. This may result in a significant reduction of computational times. In [3], aspects of the temporal integration are discussed, and [8] considers the 3d setting. Unfortunately, such presentations are typically very brief, and we try to fill the gap by a detailed discussion.
The paper is organised as follows. In Section 2 we introduce the considered space-time boundary integral equations. The discretisation by the boundary element method is provided in Section 3 together with the derivation of the heat kernel antiderivatives necessary for the assembly of system matrices and the evaluation of the representation formula. For later reference we provide a short summary of the formulae in Section 4. Section 5 describes the implementation approach as provided in the besthea C++ library [11]. We validate the results by numerical experiments in Section 6 and conclude in Section 7.

Boundary integral equations
The solution to the initial boundary value problem (1.1)-(1.2) is given by the representation formula (2.1), composed of the single-layer potential (Ṽ w)(x, t) := ∫_0^t ∫_Γ G_α(x − y, t − τ) w(y, τ) ds_y dτ and the double-layer potential (W u)(x, t) := ∫_0^t ∫_Γ α (∂G_α/∂n_y)(x − y, t − τ) u(y, τ) ds_y dτ, with the fundamental solution to the heat equation G_α(x − y, t − τ) := (4πα(t − τ))^{−3/2} exp(−|x − y|²/(4α(t − τ))) for t > τ, 0 otherwise, and its scaled normal derivative α (∂G_α/∂n_y)(x − y, t − τ) for t > τ, 0 otherwise.
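To make the causal structure of the kernel concrete, the fundamental solution can be evaluated as follows. This is a minimal sketch in the spirit of the besthea implementation; the function name and argument convention are ours, not the library's interface.

```cpp
#include <cmath>

// Fundamental solution of the heat equation in three spatial dimensions,
// G_alpha(x - y, t - tau) for t > tau, 0 otherwise. Arguments: the squared
// spatial distance |x - y|^2 and the temporal difference delta = t - tau.
double heat_kernel(double dist2, double delta, double alpha) {
  if (delta <= 0.0) return 0.0;  // causality: no influence backwards in time
  double denom = 4.0 * alpha * delta;
  return std::exp(-dist2 / denom) / std::pow(M_PI * denom, 1.5);
}
```

Note how the kernel vanishes identically for t ≤ τ; this causality is what later makes the block matrices lower triangular in time.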
The operators V and W are well-defined in the setting of anisotropic Sobolev spaces, see e.g. [12,13] for a definition of such spaces. In particular, natural choices are X := H 1/2,1/4 (Σ) for the space of the Dirichlet datum u and its dual X ′ := H −1/2,−1/4 (Σ) for the Neumann datum w := α∂u/∂n. By applying the Dirichlet and Neumann trace operators to the representation formula (2.1) we get the boundary integral equations [1,2,3,5] V w(x, t) = (1/2 I + K) u(x, t) and D u(x, t) = (1/2 I − K ′ T ) w(x, t), respectively. The boundary integral operators V , K, D, and K ′ T satisfy integral representations which hold for sufficiently regular functions. The above boundary integral equations are equivalent to the variational formulations (2.2) and (2.3) with the duality pairing ⟨·, ·⟩ Σ between X ′ and X given by the continuous extension of ⟨v, w⟩ Σ := ∫_Σ v w ds_x dt. For the duality pairing with the hypersingular operator we have an alternative representation removing the non-integrable singularity, namely the integration-by-parts formula (2.4) of [2,14] for ⟨Du, r⟩ Σ , where the surface curl of a sufficiently regular function u is defined by curl Γ u(x) := n(x) × ∇ũ(x) for a suitable extension ũ of u to an open neighbourhood of Σ. If the functions u and r are regular enough, (2.4) admits an integral representation. We will consider such a representation in the discrete setting in Section 3.3.
To solve an initial boundary value problem for the heat equation (1.1) with homogeneous initial conditions and prescribed Dirichlet or Neumann boundary data, it suffices to determine the unknown boundary data. Then we can use the representation formula (2.1) to recover the solution. In the case of a Dirichlet boundary value problem the Neumann datum w can be determined from (2.2) while in the case of a Neumann boundary value problem the Dirichlet datum u satisfies (2.3). It is shown in [2,3] that these variational formulations admit a unique solution. In the next section we deal with their discretisation.

Boundary element method
For the discretisation of the variational formulations (2.2) and (2.3) we need a discretisation of the space-time boundary Σ. We restrict our attention to tensor product space-time discretisations Σ h with uniform time steps. For a given uniform decomposition of the time interval with time step h t and an admissible triangular mesh Γ h , which approximates Γ := ∂Ω and consists of planar triangular elements γ j , we define the space-time mesh Σ h as the tensor product of the temporal and spatial meshes. On Σ h we construct approximating spaces X 1,0 h ⊂ X and X 0,0 h ⊂ X ′ accordingly as tensor products, i.e. as linear combinations of functions whose spatial and temporal contributions can be separated. We thus define X 0,0 h as the space of functions piecewise constant both in space and time and X 1,0 h as the space of functions globally continuous and piecewise linear in space and piecewise constant in time.
Here we denote by N x the number of nodes of the triangular mesh Γ h . A function u h in X 1,0 h admits the representation u h (x, t) = Σ_i Σ_j u i,j φ i t (t) φ j x (x) with temporal basis functions φ i t and spatial basis functions φ j x , where the first index of the coefficient u i,j is associated with time and the second one with space. This notation is slightly inconsistent with the naming convention classically used for the function spaces, where the first superscript is related to space and the second one to time. However, for the implementation and representation of the matrices it is more natural to sort with respect to time first, which is why we use this notation.
To discretise the variational formulations (2.2) and (2.3) we replace the functions in X and X ′ with their discrete counterparts in X 1,0 h and X 0,0 h respectively. In the following subsections we give more details about the resulting discrete operators. In particular, we focus on the computation of the corresponding integrals.

Single-layer boundary integral operator
We start with the discretisation of the bilinear form ⟨V w, q⟩ Σ . By replacing w with the approximation w h ∈ X 0,0 h and testing with a basis function of X 0,0 h we obtain the entries of the Galerkin matrix. Here we changed the order of the integrals, which is justified by Fubini's theorem and the fact that G α is Lebesgue integrable on Σ h × Σ h , which follows from the estimate (3.1) of [15, Ch. 13 §3]. Since the fundamental solution depends only on the difference t − τ and the considered decomposition of the time interval is uniform, the double temporal integrals depend only on the difference d := k − i. The duality pairing with all basis functions thus leads to the block Toeplitz matrix-vector product (3.2) with vector components w d j := w d+1,j and the spatial matrix blocks V d h defined by (3.3). To set up V h we use analytic integration in time and a regularised quadrature in space as used in stationary problems. The details are given in the following paragraphs.
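The block Toeplitz structure pays off in the matrix-vector product: only the generating blocks V d h for d = 0, ..., E_t − 1 need to be stored, and causality of the heat kernel makes all blocks above the diagonal vanish. The following simplified dense sketch illustrates the idea; the naming and the flat storage are our own, not the besthea block vector and matrix classes.

```cpp
#include <vector>

// Applies a lower triangular block Toeplitz matrix, stored via its generating
// blocks V^0, ..., V^{n_blocks-1}, to a block vector w. Each block is dense,
// n x n, stored row-major in blocks[d].
std::vector<double> block_toeplitz_apply(
    const std::vector<std::vector<double>>& blocks,
    const std::vector<double>& w, int n_blocks, int n) {
  std::vector<double> result(n_blocks * n, 0.0);
  for (int k = 0; k < n_blocks; ++k)      // block row (test time step)
    for (int i = 0; i <= k; ++i) {        // block column (trial time step)
      const std::vector<double>& V = blocks[k - i];  // only d = k - i matters
      for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
          result[k * n + r] += V[r * n + c] * w[i * n + c];
    }
  return result;
}
```

The storage cost is thus E_t spatial blocks instead of E_t² and the product costs O(E_t²) block applications, which a fast multipole or FFT-based variant could reduce further.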
Temporal antiderivatives: We start with the integration in time. Integrating with respect to τ leads to the antiderivative G dτ α in (3.5), which involves the error function erf(x) := (2/√π) ∫_0^x exp(−s²) ds. Continuing the integration with respect to t we obtain the antiderivatives V d (r) in (3.6)-(3.7) for δ > 0 and r > 0. For V 0 (r) we obtain a similar expression (3.8). Thus the integrals in (3.3) are linear combinations of the integrals (3.9) and (3.10). Notice that a contribution with a fixed δ can be reused to assemble V d h for several values of d.
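As an illustration of how the error function enters, a candidate closed form for the first temporal antiderivative of the kernel is sketched below. This is a representative identity derived from the standard heat kernel, not necessarily in the normalisation of the paper's (3.5); function names are ours.

```cpp
#include <cmath>

// Candidate temporal antiderivative of the heat kernel,
//   G^{dtau}(r, delta) = erf(r / sqrt(4 alpha delta)) / (4 pi alpha r),
// so that the time integral over one step is a difference of two evaluations.
// For delta -> 0+ it tends to 1 / (4 pi alpha r), i.e. a Laplace-like kernel,
// matching the qualitative behaviour described in the text.
double heat_kernel_antiderivative(double r, double delta, double alpha) {
  return std::erf(r / std::sqrt(4.0 * alpha * delta)) / (4.0 * M_PI * alpha * r);
}
```

For example, the definite integral ∫_0^t G_α(r, t − τ) dτ then equals the difference of the values at δ → 0+ and δ = t, namely erfc(r/√(4αt))/(4παr).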

Stable Evaluations of V d (r) for special cases:
We have to provide stable alternatives of (3.5) and (3.7) for cases where the standard form does not allow a direct evaluation by a computer. In (3.6) with d = 1 and in (3.7) we have to evaluate the limit for δ → 0+ for r > 0. Similarly, we have to consider the corresponding limit of (3.8) for r > 0.

Computation of the Galerkin weights of V d h :
We have to compute the spatial integrals of (3.9) and (3.10). For δ > 0 the integrand in (3.9) is smooth. Therefore, standard quadrature routines can be applied to evaluate the integrals, where we make use of the standard mapping to a reference element γ̂. Here we denote the composition of the mapping and the kernel function by G dτ dt α and the surface area of a triangle γ j by ∆ j .
The integrand G dτ α (x − y, 0) of (3.10) given in (3.11) has a singularity like the Laplace kernel for x = y. Thus we can use standard quadrature routines only if the triangles γ ℓ and γ j are separated. If instead γ ℓ and γ j have a non-empty intersection, i.e. they share a vertex or an edge or are identical, we use regularised quadrature techniques based on the Duffy substitution [16,17]. The integral (3.10) then transforms to a regularised integral over the reference domain. Analogously we deal with the integrals in (3.9) for δ = 0. Although in that case the function r ↦ G dτ dt α (r, 0) does not have a pole at r = 0, it is still not smooth, and we use the regularised quadrature for intersecting triangles as well. This also unifies the implementation for other kernels possibly singular in this case.
If we use discrete test and trial functions with higher polynomial degree in space for the discretisation of the bilinear form ⟨V u, q⟩ Σ , e.g. u h , q h ∈ X 1,0 h , the computation of the matrix entries follows the same lines. In particular, the matrix entries of the d-th block are given by an analogous formula with the respective basis functions.

Double-layer boundary integral operator
For the discretisation of ⟨Ku, q⟩ Σ we replace u with its approximation u h in X 1,0 h and test with a basis function of X 0,0 h to obtain the entries of the Galerkin matrix. Again, we changed the order of integration using Fubini's theorem, which is applicable similarly as in [18, Sect. 8.2.2], because the spatial boundary Γ h is piecewise smooth. An analogous estimate to (3.14) holds for |∇ y G α |. This allows us to exchange the spatial gradient and the time integrals for x − y ≠ 0 [19, Prop. 5.9]. Temporal antiderivatives: As before, we analytically evaluate the temporal integrals. For d > 0 we obtain the corresponding antiderivative from (3.6). Since G dτ dt α depends only on the norm of its first argument, we can express its normal derivative via the derivative with respect to r = |x − y|.

Collecting all intermediate steps brings us to the desired expression for K d (r). Similarly as in (3.17), with (3.16) we obtain the expression for the remaining case. Stable Evaluations of K d (r) for special cases: For a stable evaluation of (3.15) for d = 1, we evaluate (3.18) for δ = 0 and r > 0 by the corresponding limit. Conversely, for δ > 0 the value of α(∂G dτ dt α /∂n y )(r, δ) in r = 0 is given by its limit for r → 0+. This follows by estimating the individual terms and observing (e.g. using L'Hospital's rule) that the limit of g dτ dt exists. Computation of the Galerkin weights of K d h : The layout of the block matrix K h is the same as the one of V h in (3.2).
The individual blocks K d h are set up analogously to the blocks of V h , where the integrals are evaluated in the same way as the integrals we considered for the single-layer operator. In particular, we use the same regularisation technique. This time, we have to deal with a singularity similar to the one of the double-layer boundary integral operator of the Laplacian, see (3.21). Remark (The Galerkin matrix of the operator K ′ T ). The matrix K ⊤x h related to the operator K ′ T is obtained from K h by transposing each block in (3.22) separately, not the matrix as a whole.

Hypersingular boundary integral operator
For functions u h and r h in X 1,0 h one can show that the right-hand side of (2.4), and therefore the bilinear form ⟨Du h , r h ⟩ Σ h , admits the weakly singular integral representation (3.23) (see [14], compare also [8,5]), where u h (y, t n−1 +) denotes the right limit of u h with respect to time at t n−1 .
By inserting the representation of u h and testing with a basis function of X 1,0 h we arrive at (3.24). Changing the order of the integrals is justified as before by Fubini's theorem. Indeed, for the integrals in the first two lines we can argue as in the case of the single-layer operator in Section 3.1. For the integrals in the fourth line it suffices to observe that G α is Lebesgue integrable on Σ h × Γ h , which follows from (3.1). Similarly, for the integrals in the third line an analogous bound holds.

Temporal antiderivatives:
For the first summands in (3.24) we know the temporal antiderivatives from the single-layer boundary integral operator, see (3.6) and (3.8).
For the second part in (3.24) we analytically evaluate the temporal integrals (3.25). For d > 0 we can express them by means of the antiderivative G dt α .

Stable Evaluations of D 2,d (r) for special cases:
The values G dt α (r, 0) in (3.25) for r = 0 are obtained by the limit in (3.11). Additionally, we have to treat the limit for r → 0+ with δ > 0 to evaluate G dt α (r, δ) at r = 0 in a stable way.

Galerkin matrix D h :
The Galerkin matrix D h possesses the same layout as (3.2) and can be split into two parts, D 1 h and D 2 h . For D 2 h we have just computed the temporal antiderivatives. Its blocks D 2,d h are set up analogously to the blocks of V h . Again, these integrals are handled in the same way as those of the single-layer operator. For the matrix D 1 h emerging from the first two lines of (3.24) we can make use of V h . Since the surface curls of spatially piecewise linear functions are piecewise constant on triangles, we can rewrite any of the summands in the first part of (3.24) in terms of the blocks of V h .

Boundary integral equations and systems of linear equations
To solve Dirichlet initial boundary value problems, we consider a Galerkin variational formulation of the weakly singular boundary integral equation (2.2) with u = g and end up with the system of linear equations V h w = (1/2 M h + K h ) g with an L 2 (Σ h ) projection of the Dirichlet data g into X 1,0 h . We have described the matrices V h and K h in Sections 3.1-3.2. The mass matrix M h realises the identity in (2.2) and has a block diagonal form, with each block being the spatial mass matrix scaled by the time step.

For a Neumann initial boundary value problem we solve the Galerkin variational formulation of the hypersingular boundary integral equation (2.3) with w = h. The related system of linear equations is D h u = (1/2 M ⊤ h − K ⊤ h ) h with an L 2 (Σ h ) projection of the Neumann data h into X 0,0 h . We have presented the matrix D h in Section 3.3. In addition we need to assemble the matrices K h and M h . The blockwise transposition can be realised in the application of the matrices.

Single- and double-layer potentials
To evaluate the discretised representation formula (2.1) in x ∈ Ω and t = t k + ε = kh t + ε with ε ∈ [0, h t ) we have to compute the contributions of the single- and double-layer potentials with the antiderivative G dτ α known from (3.5) and its limit for δ → 0+ in (3.11). Since we evaluate the potentials in points x which are not on Γ h , all integrands are smooth and standard quadrature can be used to compute all integrals.

Summary
For better readability, and to provide a reference for a reader implementing the method, we summarise the developed formulae below.
With uniform time steps all matrices A h ∈ {V h , K h , K ⊤x h , D h } possess the block Toeplitz structure (3.2). The hypersingular operator matrix is built as the sum of the two parts D 1 h and D 2 h described in Section 3.3.

Implementation
In this section we discuss an implementation strategy for the assembly of the single-layer matrix V h . All other BEM matrices can be treated analogously. The computationally most intensive part is the evaluation of the antiderivatives G dτ and G dτ dt . Indeed, in addition to the evaluation of the distance between the spatial coordinates x and y, which is the most time-consuming part of the BEM assembly for the Laplace equation, one has to evaluate the exponential and error functions in many quadrature points for all blocks of the Toeplitz matrix. The implementation strategy in shared memory thus follows the ideas presented by the authors previously for 3d space and 2d space-time BEM in [20,21,22]. To make use of modern multicore processors with vector arithmetic units we employ features of modern OpenMP [23], namely threading and SIMD vectorisation. The source code of the besthea library implemented by the authors is publicly available [11].

Assembly of blocks
The naive approach to assemble the Toeplitz matrix (3.2) would be to assemble the blocks V d h one by one, i.e. to loop over the parameter d. Looking at (3.6), this would mean that G dτ dt (·, δ) would have to be evaluated multiple times in all spatial quadrature points for a fixed δ and different values of d. E.g., for δ = 2h t the same kernel would have to be evaluated for all blocks with d ∈ {1, 2, 3}. Taking into account the uniform discretisation of the time interval, one can instead loop over i t := δ/h t and thus evaluate the costly kernel only once. In Tables 5.1 and 5.2 we summarise the relation between d and i t , i.e. we state to which blocks V d h the kernels G dτ dt (·, i t h t ) contribute and vice versa. Note that a similar strategy can also be applied to evaluate the single- and double-layer potentials given in Section 3.5. A sketch of the matrix assembly code is given in Listing 5.1. As pointed out above, we loop over the variable i t = δ/h t and continue with visiting all pairs of test and trial spatial elements (triangles). The loop over the test triangles is distributed among the available OpenMP threads in a dynamic fashion. For each pair of elements the functions evaluate_kernel and add_to_matrix are called to assemble the local contribution and add it to the global matrix multiplied with the test and trial basis functions. We give more details about these procedures in the next subsection.

Local contributions
To exploit the full potential of floating point units we vectorise the code at the level of local contributions to the global matrix. For simplicity we opt for the OpenMP implementation of vector processing similarly as in [20,21,22].
Looking back at (3.13), we approximate the regularised integrals by tensor product quadrature rules of the types (5.1) and (5.2). In both cases we collapse the sums, or loops, into a single one to make the vector of quadrature points as long as possible and thus evaluate the kernel function efficiently. This is shown in Listing 5.2, where the temporal antiderivative is evaluated. The variable size corresponds to M 4 and N 2 from (5.1) and (5.2), respectively. The OpenMP pragma tells the compiler that SIMD vectorisation should be used, that all the underlying arrays are aligned at the 64-byte boundary, and that the vector size should be 8 (we assume double precision arithmetic and the AVX512 instruction set extension). Notice that we also make use of the structure-of-arrays concept, separating the coordinates of the quadrature nodes into separate arrays x1, x2, x3, . . . , to ensure unit-strided access to the data. Earlier work [20,21,22] has shown that this approach is more efficient than an array of vectors.
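Since Listing 5.2 is not reproduced here, the following self-contained sketch illustrates the same pattern: structure-of-arrays storage plus an OpenMP SIMD loop over the collapsed quadrature points. A toy Gaussian stands in for the actual antiderivative G dτ dt, and the function signature is ours, not the besthea interface.

```cpp
#include <cmath>

// Evaluates a kernel in all collapsed quadrature points. Coordinates are kept
// in separate 64-byte aligned arrays (structure of arrays) so that the SIMD
// loop accesses memory with unit stride; simdlen(8) assumes double precision
// and AVX512. Without -fopenmp the pragma is ignored and the loop runs scalar.
void evaluate_kernel(const double* x1, const double* x2, const double* x3,
                     const double* y1, const double* y2, const double* y3,
                     double* kernel, int size, double alpha, double delta) {
#pragma omp simd aligned(x1, x2, x3, y1, y2, y3, kernel : 64) simdlen(8)
  for (int i = 0; i < size; ++i) {
    double d1 = x1[i] - y1[i];
    double d2 = x2[i] - y2[i];
    double d3 = x3[i] - y3[i];
    double norm2 = d1 * d1 + d2 * d2 + d3 * d3;
    // toy kernel standing in for G^{dtau dt}; the real antiderivative also
    // involves the error function of the distance sqrt(norm2)
    kernel[i] = std::exp(-norm2 / (4.0 * alpha * delta));
  }
}
```

The aligned clause is a promise to the compiler; the caller must actually allocate the arrays with 64-byte alignment (e.g. alignas(64) or an aligned allocator).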
After performing and storing the kernel evaluations in kernel by evaluate_kernel for the current pair of elements, we evaluate the test and trial basis functions as shown in Listing 5.3, multiply with kernel, and add value to the respective spatial and temporal indices in the global matrix. Here we use the mapping from Table 5.1. The multiplier for value is determined from (3.6). Again we make use of vectorisation. The add_atomic function makes use of the OpenMP atomic clause to avoid data races between individual threads.
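The atomic accumulation can be sketched as follows; this is an illustrative stand-in for besthea's add_atomic, not its actual source.

```cpp
// Atomic accumulation into the global matrix to avoid data races when several
// threads write contributions of different element pairs to the same entry
// (this happens for continuous, piecewise linear basis functions, whose
// supports overlap at shared mesh nodes).
void add_atomic(double* entry, double value) {
#pragma omp atomic update
  *entry += value;
}
```

Without OpenMP the pragma is ignored and the function degenerates to a plain addition; with OpenMP the update is race-free at the cost of the memory conflicts discussed in Section 6.2.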

Numerical experiments
In this section we perform numerical experiments validating the presented approach both in terms of convergence and scalability in shared memory. The experiments have been performed on the Barbora supercomputer at IT4Innovations National Supercomputing Center, Czech Republic.

Convergence
First of all, we check that the presented semi-analytic evaluation of the integrals and its implementation in [11] are correct. To that end we consider the initial boundary value problem (1.1)-(1.2) with the heat capacity constant α = 0.5 and zero initial conditions in the space-time domain Q := (−1, 1) 3 × (0, 1). We choose the solution u(x, t) = G α (x − y * , t) with y * := (0, 0, 1.5) ⊤ , which allows us to validate our numerical approximation. We consider both the Dirichlet problem with the prescribed boundary datum g := u| Σ and the Neumann problem with the prescribed conormal derivative h := α∂u/∂n. We have described the details of the applied Galerkin methods in Section 3. In addition we use the matrix V 11 h as the essential part of an operator preconditioner for D h [25,26,7], where V 11 h is the realisation of the single-layer operator for functions piecewise linear and globally continuous in space. We check the convergence of the approximations u h and w h corresponding to u and w to the known Cauchy data on a sequence of uniformly refined meshes Σ h .
We consider only tensor product meshes Σ h in this paper. The coarsest one is formed by a surface mesh consisting of 192 triangular elements, i.e. 32 congruent triangles on each face of the cube, and a partition of the time interval (0, 1) into 8 time steps. At each refinement level we quadrisect all triangles and bisect the time steps, i.e. we keep h x ≈ h t . The solution of the BEM system is computed by the FGMRES [27] method with a relative accuracy of 10 −8 .
In Tables 6.1 and 6.2 we provide the convergence results. In the first two columns, E t and E x denote the number of elements in time and space, respectively. The columns labelled with L 2 (Σ h ) contain the relative errors of the approximate Cauchy data with respect to the known solution. These integrals are evaluated using standard tensor product quadrature rules in space and time of sufficiently high orders. The estimated order of convergence provided in the columns denoted by eoc is computed from the errors on two consecutive refinement levels. For comparison, Tables 6.1 and 6.2 contain not only the results for the computed approximations u h and w h , but also for the L 2 (Σ h ) projections u * h and w * h of the known solution. In the last two columns we present convergence results for the evaluation of the representation formula. For this purpose the representation formula was evaluated in 10 4 nodes (x j , t j ) distributed in [−0.5, 0.5] 3 × [0.25, 0.75]. The columns labelled with ℓ 2 contain the corresponding relative pointwise errors. Let us shortly comment on the results in Tables 6.1 and 6.2. In both tables, the L 2 error of the computed approximation follows the best possible error, which is attained by the respective projections. In the case of the Dirichlet problem, the estimated orders of convergence of the L 2 errors vary quite a lot. Asymptotically we would expect at least an order of 0.75, while previous examples indicated that an order of 1 can be attained [5, Thm. 7.4 and Sect. 8.2]. Even though we are probably still in a preasymptotic regime due to the relatively small number of unknowns, which we consider limited by the use of a standard, non-compressed BEM, our computations agree with these expectations. For the evaluation error inside the domain we expect and observe a quadratic convergence order [5, Eq. (7.5)]. Also in the case of the Neumann problem our results agree with the theory. We expect and observe convergence order 1 for the L 2 error of the Dirichlet datum [6, Eq. (7.16)] and order 1.5 for the evaluation error [6, Sect. 7.2.2].
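The estimated order of convergence reduces to a base-2 logarithm of consecutive error ratios, since both the spatial and the temporal mesh widths are halved between refinement levels. A minimal sketch of the reported quantity (the function name is ours):

```cpp
#include <cmath>

// Estimated order of convergence between two consecutive uniform refinement
// levels with mesh width halved in each step: eoc = log2(e_coarse / e_fine).
double estimated_order(double error_coarse, double error_fine) {
  return std::log2(error_coarse / error_fine);
}
```

E.g. errors 0.4 and 0.1 on consecutive levels give eoc = 2, a quadratic rate.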
A different refinement strategy with two subdivision steps in time per spatial refinement step would provide better convergence rates. In total, we observe the expected convergence behaviour, which indicates the correctness of the developed and implemented quadrature routines.

Scalability
The scalability of the besthea library [11] has been tested on the same example as before, but on a fixed mesh with 3072 spatial boundary elements and 32 time steps; the representation formula was evaluated in 1089 · 32 = 34848 space-time points. The library and examples were compiled by the Intel Compiler 19.0.5.281 with the flags -O3 -qopenmp -xcore-avx512 -qopt-zmm-usage=high to make use of the AVX512 instruction set available on the 18-core Intel Xeon Gold 6240 CPU of the Barbora supercomputer. The nodes are configured as dual socket, i.e. every node consists of two such CPUs.
The baseline for our experiments is given by the performance on a single thread. The number of threads is controlled by the KMP_HW_SUBSET environment variable. When using up to 18 threads (a single socket) we set it to KMP_HW_SUBSET=1s,Xc with X denoting the number of threads. This ensures that the threads stay within a single socket. To use all 36 threads we set KMP_HW_SUBSET=2s,18c. In Table 6.3 we provide the assembly times of the BEM matrices and the efficiency of the code. One can see that the efficiency stays above 90 % within a single socket with piecewise constant basis functions, where the contribution to the global matrix does not require atomic addition. For piecewise linear functions there are memory conflicts between individual threads and the efficiency is reduced, although the numbers stay reasonable. When crossing the socket and utilising all 36 cores the assembly times are further reduced, although the efficiency drops. This is caused by the fact that the matrix data is stored in std::vector with the standard allocator, which is not NUMA aware. The first touch policy thus cannot be easily applied and threads can access memory across sockets. This effect could be alleviated by a different storage structure or a raw array. However, the future aim of the besthea library is to use the fast multipole method parallelised in distributed memory via MPI and assign a single process per socket.
Similarly, Table 6.4 provides scalability results for the evaluation of the single- and double-layer potentials, see Section 3.5. Again, the efficiency is above 90 % when the whole socket is populated and drops when accessing both sockets.

Conclusion
The aim of the paper was to provide the readers with semi-analytic formulae for the assembly of boundary element matrices and the evaluation of the representation formula for the heat equation in three spatial dimensions. Throughout the paper a uniform discretisation of the timeline is chosen for simplicity; however, the same antiderivatives can be used on non-uniform grids. Moreover, the approach has been implemented in the publicly available C++ library besthea [11] and thus our results can be used in further BEM projects. In the numerical experiments we have validated that the formulae deliver the expected results.
The provided implementation supplies a fast computation of the entries of the Galerkin matrices using threading and vectorisation. However, it does not scale optimally across NUMA nodes (sockets). This is due to the fact that the main aim of the library is to provide boundary element methods accelerated by the fast multipole method (FMM) and parallelised in distributed memory. The formulae provided here will be used for the near-field entries only, thus further optimisation of the full assembly is not planned (also taking into account the massive memory requirements).
With FMM, a single MPI process will be assigned to a single socket and thus the NUMA effects will be automatically overcome.