An accurate and time-parallel rational exponential integrator for hyperbolic and oscillatory PDEs

Rational exponential integrators (REXI) are a class of numerical methods that are well suited for the time integration of linear partial differential equations with imaginary eigenvalues. Since these methods can be parallelized in time (in addition to the spatial parallelization that is commonly performed), they are well suited to exploit modern high performance computing systems. In this paper, we propose a novel REXI scheme that drastically improves accuracy and efficiency. The chosen approach also allows us to easily determine how many terms are required in the approximation in order to obtain accurate results. We provide comparative numerical simulations for the shallow water equations that highlight the efficiency of our approach and demonstrate that REXI schemes can be efficiently implemented on graphics processing units.


Introduction
In this work, we are interested in simulating linear partial differential equations (PDEs) with purely imaginary eigenvalues of large modulus (i.e. stiff problems). That is, we consider

∂_τ u(τ) = A u(τ),  u(0) = u_0,  σ(A) ⊂ iR,

where σ(A) is the spectrum of A. Such problems arise, for example, in quantum dynamics (e.g. the Schrödinger equation) and wave propagation (e.g. the Helmholtz equation). In addition, solving these problems is an integral part of applying exponential integrators or splitting methods to a large number of nonlinear problems ranging from plasma physics to electrodynamics. Due to the stiff nature of these equations, explicit time stepping methods are forced to take excessively small time steps in order to remain stable. Thus, implicit schemes or exponential integrators are usually preferred. (The numerical computations in this paper are performed using the GPU cluster GPU3 at the University of Innsbruck.)
Approximately a decade ago, the hardware used to run such simulations underwent a paradigm shift. Since frequency scaling essentially ended at that point, the main way to increase performance has been to add more parallelism. Desktop computers now routinely have 16 cores and large vectorization units. This trend is even more pronounced in high performance computing systems, where supercomputers with millions of threads are now in operation. In addition, graphics processing units (GPUs) have come to the forefront, as they are able to outperform central processing units (CPUs) for many scientific computing tasks. This advantage is achieved by providing a massively parallel system; in fact, a sequential program on a GPU would be slower than on a CPU. Therefore, methods which are highly parallelizable and work well on these new computer architectures are needed to take advantage of their computational power. A significant body of research has been accumulated in recent years that considers numerical methods which are well suited for such systems (see, e.g., [15,16,21,27,28]). More specifically, in the context of exponential integrators we refer to [17,18].
Since time stepping methods are inherently sequential, they generally cannot be parallelized in time (although some parallelism can be extracted by constructing methods with parallel stages; see, e.g., [25,29]). Even exponential integrators of high order, which are able to perform large time steps, are mainly implemented sequentially (in time). For example, polynomial approximations, as in [6,10,32], require us to calculate a sequence of matrix-vector products, which cannot be done in parallel. A similar argument holds for Krylov approximations [33]. Of course, these methods are parallelizable in space. But in some situations the scalability is limited and, at some point, increasing the number of computing cores does not further reduce the simulation time [34]. Therefore, such an approach cannot fully exploit modern computer hardware.
A novel idea to overcome this problem are the so-called Rational Exponential Integrator (REXI) schemes, which were introduced in [22]. The basic idea of these methods is to approximate e^{tA} by a linear combination of simple rational functions. The advantage of REXI methods is that the corresponding terms can be calculated independently of each other. Therefore, these methods are highly parallelizable in time. It is worth mentioning that REXI is markedly distinct from time parallelization schemes such as the parareal and similar methods (see, e.g., [19]). In the latter case, a coarse and a fine time integrator are combined to achieve parallelism within an iterative procedure, while in the former the action of the matrix exponential is directly approximated in a way that is amenable to parallelization.
In this paper our goal is twofold. First, we propose a modification to the original REXI scheme that drastically improves the accuracy and efficiency of the method (section 2). The proposed method also allows us to easily determine how many terms the approximation requires in order to obtain accurate results. These theoretical considerations are then confirmed by numerical experiments in section 3. Second, we demonstrate that these types of methods can be efficiently implemented on massively parallel computer architectures. Specifically, we demonstrate an implementation on modern GPUs that yields a drastic speedup compared to the corresponding CPU implementation for the shallow water equations (section 4).

The original and improved REXI schemes
In this section we discuss the derivation of the REXI schemes. Moreover, we reveal some problems of the original scheme in the matrix case and show how they can be eliminated with our new formulation.

The scalar case
In this section we give a brief summary of how REXI approximates e^{ix} with x ∈ R. For more details, see [22,34]. Our notation is the same as in [34]. The three main steps are as follows:
1. Approximate e^{ix} by a sum of Gaussian functions.
2. Approximate each Gaussian function by a sum of rational functions.
3. Combine 1 and 2 to approximate e^{ix} by a sum of rational functions.
Step 1. We start by writing e^{ix} as a linear combination of Gaussian functions [26],

e^{ix} = Σ_{m=−∞}^{∞} b_m ψ_h(x + mh) + ε,   (1)

where

ψ_h(x) = (4π)^{−1/2} e^{−x²/(4h²)}   (2)

and the remainder satisfies |ε| ≤ e^{h²} Σ_{k≠0} e^{−(2πk + h)²}; see Appendix B. Since ε, in general, is smaller than machine precision (see Appendix B or [26]), we neglect it in the following.
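This expansion is easy to check numerically. The following sketch assumes the standard choices ψ_h(x) = (4π)^{−1/2} e^{−x²/(4h²)} and b_m = e^{h²} e^{−imh}, and uses a generous truncation range so that only the remainder ε is visible:

```python
import numpy as np

# Numerical sanity check of step 1 (a sketch; it assumes the Gaussian
# psi_h(x) = (4*pi)**(-1/2) * exp(-x**2/(4*h**2)) and b_m = exp(h**2 - 1j*m*h)).
def psi(x, h):
    return np.exp(-x**2 / (4 * h**2)) / np.sqrt(4 * np.pi)

def gaussian_sum(x, h, M):
    m = np.arange(-M, M + 1)
    b = np.exp(h**2) * np.exp(-1j * m * h)
    return np.sum(b * psi(x + m * h, h))

x = 1.0
for h in (0.3, 0.5, 0.7):
    M = int(abs(x) / h) + 30           # generous truncation for this test
    print(h, abs(gaussian_sum(x, h, M) - np.exp(1j * x)))
```

For h in the range used later (roughly 0.3 to 0.6), the remaining error is at the level of machine precision.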
The parameter h defines the numerical support of the Gaussian function. Our goal now is to determine the coefficients b_m. To that end, equation (1) is transformed to Fourier space. Using the shift property of the Fourier transform, it follows that

δ(ω − 1/(2π)) = ψ̂_h(ω) Σ_m b_m e^{2πimhω},

and therefore we obtain

b_m = h ∫_{−1/(2h)}^{1/(2h)} e^{−2πimhω} δ(ω − 1/(2π)) / ψ̂_h(ω) dω.   (3)

Since

ψ̂_h(ω) = h e^{−4π²h²ω²},   (4)

the sought after coefficients are given by

b_m = e^{h²} e^{−imh}.   (5)

Here, the first constraint on h arises: if h > π, the point ω = 1/(2π) lies outside the domain of integration and the integral in (3) is 0. In practice, this parameter has to be even smaller to produce accurate approximations. Its value is discussed in Section 3.1. Finally, the sum in (1) has to be truncated:

e^{ix} ≈ Σ_{m=−M}^{M} b_m ψ_h(x + mh).   (6)

Therefore, REXI depends on two parameters: M and h. As mentioned earlier, the parameter h < π defines the numerical support of (2), and the parameter M controls, together with h, the interval on which the approximation is sought. It can be shown ([22] and Appendix B) that (6) produces an accurate approximation if

|x| ≤ (M − 11)h.   (7)

Step 2. The second step consists in approximating the Gaussian function ψ_h(x) by a sum of rational functions,

ψ_h(x) ≈ Σ_{l=−L}^{L} a_l / (i x/h + µ + il),   (8)

where µ ∈ R and a_l ∈ C are coefficients. To determine these parameters, the authors in [22] first approximate ψ̂_1(ω) by a linear combination of exponential functions, using the Adamyan–Arov–Krein theory (see [14] for more details). Moving back to physical space, they then obtain

ψ_1(x) ≈ Σ_j c_j / (ix + θ_j/(2π)),   (9)

where Re(θ_j) < 0. In the original REXI scheme [22] only the coefficients θ_j are computed. They set µ = min_j Re(θ_j/(2π)) and then look for an approximation of ψ_1(x) of the form (8) with h = 1,

R(x) = Σ_{l=−L}^{L} a_l / (ix + µ + il),   (10)

where L = 11 is chosen. This slight modification of (9) gives the advantage that all the shifted Gaussians share the same poles, which reduces the number of terms in the next step. Then, they determine the coefficients a_l by minimizing the ℓ^∞ error

max_k |ψ_1(x_k) − R(x_k)|   (11)

over a set of points x_k ∈ [−30, 30]. In Table 1 of [22] the values of the coefficients a_l and µ are given. The corresponding approximation error is less than 7 × 10^{−13}.
We now propose a strategy to reduce the approximation error by using a different approach to determine the coefficients. First, we observe that ψ_1 is a symmetric function, i.e. ψ_1(x) = ψ_1(−x). Therefore, the same property should hold for the corresponding approximation R(x), which is not satisfied in [22]. We remark that R(x) is symmetric if a_l = ā_{−l} for all l, where the bar denotes complex conjugation. To enforce that this property also holds numerically, we rewrite (10) as

R(x) = Re( a_0/(ix + µ) + Σ_{l=1}^{L} [ a_l/(ix + µ + il) + ā_l/(ix + µ − il) ] ),  a_0 ∈ R.   (12)

Then, to find the coefficients a_l, we minimize the ℓ² error

Σ_{k=1}^{K} |ψ_1(x_k) − R(x_k)|²   (13)

on K points x_1, …, x_K. Since R(x) is a linear combination of the coefficients, the approximation can be written as

R(x_k) = G(x_k, µ, L) y,

where G(x_k, µ, L) ∈ R^{1×(2L+1)} and y ∈ R^{2L+1} is of the form y = [a_0, Re(a_1), Re(a_2), …, Re(a_L), Im(a_1), …, Im(a_L)]^T.
The reason why we chose to minimize (13) instead of (11) is that it can be solved more easily and quickly. To minimize (13) we compute the least squares solution of the corresponding linear system, where the points x_k are calculated iteratively. We start with x_1 = 0 and select the next point x_{k+1} with the same strategy that is used for minimizing the interpolation error with Leja points [32]. For L = 24 we obtained the coefficients listed in Table 8. The coefficient µ is determined such that a high accuracy is obtained. For this choice the error in the maximum norm is less than 8 × 10^{−15}.
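The fitting procedure can be sketched as follows. The snippet assembles the real matrix G for the conjugate-symmetric form and solves the least squares problem on a dense grid instead of the iteratively chosen points; the value µ = −3 is only a placeholder, since the optimized µ belonging to Table 8 is not reproduced here:

```python
import numpy as np

# Sketch of the proposed least squares fit of psi_1 by the conjugate-symmetric
# rational sum R(x) = Re(a_0/(ix+mu) + sum_{l>=1}[a_l/(ix+mu+il) + conj(a_l)/(ix+mu-il)]).
# mu = -3.0 is a placeholder; the paper determines mu separately for accuracy.
L, mu = 24, -3.0
x = np.linspace(-30.0, 30.0, 601)      # dense grid instead of Leja-style points
psi1 = np.exp(-x**2 / 4) / np.sqrt(4 * np.pi)

G = np.zeros((x.size, 2 * L + 1))
G[:, 0] = np.real(1 / (1j * x + mu))   # column multiplying the real a_0
for l in range(1, L + 1):
    d = 1 / (1j * x + mu + 1j * l)
    e = 1 / (1j * x + mu - 1j * l)
    G[:, l] = np.real(d) + np.real(e)      # column multiplying Re(a_l)
    G[:, L + l] = np.imag(e) - np.imag(d)  # column multiplying Im(a_l)

y, *_ = np.linalg.lstsq(G, psi1, rcond=None)
R = G @ y
print(np.max(np.abs(R - psi1)))        # fit error on the grid
```

By construction R(x) = R(−x), so the symmetry of ψ_1 is preserved exactly regardless of rounding errors in the fitted coefficients.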
Step 3. The third step is the combination of steps 1 and 2. Inserting the rational approximation of ψ_h into the truncated sum (6) and using i(x + mh)/h + µ + il = (ix + α_{m+l})/h with α_n := h(µ + in) yields

e^{ix} ≈ Σ_{m=−M}^{M} b_m Re( Σ_{l=−L}^{L} h a_l / (ix + α_{m+l}) ).   (14)

For computational efficiency, equation (14) should be rewritten as a single sum. This can be done in different ways. A possibility is to split the coefficients b_m into their real and imaginary parts and pull them inside the (real) approximation of the Gaussian function. This is done in [34]. Let n = m + l, N = M + L, α_n = h(µ + in), set b_m := 0 for |m| > M and

β^{Re}_n = h Σ_{l=−L}^{L} a_l Re(b_{n−l}),  β^{Im}_n = h Σ_{l=−L}^{L} a_l Im(b_{n−l}).

This leads to the following form, which serves as a definition of the numerical approximation REXI(ix) to e^{ix}:

REXI(ix) = Σ_{n=−N}^{N} [ Re( β^{Re}_n/(ix + α_n) ) + i Re( β^{Im}_n/(ix + α_n) ) ].   (15)

Another possibility to rewrite (14) is to compute the real part of the approximation of the Gaussian function explicitly. In contrast to the previous reformulation, we now make explicit use of the fact that x has to be real. We set

c_{1,n} = h Σ_{l=−L}^{L} Re(a_l) b_{n−l},  c_{2,n} = h Σ_{l=−L}^{L} Im(a_l) b_{n−l},

and obtain an equivalent form of (15), which we call REXII, thus

REXII(ix) = (1/2) Σ_{n=−N}^{N} [ (c_{1,n} + i c_{2,n})/(ix + α_n) + (c_{1,n} − i c_{2,n})/(α_{−n} − ix) ].   (16)

Recall that N = M + L and α_n = h(µ + in). In the scalar case both simplifications take roughly the same computational effort and the results are equivalent for real x. This, however, is not true in the matrix case (as we will see in the next section). We now explain how (16) can be used to compute e^{τA} for a given square matrix with purely imaginary eigenvalues. Assume that A is diagonalizable by a matrix V, i.e.

A = V E V^{−1},   (17)

where E is a diagonal matrix. Then it follows that e^{τA} = V e^{τE} V^{−1}. The diagonal entries of e^{τE} can be computed componentwise by REXII,

(e^{τE})_{jj} ≈ REXII(iτλ_j),   (18)

where iλ_j are the eigenvalues of A. Finally, we obtain

e^{τA} ≈ V REXII(τE) V^{−1}.

This is a well known technique for extending scalar functions to matrices: the scalar function is applied to the spectrum of the matrix. Therefore, τA can be substituted for ix in (16), since A is a matrix with purely imaginary eigenvalues.
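The diagonalization argument above can be checked on a small example. The sketch applies the scalar exponential to the spectrum of a real skew-symmetric matrix and compares the result with the exact rotation matrix:

```python
import numpy as np

# Check of the identity e^{tau A} = V e^{tau E} V^{-1} on a small real
# skew-symmetric matrix, whose exponential is an exact rotation.
tau = 0.7
A = np.array([[0.0, -1.0], [1.0, 0.0]])   # eigenvalues +/- i
w, V = np.linalg.eig(A)
expA = V @ np.diag(np.exp(tau * w)) @ np.linalg.inv(V)
exact = np.array([[np.cos(tau), -np.sin(tau)],
                  [np.sin(tau),  np.cos(tau)]])
print(np.max(np.abs(expA - exact)))
```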

The new scheme REXII for matrices
Let A be a matrix with purely imaginary eigenvalues. As explained before, we can substitute τA for ix in (16). This yields

REXII(τA) = (1/2) Σ_{n=−N}^{N} [ (c_{1,n} + i c_{2,n})(τA + α_n I)^{−1} + (c_{1,n} − i c_{2,n})(α_{−n} I − τA)^{−1} ].   (19)

This scheme can be made more efficient by defining C_{1,n} = (c_{1,n} hµ + c_{2,n} hn) and C_{2,n} = i c_{2,n}. As the two fractions in (19) can be combined into a single quotient, we obtain the following extension of REXII for matrices with purely imaginary eigenvalues:

REXII(τA) = Σ_{n=−N}^{N} (C_{1,n} I − C_{2,n} τA)(τA + α_n I)^{−1}(α_{−n} I − τA)^{−1}.   (20)

Recall that N = M + L, where L = 24, and α_n = h(µ + in). We are now in a position to show how for REXII the accuracy of the scalar case translates to the matrix case. This is the content of Theorem 1; its proof is a consequence of (17) and (18).

Theorem 1. Let A = V E V^{−1} be diagonalizable with eigenvalues iλ_j. Then, in any induced matrix norm,

‖e^{τA} − REXII(τA)‖ ≤ cond(V) max_j |e^{iτλ_j} − REXII(iτλ_j)|.   (21)
Regarding the error in the matrix case, by Theorem 1 we have to estimate REXII in the scalar case for every eigenvalue iτλ_j of τA. In the scalar case, REXII is accurate if condition (7) holds for every iτλ_j, namely |τλ_j| ≤ (M − 11)h. This implies in the matrix case that, if we choose M and h such that

τρ(A) ≤ (M − 11)h   (22)

holds true, where ρ(A) is the spectral radius of A, we are guaranteed to obtain results close to machine precision. In practice, since REXII is in general applied to a vector, the error from solving the linear systems also has to be taken into account.
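Condition (22) directly yields the number of terms needed. A small illustrative helper that picks the smallest admissible M:

```python
import math

# Smallest M satisfying condition (22): tau * rho(A) <= (M - 11) * h.
def choose_M(tau, rho, h):
    return math.ceil(tau * rho / h) + 11

print(choose_M(1.0, 70.0, 0.5))  # advection test matrix of section 3.2 -> 151
```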
Remark 1. From (22) we observe that it might be possible to reduce the amount of work by using a shifted matrix A' = A − νI. If ρ(A') < ρ(A), then fewer terms are needed to obtain an accurate approximation of e^{A'}. Note that the matrix e^A can easily be recovered from e^{A'}, since A and I commute:

e^A = e^ν e^{A − νI} = e^ν e^{A'}.

Now, since in our case all eigenvalues are purely imaginary, say σ(A) ⊂ i[ζ_1, ζ_2], the choice ν = i(ζ_1 + ζ_2)/2 yields ρ(A') = (ζ_2 − ζ_1)/2. For real skew-symmetric matrices it is not necessary to perform a shift, since all eigenvalues arise in complex conjugate pairs and therefore ζ_1 = −ζ_2.
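The recovery of e^A from the shifted matrix only uses that A and I commute. A quick numerical check on a toy diagonal matrix with eigenvalues i·{0, 1, 2}, where the shift ν = i halves the spectral radius:

```python
import numpy as np

# Remark 1 numerically: e^{tau A} = e^{tau nu} * e^{tau (A - nu I)}.
# Toy matrix with eigenvalues i*{0, 1, 2}; the shift nu = i halves rho(A).
tau, nu = 0.3, 1j
A = 1j * np.diag([0.0, 1.0, 2.0])
direct = np.diag(np.exp(tau * np.diag(A)))
shifted = np.exp(tau * nu) * np.diag(np.exp(tau * np.diag(A - nu * np.eye(3))))
print(np.max(np.abs(direct - shifted)))
```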
Remark 2. If the matrix A = V E V^{−1} is skew-Hermitian, then the matrix V is unitary. If, moreover, the error in (21) is estimated in the 2-norm, then cond(V) = 1 and therefore the error in the matrix case is the same as the scalar error at the eigenvalues of A.
Remark 3. Since REXII is in general used to evaluate the action of the matrix exponential on a vector f_0, the main cost is to solve two linear systems for each summation term. We are able to reduce the cost if the entries of A and f_0 are real. In particular, we observe that c_{1,n} = c̄_{1,−n}, c_{2,n} = −c̄_{2,−n} and α_n = ᾱ_{−n}. Therefore, the summands for n and −n applied to f_0 are complex conjugates of each other, and we obtain

REXII(τA) f_0 = Re( Σ_{n=0}^{N} Γ_n (C_{1,n} I − C_{2,n} τA)(τA + α_n I)^{−1}(α_{−n} I − τA)^{−1} f_0 ),   (23)

where Γ_0 = 1 and Γ_n = 2 for 1 ≤ n ≤ N.
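The halving of the sum can be illustrated with scalar stand-ins: if coefficients and poles come in conjugate pairs and the data is real, the summands for n and −n are conjugates, so the full sum equals the weighted real part of the n ≥ 0 half. The coefficients below are random placeholders, not REXII coefficients:

```python
import numpy as np

# Remark 3 in miniature: for real data the n and -n summands are conjugate,
# so the sum over -N..N collapses to the n >= 0 half with Gamma_0=1, Gamma_n=2.
rng = np.random.default_rng(0)
N, a, f0 = 5, 0.9, 1.7                  # scalar stand-ins for A and f_0
c = rng.standard_normal(N + 1) + 1j * rng.standard_normal(N + 1)
c[0] = c[0].real                        # the n = 0 coefficient and pole are real
alpha = 0.4 + 1j * np.arange(N + 1)

full = sum(c[n] * f0 / (a + alpha[n]) for n in range(N + 1)) \
     + sum(np.conj(c[n]) * f0 / (a + np.conj(alpha[n])) for n in range(1, N + 1))
gamma = np.array([1.0] + [2.0] * N)
half = np.sum(gamma * np.real(c * f0 / (a + alpha)))
print(abs(full - half))
```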
The original REXI scheme and the differences to REXII

The original REXI scheme was developed for matrices A with real entries. It is based on (15), where τA is substituted for ix. A further simplification comes from the fact that e^{τA} is real, which suggests neglecting the imaginary part of (15). The scheme is thus defined as

REXI(τA) = Re( Σ_{n=−N}^{N} β^{Re}_n (τA + α_n I)^{−1} ),   (24)

where the real part is taken entrywise. At first glance this approximation to e^{τA} might seem preferable, since if REXI is applied to a real vector f_0, only one linear system has to be solved for each summation term. In contrast, REXII has to solve two linear systems for each term. However, REXI has some drawbacks compared to REXII.
• First, if A is real with purely imaginary eigenvalues, then V has to be complex and therefore the matrix V cannot be pulled inside the real part of the approximation. Thus, Theorem 1 does not hold for REXI and condition (22) does not apply. To improve the accuracy, M has to be increased. The rate of convergence of REXI can be slow, and an extremely large value of M might be needed if a stringent tolerance is prescribed, as we will show in the numerical experiments. Therefore, the main advantage of REXII compared to REXI is relation (22), which allows us to choose h and M in an appropriate manner to reproduce the high accuracy of the scalar case. The cost of this is that two linear systems have to be solved per term, and thus the sequential part of the scheme doubles. Despite the increased computational effort, however, our scheme is still significantly faster, since we can choose a much smaller M.
• Second, reducing the sum from 2N + 1 terms to N + 1 terms, as is done in (23), is not feasible for the original REXI scheme, since the coefficients a_l in [22] are not exactly conjugate symmetric, i.e. a_l ≠ ā_{−l}. Doing so anyway, as in [34], results in a reduction of accuracy in the approximation of the Gaussian functions (from 7 × 10^{−13} to 4 × 10^{−8}), and since REXI is at most as accurate as the approximation of the Gaussian function, the overall accuracy is significantly reduced.
Finally, assume that A = iB, where B is a real diagonalizable matrix. Then the transformation matrix V is real. If moreover f_0 is real, the use of (15) for approximating e^{τA} is justified, and we end up with the scheme

REXIE(τA) = Re( Σ_{n=−N}^{N} β^{Re}_n (τA + α_n I)^{−1} ) + i Re( Σ_{n=−N}^{N} β^{Im}_n (τA + α_n I)^{−1} ),   (25)

which we call REXI Extended (REXIE). Note that this formulation is more efficient than REXII, as only one linear system has to be solved per summation term. Matrices of the above form are, for example, purely imaginary skew-Hermitian matrices.

Numerical examples
In this section we provide numerical examples that confirm the theoretical considerations laid out in the previous section. We start with the scalar case and then advance to the matrix case. We always use the conjugate symmetric coefficients given in Table 8, also for the implementation of REXI. This is a slight modification of the algorithm in [34], but it reduces the cost and improves the accuracy, as explained in the previous section.

The scalar case
Note that in the scalar case the absolute error is the same as the relative error, since |e^{ix}| = 1. All the results in this section are calculated sequentially using GNU Octave.
In order to study the approximation, we fix x and plot the error as a function of h and M. The results are shown in Figure 1. We observe that after a certain value of M the error drops immediately to a value close to machine precision. The point at which this happens is well predicted by the bound |x| ≤ (M − 11)h. Furthermore, we observe that we obtain the best results for h ∈ [0.3, 0.6]. Thus it is not recommended to choose h too large or too small: a small value of h leads to a large value of M, which increases the computational cost, while a large value of h results in reduced accuracy. We have also investigated the error as a function of x, where we fixed h and let M be the minimal admissible value given by (7). We observe that for this numerical test the accuracy is always close to machine precision.
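A sharp onset of this kind already appears for the truncated Gaussian expansion (6) alone. The sketch below (assuming the ψ_h and b_m from step 1) shows the error collapsing once Mh comfortably exceeds |x|, consistent with the bound |x| ≤ (M − 11)h:

```python
import numpy as np

# Truncation error of the Gaussian expansion (6) as a function of M
# (a sketch; psi_h and b_m as in step 1 of section 2.1).
def trunc_err(x, h, M):
    m = np.arange(-M, M + 1)
    b = np.exp(h**2) * np.exp(-1j * m * h)
    psi = np.exp(-(x + m * h)**2 / (4 * h**2)) / np.sqrt(4 * np.pi)
    return abs(np.sum(b * psi) - np.exp(1j * x))

x, h = 10.0, 0.5
for M in (15, 25, 31, 40):
    print(M, trunc_err(x, h, M))  # accurate once M >= |x|/h + 11 = 31
```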

The matrix case
In this section, we analyze the behavior of the different algorithms depending on the parameters h and M for two test matrices A 1 and A 2 .
• The matrix A_1 is the second order finite difference approximation of the advection operator ∂_x with periodic boundary conditions on the spatial domain [a, b] = [0, 1]. The discretization step is (b − a)/(n − 1) = 1/70. Therefore, A_1 is a skew-symmetric matrix with eigenvalues λ_j ∈ i[−70, 70]. This matrix has complex eigenvectors, thus we apply REXII (23). With this matrix, REXII has to solve two linear systems per summation term.
• The matrix A_2 is the second order finite difference approximation of the free Schrödinger operator i∂_{xx} on the spatial domain [−1, 1] with periodic boundary conditions. The discretization step is 1/35. Thus A_2 is a purely imaginary skew-Hermitian matrix with eigenvalues λ_j ∈ i[−4900, 0]. Therefore, it is convenient to apply a shift of ν = −2450i. This matrix has real eigenvectors, thus we apply REXIE (25). With this matrix, REXIE has to solve only one linear system per summation term.
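The spectral claims for the first test matrix are easy to verify numerically. The sketch builds one common periodic variant of the centered-difference advection matrix (the grid convention may differ slightly from the one used above) and checks skew-symmetry and the eigenvalue bound:

```python
import numpy as np

# One periodic variant of the second order centered-difference matrix for d/dx
# with spacing 1/70 (grid conventions may differ slightly from the text).
n, dx = 70, 1.0 / 70
A1 = np.zeros((n, n))
for j in range(n):
    A1[j, (j + 1) % n] = 1.0 / (2 * dx)
    A1[j, (j - 1) % n] = -1.0 / (2 * dx)

lam = np.linalg.eigvals(A1)
print(np.max(np.abs(A1 + A1.T)))   # skew-symmetry: 0
print(np.max(np.abs(lam.imag)))    # spectral radius, bounded by 70
```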
As the vector f_0 we used in both cases the discretization of (2 + cos(2πx))^{−1}. The results in this section are computed in GNU Octave. We measure the relative error in the ℓ² norm,

err = ‖f_num − expm(τA) f_0‖_2 / ‖expm(τA) f_0‖_2,

where expm is a Padé approximation of the matrix exponential. The results are shown in Figure 2. We clearly see that REXII is much more accurate than the original REXI scheme for the same M. In addition, the bound (22) predicts the behavior of REXII very well, which is not the case for the REXI scheme.

Linear rotating shallow water equations
In this section we apply our proposed REXII scheme and the original REXI scheme to the linear rotating shallow water equations (LRSW) [30]. This is the same problem that has been investigated in [34]. The LRSW are stated as

∂_τ f = A f,

where the linear operator A (in nondimensional form, with unit wave speed and unit Coriolis parameter) is defined as

A = [ 0, −∂_x, −∂_y ; −∂_x, 0, 1 ; −∂_y, −1, 0 ].

The sought-after function is f = (η, u, v)^T, where η is the displacement of the surface height, u the velocity in the x direction and v the velocity in the y direction. The simulation domain is the bi-periodic unit square [0, 1]², i.e. periodic in the x direction, f(τ, 0, y) = f(τ, 1, y), and periodic in the y direction, f(τ, x, 0) = f(τ, x, 1). The grid resolution is D × D = 128 × 128. To apply REXII (23), we have to solve two linear systems for each term and sum up the calculated solutions. More specifically, for 0 ≤ n ≤ N we have to compute

(A + α_n I) g_{1,n} = f_0,
(α_{−n} I − A) g_{2,n} = g_{1,n},
g_{3,n} = Γ_n C_{2,n} ( g_{1,n} + (C_{1,n}/C_{2,n} − α_{−n}) g_{2,n} ),

and finally sum up the real parts Re(g_{3,n}).
To solve the linear systems, the following strategy is applied (see also [34]). Taking the second and the third component of (A + α_n I) g_{1,n} = f_0 leads to the following equation for the velocities:

[ α_n, 1 ; −1, α_n ] [ u_{1,n} ; v_{1,n} ] = [ u_0 ; v_0 ] + ∇η_{1,n}.   (26)
Taking the first component of (A + α_n I) g_{1,n} = f_0 gives

α_n η_{1,n} − (∂_x u_{1,n} + ∂_y v_{1,n}) = η_0.   (27)

Thus, if we plug (26) into (27), we obtain

∆η_{1,n} − κ_n η_{1,n} = r_{0,n},   (28)

where

κ_n = α_n² + 1,  r_{0,n} = −((α_n² + 1)/α_n) η_0 − (∂_x u_0 + ∂_y v_0) − (1/α_n)(∂_y u_0 − ∂_x v_0).

Therefore, we first solve the Helmholtz problem (28) for η. Then we obtain the velocities by plugging the gradient of η into (26). To solve the linear systems efficiently, all computations are conducted in Fourier space. The parameter M of REXII depends on the spectral radius of A; on a D × D grid we obtain

ρ(A) = √(2(πD)² + 1).   (29)

We will use different initial conditions to illustrate this example. We compare REXII with REXI and the explicit Runge–Kutta time stepping method of order 4 (RK4). For the results in the following test scenarios, we use the maximum norm

err(τ) = max_{r,s} | f_num(τ, x_r, y_s) − f_ref(τ, x_r, y_s) |,

where f_num(τ) is the numerical approximation, f_ref(τ) is a reference solution, and (x_r, y_s) are the grid points of the domain. For REXI and REXII we perform only one single time step of size τ. Since it is not clear for REXI how to choose M and h, we fix h = 0.2, as is done in [34], and vary M. Before presenting the numerical results, we discuss the parallel implementations for both CPU and GPU based systems.
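The only nontrivial kernel is the Helmholtz solve (28), which becomes a pointwise division in Fourier space. A minimal sketch on the periodic unit square, with a placeholder complex shift κ in place of the κ_n built from α_n:

```python
import numpy as np

# Sketch of the Fourier-space Helmholtz solve (Laplacian - kappa) eta = r on
# the periodic unit square; kappa = 3 + 2i is a placeholder complex shift.
D = 64
k = 2j * np.pi * np.fft.fftfreq(D, d=1.0 / D)   # spectral wavenumbers
KX, KY = np.meshgrid(k, k, indexing="ij")
lap = KX**2 + KY**2                              # symbol of the Laplacian
kappa = 3.0 + 2.0j

x = np.arange(D) / D
X, Y = np.meshgrid(x, x, indexing="ij")
eta_exact = np.sin(2 * np.pi * X) * np.cos(4 * np.pi * Y)
r = np.fft.ifft2((lap - kappa) * np.fft.fft2(eta_exact))  # manufactured rhs

eta = np.fft.ifft2(np.fft.fft2(r) / (lap - kappa))        # the actual solve
print(np.max(np.abs(eta - eta_exact)))
```

Since the imaginary part of κ is nonzero, the symbol lap − κ never vanishes and the division is well defined for every mode.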

Implementation
The calculations in this section are conducted on a single GPU (NVIDIA V100) and separately on a CPU (a dual socket Intel Xeon Gold 5118 server with a total of 32 cores). We also performed the calculations on an NVIDIA Titan V. Since the performance of both NVIDIA cards is quite similar, we only report the V100 results here. The code is written in C++, and CUDA 10.0 is used to program the GPU. For the GPU code we use cuFFT [2] to perform FFTs, and the reduction sum at the end is done with the CUB library [1]. For the CPU code we use the FFTW library [3]. The CPU code is implemented sequentially over the summation index n, but each iteration is parallelized with OpenMP [5]. Initially, we used the BLAS implementation found in Intel MKL [4] for the CPU code. However, since our OpenMP implementation allows us to aggregate many operations, which is a big advantage for memory bound problems, it is actually significantly faster. The GPU code parallelizes over the sum in addition to the spatial parallelization. This is done to exploit the massively parallel architecture of modern GPUs.
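The essential structural point is that the summation terms are independent, so they can be distributed over workers; the actual code does this with OpenMP (CPU) and CUDA (GPU), while the Python sketch below only illustrates the pattern with placeholder coefficients and a generic matrix:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# The structure that makes REXI parallel in time: every summation term solves
# its own independent shifted linear system.  Coefficients are placeholders.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
f0 = rng.standard_normal(50)
alphas = 0.5 + 1j * np.arange(8)
coeffs = rng.standard_normal(8) + 1j * rng.standard_normal(8)

def term(n):
    # each term is an independent shifted solve (A + alpha_n I) g = f0
    return coeffs[n] * np.linalg.solve(A + alphas[n] * np.eye(50), f0)

with ThreadPoolExecutor(max_workers=4) as ex:
    parallel = sum(ex.map(term, range(8)))   # terms computed concurrently
serial = sum(term(n) for n in range(8))      # same result, computed in order
print(np.max(np.abs(parallel - serial)))
```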

CPU Implementation
The implementation for the CPU is sequential over the summation range 0 : N and is therefore performed exactly as described in the previous section. We denote by η̂_{1,n} the solution for the variable η of the first linear system, and by η̂_{2,n} the solution for the variable η of the second linear system, where n is the summation index. Similar notation holds for the other variables. To perform the calculation of these variables, we use OpenMP with 32 threads.

GPU Implementation
For the GPU implementation we compute the linear systems in parallel. To facilitate the computation of η̂ we create a matrix of size (N + 1) × D², where each row corresponds to η̂_n for n = 0, …, N. We denote this matrix by E1. In a similar way we create two other matrices for û and v̂, which we call U1 and V1, respectively. We store these matrices in column major format, such that we can apply the fast reduction library CUB to perform the sum over n. Therefore, for REXI we need three such matrices. For REXII we need six of these matrices, because we have to solve two different linear systems. We denote these matrices by E1, E2, U1, U2, V1, V2. These matrices are very memory intensive: each entry requires 16 bytes of memory, since we are using complex double precision arithmetic. Therefore, for REXII, for example, we need approximately 6 · 16 · √2 π D³ τ/h bytes of memory to store these six matrices (using N + 1 ≈ τρ(A)/h with (29)), which corresponds to approximately 90 GB for τ = 50, h = 0.5 and D = 128. The GPU we are working with has only 12 GB of memory. Therefore, we have to divide the total amount of required memory into S parts. Thus, at each step we compute only a part of E1, namely E1_s = E1(sN′ : (s + 1)N′, :), where s = 0, …, S − 1 and N′ = (N + 1)/S.
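The partitioning of the summation range into S parts can be sketched as follows (index bookkeeping only; here the part size is rounded up, whereas N′ = (N + 1)/S above assumes divisibility):

```python
# Splitting the summation range 0..N into S parts to respect the memory
# budget (a sketch of the index bookkeeping, not the CUDA code).
def chunks(N, S):
    Np = (N + 1 + S - 1) // S                # terms per part, rounded up
    return [range(s * Np, min((s + 1) * Np, N + 1)) for s in range(S)]

parts = chunks(100, 8)
print([len(p) for p in parts])
```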
Six kernel functions are required: one computes the block E1_s, and the remaining ones compute the other matrix blocks and perform the reductions. Counting the memory bandwidth and the number of memory read and write operations for each iteration of the algorithm, we can deduce an expected speedup factor of the GPU over the CPU. On the GPU, fewer memory operations are needed, because the implementation computes large blocks of data at once.
The corresponding algorithm is summarized in Algorithm 2 and the overall achieved performance is listed in Table 1. We clearly see that the GPU implementation outperforms the CPU implementation by a significant margin.

Algorithm 2 GPU implementation
FFT of the initial data
divide the required memory into S parts
for s = 0 : S − 1 do
    solve the first linear system for the blocks E1_s, U1_s, V1_s
    solve the second linear system for the blocks E2_s, U2_s, V2_s
    reduce (sum over n) the terms of the current blocks
end for
inverse FFT of the accumulated result

Wave scenario 1

The following initial conditions are used for wave scenario 1:

η(0, x, y) = sin(4πx) cos(2πy) − (1/5) cos(4πx) sin(4πy)
u(0, x, y) = cos(8πx) cos(2πy)
v(0, x, y) = cos(4πx) cos(4πy)   (30)

This is the same problem as considered in [34]. Since we solve the problem in Fourier space, these initial functions are extremely convenient: they are exactly representable in Fourier space with very few terms. This gives us a big advantage when choosing the parameter M of REXII. In (29) we can consider D = 6, since all higher modes do not contribute to the solution. From (22) we deduce that it is sufficient to take

M = ⌈τρ(A)/h⌉ + 11,

with ρ(A) evaluated for D = 6, where τ is the final time, since we perform only one time step. We thus include this problem primarily to provide a comparison with the results obtained in [34]; an investigation with more realistic initial values follows in the subsequent sections. From the results reported in Table 2, we see that REXII outperforms the original REXI scheme and RK4 in both accuracy and execution time by a large margin. For these initial conditions, REXI is more comparable to RK4, and RK4 can even outperform the original REXI method in some cases.
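The claim that this initial data occupies very few Fourier modes is easy to check: on the 128 × 128 grid, the surface height η(0, ·) has only a handful (in fact 8) nonzero Fourier coefficients.

```python
import numpy as np

# Counting the nonzero Fourier modes of the wave scenario 1 surface height.
D = 128
x = np.arange(D) / D
X, Y = np.meshgrid(x, x, indexing="ij")
eta0 = np.sin(4 * np.pi * X) * np.cos(2 * np.pi * Y) \
     - 0.2 * np.cos(4 * np.pi * X) * np.sin(4 * np.pi * Y)
coeffs = np.fft.fft2(eta0) / D**2
nnz = np.sum(np.abs(coeffs) > 1e-12)
print(nnz)  # a handful of modes out of 16384
```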
For longer integration times (see the results in Table 3), both REXI schemes drastically outperform the explicit RK4 method (as we would expect). In addition, we can see that REXII is much more accurate than the original REXI scheme.
Moreover, we observe that for all numerical methods the GPU implementation significantly outperforms the CPU implementation. For REXI and REXII the speedup ranges from approximately a factor of 7 to a factor of 15.
In the following examples we use the M calculated with the ρ(A) given by (29), even though it would be possible to choose a smaller value, as in this example. The reason is that, in general, we cannot expect the initial conditions to be this convenient.

Wave scenario 2
The same type of initial conditions is used as before, with the exception that the frequencies are now much larger. We use:

η(0, x, y) = sin(32πx) cos(16πy) − (1/5) cos(32πx) sin(32πy)
u(0, x, y) = cos(64πx) cos(16πy)
v(0, x, y) = cos(32πx) cos(32πy)

Also in this case (see Table 4), REXII outperforms the other two methods by a large margin. Here, both REXII and REXI work much better than RK4. The reason is that the high frequencies force the explicit time stepping method to take extremely small step sizes. In Table 5 we show that the onset of convergence strongly depends on the parameter M for this problem. This is expected, as REXII in the matrix case works similarly to the scalar case; the outcome can be compared to the results obtained in Figure 1.

Gaussian scenario
The following initial conditions are used for the Gaussian scenario:

η(0, x, y) = exp(−100((x − 0.5)² + (y − 0.5)²))
u(0, x, y) = 10^{−1} sin(64πx) sin(16πy)
v(0, x, y) = 10^{−1} sin(32πx) sin(32πy)

The initial function η, in contrast to the initial functions in the wave scenarios, is not exactly representable in Fourier space by a few modes. Therefore, in this case we do not have the advantage of a small effective spectral radius, nor that a large part of the frequencies are equal to zero.
The numerical results for final times τ = 1 and τ = 50 are shown in Tables 6 and 7, respectively. As before, REXII outperforms the original REXI scheme and RK4 significantly in accuracy. In addition, for both REXII and the original REXI scheme the GPU implementation outperforms the CPU implementation by a factor between 7 and 13.

Conclusion
The original REXI scheme is already a good method to compute the action of the matrix exponential parallel in time; its main downside is its limited accuracy. We proposed a modification of the REXI approach that achieves accuracy close to machine precision at similar or, for some problems, even lower computational cost. The strength of REXI methods is the fact that they can be easily parallelized in time (in addition to the commonly used parallelization in space). We have demonstrated this by providing an implementation on massively parallel graphics processing units. The GPU implementation shows a drastic speedup compared to the CPU implementation.

Acknowledgements
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 847476. The views and opinions expressed herein do not necessarily reflect those of the European Commission.

Appendix B
In [22] the authors sketch how to determine an upper error bound; we analyze this here in more depth. The following expression has to be estimated:

| e^{ix} − Σ_{m=−M}^{M} b_m R((x + mh)/h) |,   (34)

where R is the rational approximation (12) of ψ_1 (recall that ψ_h(ξ) = ψ_1(ξ/h)).
To do so, the sum Σ_{m=−M}^{M} b_m ψ_h(x + mh) is added and subtracted inside the modulus. The resulting first term,

Σ_{m=−M}^{M} |b_m| |ψ_h(x + mh) − R((x + mh)/h)|,   (35)

can be estimated using (5): since |b_m| = e^{h²}, the term (35) is bounded by (2M + 1) e^{h²} δ_2, where δ_2 is the approximation error of the Gaussian function. In our case, δ_2 = 8 × 10^{−15}; see step 2 in Section 2.
The second sum is more interesting regarding the error. It can be shown that for |x| ≤ (M − m_0)h this sum is also negligible, where m_0 is a constant related to how small the first two terms of the sum Σ_{|m|>M} ψ_h(x + mh) are. An easy calculation shows that ψ_h(z) ≤ tol if |z| ≥ 2h √(−log(√(4π) tol)) =: c h.