Single and Dual-GPU Generalized Sparse Eigenvalue Solvers for Finding a Few Low-Order Resonances of a Microwave Cavity Using the Finite-Element Method

This paper presents two fast generalized eigenvalue solvers for sparse symmetric matrices that arise when electromagnetic cavity resonances are investigated using the higher-order finite element method (FEM). To find a few loworder resonances, the locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm with null-space deflation is applied. The computations are expedited by using one or two graphical processing units (GPUs) as accelerators. The performance of the solver is tested for single and dual GPU hardware setups, making use of two types of GPU: NVIDIA Kepler K40s and NVIDIA Pascal P100s. The speed of the GPU-accelerated solvers is compared to a multithreaded implementation of the same algorithm using amulticore central processing unit (CPU, Intel XeonE5-2680 v3 with twelve cores). It was found that, even for the least efficient setups, the GPU-accelerated code is approximately twice as fast as a parallel CPU-only implementation.


Introduction
Computational electromagnetics (CEM) concerns the design and implementation of fast and accurate solvers of Maxwell's equations.The ultimate goal is to solve complex problems in the shortest possible time.To achieve this goal, algorithms have to be considered in the context of the hardware they will be implemented on.As hardware evolves, known numerical techniques need to be reexamined and modified in order to take advantage of the possibilities offered by new hardware.GPU-computing is an emerging trend in computational electromagnetics.Initially, GPU-acceleration was considered for the finite-difference time-domain (FDTD) method [1][2][3][4].In recent years, some effort has also been devoted to taking advantage of the processing power of graphics accelerators for other techniques, such as the method of moments [5][6][7], the multilevel fast multipole (MFML) algorithm [8], and the discontinuous Galekin (DG) method [9], as well as other techniques [10].In this paper, we consider GPU-acceleration of the finite element Method-one of the most powerful and versatile numerical techniques, which is often applied to the solution of boundary value problems that arise in electromagnetics.To date, most publications on FEM in the context of electromagnets have been related to the solution of a linear system of equations [11][12][13] and FEM matrix generation and assembly [13][14][15].In this paper, we concentrate on finding the solution of the generalized eigenvalue problems that emerge when FEM is applied to simulate the free oscillation of microwave cavities.GPU-acceleration of eigenvalue solvers is less frequently disused in the literature, and in most cases standard, rather than generalized, eigenvalue problems are considered [16][17][18][19].Solvers for standard eigenvalue problems are not suitable for FEM analysis, so we focus on generalized eigenvalue problems, extending our recent work on this topic [20].

Finite Element Method
Let us consider a lossless dielectric-loaded electromagnetic cavity Ω enclosed by a perfect electric conductor boundary S. The behavior of the electric field inside Ω and the frequencies of free oscillations inside are determined by the vector Helmholtz equation: where ì E is the electric field, k 0 = ω/c is the wavenumber, r is the relative permittivity, ω is the angular frequency, and c is the speed of light in vacuum.The nonzero frequencies for which the above equation has a nontrivial solution determine the resonances of the cavity.
To solve the vector Helmholtz equation in 3D, we use the finite-element method with a tetrahedral mesh and higherorder basis functions [21], [22].To this end, we work with the weak formulation of (1): where ì w is a vector testing function.We then use the Galerkin method with hierarchical vector basis functions up to the third order [23], arriving at: where K, M ∈ R n×n are large and sparse real-valued stiffness and mass matrices, respectively, e ∈ R n is a vector of expansion coefficients for the basis functions, and n is the number of degrees of freedom (DoF).Additionally, the matrix M is positive definite.Equation (3) defines the generalized eigenvalue problems.Since matrices are real and symmetric, and one of them is positive definite, the eigenvalues are real and nonnegative.

Finding Small Nonzero Eigenvalues
The system matrix that emerges from FEM using higher-order basis functions is large and sparse.This means that direct solution methods for eigenproblems cannot be applied.The only way to find the resonances and modal fields is to use iterative techniques.In practice, we are interested in a few low-order resonances that correspond to eigenvalues with small magnitudes.One algorithm recommended for finding a few algebraically smallest eigenvalues of symmetric positive definite matrix pencils is the locally optimal block preconditioned conjugate gradient (LOBPCG) method proposed by Knyazev [24].With a good preconditioner, this method converges rapidly.Moreover, it does not require any matrix factorization and, because it involves a three-term recurrence, its memory requirements are relatively low.Finally, the basic operation in the algorithm is a sparse matrix-vector product.These features are desirable from the point of view of efficiently using the massive parallelization capabilities offered by state-of-the-art graphics processing units.The LOBPCG algorithm cannot be applied directly to the FEM problems resulting from the weak form (2), as the ∇ × ∇ × (•) operator yields zero when applied to the gradient of any scalar function.As a result, the eigenproblem (3) has multiple zero eigenvalues.Such eigenvalues would be found by LOBPCG, but they are of no interest, since they are not physical solutions fulfilling Maxwell's equations.More precisely, the model field associated with each zero eigenvalue does not fulfill the condition that ∇ • t ì E = 0 .To get rid of these spurious solutions, a zero divergence condition can be imposed on the vectors produced during each iteration of LOBPCG [25].This process can be called null-space filtering, as imposing the divergence-free condition is, in practice, implemented as in a deflation using a projector that ensures that the projected vectors are M-orthogonal to the null-space.The projected vectors thus have no components from the null-space, and are then used to construct model fields (eigenvectors) that correspond to nonzero eigenvalues and fulfill ∇ • t ì E = 0.
The LOBPCG method with null-space filtration is shown in Algorithm 1.The null-space filtering is applied in lines 2 and 13.In practice, this is implemented as a solution of a relatively small system of sparse equations with multiple right-hand sides.The system matrix for null-space filtration is identical to the system matrix obtained when solving the Poisson equation with the Lagrange finite elements.The matrix is given by Y T MY), with Y being a discrete gradient operator with only two nonzero elements per row (1 or -1) [25].To solve the Poisson equation, we use the conjugate gradient method with a hierarchical multilevel preconditioner operating in a V-cycle (Algorithm 2 [22].The hierarchical multilevel preconditioner (Hie-ML) is shown in lines P.1-P.11 of Algorithm 2. The preconditioner takes advantage of the hierarchy of the basis functions used in FEM.In our case, the basis functions up to the third order are considered (QTCuN) [23], so we used three levels in a V-cycle.On the lowest level (P.4), the direct solution of a linear system is needed.However, since on this level only the first-order basis functions are involved, the system is very small and its solution does not pose any problem, For the smoothing that is performed when transfers between levels occur during restriction and prolongation (lines P.6-P.10),we used a few weighted Jacobi iterations.
Another important element is the preconditioner P used in the LOBPCG (Algorithm 1, line 12).In this paper, we employ for this a sequence of operations approximately equal to the inverse of the matrix A = K− κM, on a vector or a block of vectors (κ is a real scalar chosen to be smaller than, but close to, the smallest nonzero eigenvalue of the matrix pencil (K, M).This is implemented as a few iterations of the preconditioned conjugate gradient method (PCG-V)-the same method used for null-space filtering.In other words, in Step 12 of the LOPCG algorithm, instead of computing H k = P −1 R k , we used an approximate solution of the matrix system Note that, except for the direct solution on the lowest level (Algorithm 2, line P.4), almost all other steps in the LOBPCG algorithm can be expressed in terms of a matrix-times-vector operation or simple BLAS1-type operations.Also, the preconditioned conjugate-gradient algorithm involves a sparse matrix-vector product (SpMV).This is typical of all iterative techniques based on Krylov-spaces.In order to execute the iterations rapidly, the performance of the sparse matrix-times vector product should be increased.SpMV multiplication can be executed much more rapidly on GPUs than on multicore Central Processing Units (CPUs).
Algorithm 1 Stabilized LOBPCG for real symmetric problems.Inputs: K and M are sparse real-valued symmetric n × n matrices, is the assumed accuracy, P is a preconditioner, Y is a basis in the nullspace, X 0 is the initial block vector of size n × (q + 1), q is the number of eigenvalues to be computed, and MaxIter is the maximum number of iterations.γ is the shift value used in the sorting algorithm.The outputs of the LOBPCG method are the q smallest nonzero eigenvalues {σ 1 , . . ., σ q }, stored in a diagonal matrix Σ output , and a dense block vector X output of size n × q consisting of the q respective eigenvectors.In this algorithm, σ is the eigenvalue corresponding to k 2 0 from (3).
Algorithm 2 Preconditioned conjugate gradient method (PCG-V) with hierarchical multilevel preconditioner (Hie-ML) operating with a single V-cycle.The algorithm solves the system of equations Ax = b.Matrices A i, j , i, j = 1, 2, 3 are blocks of A corresponding to various orders of basis and testing functions.

Implementation
Two GPU-accelerated implementations of the LOBPCGG algorithm were developed.The first was intended for a single-GPU operation.In the single-GPU version, we applied preconditioning and null-space filtering to the GPU-accelerated PCG-V solvers developed previously [11], [12].In the PCG-V with hierarchical multilevel preconditioner solvers, a direct solution of a sparse system of equations is needed on the lowest level of the V-cycle.This step is executed on a CPU, and we employed a sharedmemory multiprocessing parallel direct sparse solver, Intel MKL PARDISO.For BLAS-like operations on a GPU, we used CUDA and cuSPARSE (v8.0).One exception was the sparse-matrix vector product (SpMV).As explained above, this operation is crucial for all iterative solvers based on Krylov spaces, and should be optimized for the hardware architecture that is used.The performance of this computational kernel depends on a sparse matrix storage format.Here we used the Sliced ELLR-T format that we designed [26] specifically for accelerating the iterative solution of large sparse real-valued and complex-valued systems of equations.In this format, the sparse matrix is first permuted according to the number of nonzero elements, and is then divided into submatrices (slices) consisting of a certain (fixed) number of adjacent rows.Next, the rows in which the number of nonzero entries is not a multiple of 16 are padded with zeros.The goal of this padding is to obtain coalesced and aligned access to global memory.When SpMV is performed, multiple threads are assigned to each row.The number of threads per operation on a particular row depends on the average number of nonzero entries in the rows of the sparse matrix.For the real-valued FEM matrices used in our code, Sliced ELLR-T performs at 36.6 GFlop/s with SpMV on the P100 GPU, while the CSR format used by the SpMV in NVIDIA's original cuSPARSE library (v.8.0) achieves 21.2 GFlop/s for the same operation.
The second GPU-accelerated implementation uses two GPUs to perform preconditioning in Step 12 of the algorithm.For this implementation, we used our dual-GPU PCG-V solver [27].All operations other than Step 12 were carried out on one GPU.It is crucial to the efficiency here to split the system matrix between two GPUs so that computations performed by each GPU are balanced and so data transfer between accelerators in each PCG-V iteration is minimized.It is also essential for the Hie-ML preconditioner that the matrices on each GPU can be split into a nine-block structure, with each block corresponding the order of the basis and testing function.The data distribution technique proposed in [27] first splits tetrahedra into two sets, with approximately the same number of tetrahedra in each set.Each set corresponds to one GPU.To this end, a mesh is represented as a graph and a graph partitioning algorithm is employed; this chooses the tetrahedra for the two sets in such a way that the number of common edges for the subgraphs belonging to both sets is minimal.Based on the subgraphs, the mass and stiffness matrices are generated separately for each GPU [14].This guarantees that the block structure of these matrices preserve the hierarchy of the basis and testing functions.After this step, each GPU contains its local data plus a small number of rows that are common to both sets (typically 1%-2% of the total number of rows).The matrices are then converted to Sliced ELLR-T format.When the SpMV product is evaluated, each GPU uses its local data, also on common rows.Once this has been done, synchronization is enforced and the results of multiplication on common rows are exchanged between GPUs.

Numerical Results
All numerical tests were executed on an Intel Xeon (E5-2680 v3, 2.5 GHz, twelve cores) with 256 GB memory and two hardware setups with one or two GPUs.The GPU accelerators we used were either NVIDIA Tesla K40s (Kepler with 2888 CUDA cores) or P100s (Pascal with 3584 CUDA cores).Each accelerator was equipped with 12 GB GPU RAM.We also developed a reference (CPU-based) implementation based procedures from the Intel Math Kernel Library, so that the reference computation was performed in parallel using twelve CPU cores and procedures optimized for best performance on Intel multicore processors.
In order to investigate the performance of all solver implementations, the problem of finding five low-order resonances of a dielectric loaded resonator was considered.This involves a cylindrical PEC cavity loaded with a dielectric disc ( r = 37) placed on a dielectric support with r = 2.2.The resonator is a lossless variant of the cavity investigated in [28].FEM matrices were generated using the higherorder FEM code described in [29], using the dimensions provided in [28].Table 2 shows the simulation parameters forthe LOBPCG and matrices.Note that the accuracy of the preconditioner PCG−V is set to a high value of 10 −2 .Nullspace filtering was carried out by iteratively solving a small system of equations with k = 320134 rows until the error dropped below 10 −6 .
Table 3 shows the eigenvalues and resonance frequencies for the first five low-order modes of the cavity calculated with LOBPCG with null-space filtering.The values found by all our solvers are identical and are in very good agreement with the resonances computed using CST software.
The times taken by the CPU-accelerated and GPUaccelerated LOBPCG implementations for various steps of the algorithm and hardware setups are shown in Tab. 1.Using GPUs as accelerators involves some overhead due to the extra time needed for memory allocation and transfer in Step 1 of the algorithm.In this step, the GPU implementations are slower than the CPU-only version.However, the time taken by pivotal phases, such as the preconditioner (Phase 12) and the null-space filtering (Phases 2 and 13) is significantly reduced in the GPU implementations.As a result, the time taken by the complete LOBPCG is reduced from 254 seconds to 143 and 108 seconds respectively for the K40 and P100 GPU accelerators.In the dual-GPU arrangement, LOBPCG Phase CPU (T 1 ) GPU (T 2 ) GPU (T 3 ) GPU (T 4 ) GPU (T 5 ) Phase / Acceleration this time is reduced even further, resulting in about 1.9 (K40) and 2.6 times (P100) better performance than the reference CPU-only implementation.
The results for all GPU implementations could be further improved by introducing blocking.In fact, a sparse matrix needs to be multiplied by several vectors in various steps of the LOBPCG algorithm.This multiplication is executed on a GPU faster when the vectors are blocked and collected in a single tall and skinny matrix.To solve sparse systems with multiple right-hand sides, we developed our own computational kernels for the sparse matrix-dense matrix product (SpMM).Using the Sliced-ELLR-T format, the Pascal P100 accelerator obtained 112 GFlop/s on a sparse real-valued matrix with sixteen vectors in a block, while no more than 23 GFlop/s was achievable using cuSPARSE [30].It should however be noted that there is an upper limit on the acceleration of the PCG-V preconditioner.The performance of this solver depends on two factors: (1) the speed of the BLAS1 and sparse matrix multiplications and (2) the direct solution performance on the lowest level of the preconditioner.We used a GPU to accelerate (1), but in all implementations the direct solver is performed using Intel MKL and LU factorization.According to Amdahl's law [31], the maximum acceleration that can be achieved for PCG-V is around 11.The speedups for various scenarios are given in Tab. 4. It can be seen that, with two P100s, we can achieve 4.7, which is a decent result.The Amdahl's law analysis can also be applied to compute the maximal speedup that can be obtained for PCG-V using two GPUs.With two K40s and P100s performing the computations, the upper bounds are 1.66 and 1.46, respectively.Thus, the speedups of 1.3 and 1.1 obtained in our PCG-V implementation are about 76% and 79% of the theoretical maximal dual-GPU performance.Adding a second GPU gives a slight speed improvement importantly also provides extra GPU memory, allowing larger problems to be solved.
Also, even if we somehow managed to reduce the time taken by the most time-consuming steps of the LOBPCG algorithm (the preconditioning in Step 12 and the nullspace filtering in Step 13) in GPU setups to zero, this would subtract 43.1 seconds from the total LOBPCG time for the dual P100 scenario, yielding a maximum speedup of 4.79= 253.9  (96.1−43.1) .Our current result is 2.6, or 54% of the upper bound.

Conclusion
GPU-accelerated solvers for the sparse symmetric generalized eigenvalue problems arising from the FEM simulation of microwave cavities have been described and their performance has been investigated for single and dual GPU hardware setups using two types of GPUs, K40s (NVIDIA Kepler) and P100s (NVIDIA Pascal).The solvers are based on the locally optimal block preconditioned conjugate gra-dient method (LOBPCG) with null-space filtering.We used the conjugate gradient method with a hierarchical multilevel preconditioner operating on a single V-cycle for the preconditioning of the LOBPCG iterations and for filtration.The speeds of the single and dual-GPU solvers for both types of accelerators were compared to a multithreaded implementation of LOBPCG optimized for a multicore central processing unit (CPU, Intel Xeon E5-2680 v3, twelve cores).We found that for all the tested computational scenarios, the use of the graphics accelerators reduces the time to solution by a factor ranging from 1.8 to 2.6.Further speed gains are expected by using blocking in the sparse matrix-vector multiplication and LDL T factorization of matrices on the lowest level of the PCG-V solver.