Parallel Integral Equation-Based Nonoverlapping DDM for Solving Challenging Electromagnetic Scattering Problems of Two Thousand Wavelengths

In this paper, a parallel nonoverlapping and nonconformal domain decomposition method (DDM) is proposed for fast and accurate analysis of electrically large objects in the condition of limited resources. The formulation of nonoverlapping DDM for PEC bodies is derived from combined-field integral equation (CFIE), and an explicit boundary condition is applied to ensure the continuity of electric currents across the boundary. A parallel multilevel fast multipole algorithm (MLFMA) is extended to accelerate matrix-vector multiplications of subdomains as well as the coupling between them, and the coupling between different subdomains is computed in the manner of near field to avoid the storage of the mutual impedance. An improved adaptive direction partitioning scheme is applied to the oct-tree of MLFMA to achieve high parallel efficiency. Numerical examples demonstrate that the proposed method is able to simulate realistic problems with a maximum dimension greater than 2000 wavelengths.


Introduction
Computer simulation has shown its great power in solving electromagnetic (EM) problems and become an indispensable tool in engineering design and analysis.For such simulations, surface integral equation methods (SIE) are widely used, and the method of moments (MoM) using SIE is regarded as an effective and popular method [1,2].Nowadays, accurate electromagnetic scattering analysis of electrically large objects is of vital importance in many engineering applications.For the scatterer with large electric size, high-frequency methods like physical optics (PO) [3] and uniform theory of diffraction (UTD) [4] fail to provide sufficient accuracy, while high-accuracy methods like MoM and finite element method (FEM) are restricted from their large amount of memory requirement.Even the multilevel fast multipole algorithm (MLFMA) [5][6][7] can hardly handle the problem of thousands of wavelengths in the condition of limited computing resources.
In order to solve the problem, two approaches, large-scale parallel computing on supercomputers [8] and domain decomposition method (DDM), are concerned.Parallel computing technique can save the computational time to a large extent, in which the most frequently used parallel programming model is the message passing interface (MPI).But for MoM, it is still not easy to simulate a model of hundreds of wavelengths due to limited computational resources.Recent years' in-depth study has been conducted in DDM based on finite elements (FEM), finite difference, and integral equation methods [9][10][11].Among them, the DDM based on integral equations has successfully simulated scattering problems of PEC objects and radiation problems of antenna, which seems to be a method both feasible and efficient for solving complex and large problems.Basically, the strategies of DDM can be categorized into two types: one is to divide the whole solution region into many subdomains, and each subdomain is enclosed by a closed surface, named enclosed DDM (E-DDM) [12][13][14][15].The other is also to divide the whole solution region into many subdomains, but each subdomain is not required to be enclosed by a closed surface, named unclosed-DDM (U-DDM).More recently, a so-called U-DDM-based discontinuous Galerkin (DG) method has been successfully developed to implement DDM of the SIE MoM and the volume integral equation (VIE) MoM [16][17][18][19].It has been shown that this DG method using Krylov iterative methods is attractive for complex multiscale EM problems.However, so far, this powerful DG has not been derived and investigated for the challenging problems of scattering from electrically large objects.
The aim of this paper is to derive an efficient formulation of E-DDM for analysing EM scattering from electrically large PEC objects whose MLFMA models are too large for the user's computer to accommodate.In this paper, the marriage of an efficient parallel integral equation-based nonoverlapping domain decomposition method (IE-NDDM) and high performance of computing (HPC) is presented.On the basis of the first-order Robin-type transmission conditions (TC) [20], an explicit boundary condition is applied to ensure the continuity of electric currents on the touching faces between two subdomains.With the aid of the TC, the unphysical reflection at subdomain interfaces is suppressed, and each subdomain is allowed to be meshed independently, providing unprecedented flexibility and convenience for the process of mesh generation.The proposed method allows all the computer resources to be used to a single subdomain at a time, the advantage of which is that the imbalanced workload caused by the unevenly divided subdomains is reduced, and at the same time the scale of the problem to be solved is greatly enlarged.For the solution of the interior of each subdomain and the coupling between subdomains, parallel MLFMA is used to speed up the matrix-vector multiplications, and to avoid the storage of the mutual impedance, different subdomains exchange the coupling in the manner of near-field instead of traditional MLFMA with weight functions expanded in the disaggregation stage.An improved adaptive direction partitioning scheme is applied to obtain a good load balance for MLFMA, further improving the capability of the proposed IE-NDDM.Based on the proposed algorithms, we have also developed a hybrid message passing interface (MPI)/OpenMP parallel implementation to fully exploit the recent success of multicore processors and massively parallel distributed memory supercomputers.
The remaining part of this paper is organized as follows.First, we outline the basic principle of IE-NDDM.Then, the parallelization strategy for the iterative process and MLFMA solver is designed, respectively.The numerical results are presented in Section 4 followed by the conclusion.

Nonoverlapping Domain Decomposition Method
2.1.Notations.Compared with overlapped DDM, nonoverlapping DDM only adds touching faces between neighbouring subdomains to make each of them closed, and CFIE can be applied to obtain faster convergence compared to EFIE when using the MLFMA solver.Without loss of generality, consider a PEC object divided into two identical nonoverlapping subdomains which is illuminated by an incident plane wave E inc , H inc shown in Figure 1. S 1 and S 2 are the exterior boundary of subdomains Ω 1 and Ω 2 , respectively.S + m denotes the touching face of subdomain Ω m , and S m is the exterior boundary defined as S m /S + m m = 1, 2 .J m and J + m are the equivalent electric currents on S m and S + m m = 1, 2 , respectively.Take subdomain Ω 1 as an example.According to the equivalence principle, the EFIE and MFIE of the exterior side (free space) are written as: where E s , H s are the scattered fields generated by the equivalent electric current on the PEC surface.The scattered fields are expressed as: where L X = −jk S X r ′ + 1/k 2 ∇ ′ ⋅ X r ′ ∇ G r, r ′ ds ′ and K X = − S X r′ × ∇G r, r′ ds′.η 0 = μ 0 /ε 0 and k = ω μ 0 ε 0 are the wave impedance and wave number in the free Touching-faces  International Journal of Antennas and Propagation space, respectively.G r, r ′ = e −jkR / 4πR is the free space Green's function, where R = r − r′ .When the field point r ∈ S + 1 , the first-order Robin-type transmission condition [18] is widely used on the touching faces expressed as: By applying the transmission condition to the adjacent subdomains, the continuity of electric currents is ensured.Thus, it allows the solution of nonoverlapping DDM equals to that of the original problem.

Galerkin Discrete Formulation.
In order to simplify the presentation, we still consider a decomposed PEC object with two subdomains shown in Figure 1.To solve (1), the RWG functions [21] are used to expand the surface current J residing on S. Assume that there are 2,n r on S 2 .Then, the expansion of J on S can be written as: Based on the idea of nonoverlapping DDM, the touching face S + 1 S − 2 is introduced to make subsurface S 1 S 2 a closed subdomain 1 (2), and the current J i residing on subdomain i i = 1, 2 can be expressed as: In this paper, an explicit boundary condition which comes from (3), is proposed to efficiently solve the closed PEC problem.In essence, it can be considered as an absorbing boundary condition which not only enhances the continuity of electric currents that flow across the cutting contours between two subdomains but also satisfies that the inner electric field, charge accumulation, and electric potential are zero inside the PEC object.The troublesome singularity of G r, r′ on the touching faces of two adjacent subdomains is eliminated, and the interface meshes on S + 1 and S − 2 are allowed to be nonconformal, providing unprecedented flexibility and convenience for mesh generation of complicated objects.More importantly, the convergence rate is improved when using the Gauss-Seidel method.
Generally, for the R subdomain case, the Galerkin test is performed to the weighted linear combination of (1) and the following matrix equation is obtained: where, when i = j, Z ij is the CFIE self-interaction matrix of the ith subdomain Ω i and when i ≠ j, Z ij is the mutual-interaction matrix of subdomains Ω i and Ω j .X i is the unknown current coefficient matrix, and V i is the voltage matrix of subdomain Ω i .R denotes the number of subdomains.At this point, the unknown current coefficients in ( 5) and ( 6), the mutual-interaction matrices Z ij i ≠ j , and the voltage matrices V i and can be expressed in an arranged vector form as: Note that during the execution of the algorithm, the global model equation (8) does not have to be fully constructed since the fast algorithm MLFMA will be applied.

2.3.
Resolution of the DDM Linear System.For the proposed IE-NDDM, the Gauss-Seidel scheme ought to be adapted so as to allow all the computer resources to be used to a single subdomain problem at a time whose whole model is too large for the current computer resources to accommodate.The unknown coefficients on a subdomain will be successively updated by solving the local model equation defined on the subdomain until convergence.The convergence rate is improved by using the explicit boundary condition in (7), and it will be verified in the numerical results.
Given the tolerance δ, at the initial step, k = 0, X i 0 = 0 i = 1, 2 ⋯ R , and at the k + 1th step, the unknown electric current can be expressed as: The iterative steps continue, and the residual error is calculated after each step.When max ε 1 , ε 2 , ⋯, ε n ≤ δ at the kth step, the iterative 3 International Journal of Antennas and Propagation process stops and outputs X k ; otherwise, the process continues.Equation ( 6) can be rewritten as: i.e., the voltage matrix is the superposition of the primitive incident field and the scattering field from the other subdomain Ω j .Set X S + i i a null matrix, and X j is written as: namely, using the explicit boundary condition (7).
It is worth pointing out that at the k + 1th step, there is no need to store the mutual-interaction matrix Z ij .The product of Z ij and X k+1 j j < i or Z ij and X k j j > i can be obtained by solving the inner product of the scattered field on Ω i and the test function, and hence the memory requirement is reduced.The flowchart of the iterative process is shown in Figure 2.
In this paper, a hybrid MPI/OpenMP strategy is also implemented.Take the very large problem for example; we can use M MPI processes on M distributed computing nodes, and the parallelization within each MPI process is attained using OpenMP, for example, 4 threads, which can fully exploit the fast memory access in the shared-memory multicore processors.Figure 3 represents a depiction of such a strategy.Thus, the scale of a single subdomain can be maximized under the condition of the current computing resources, which is the criterion for model division used in this paper.

Acceleration of IE-NDDM Using Parallel MLFMA.
For electrically large problems, the interior of each subdomain and the coupling between subdomains are accelerated using MLFMA to further reduce the memory consumption and computational complexity, and that is to speed up far interactions of Z ii X i and Z ij X j .Based on the addition theorem of Green's function and the oct-tree structure, MLFMA can reduce the computational complexity of a matrix-vector product from O N 2 to O N log N and is able to solve electrically large objects that were previously too large for MoM.The principle about MLFMA and its parallelization scheme can be found in [5][6][7][22][23][24][25]. The parallelization scheme employed by the proposed IE-NDDM is detailed here.
The spatial partitioning and direction partitioning are the key factors for parallelization of MLFMA.Simply using 4 International Journal of Antennas and Propagation spatial partitioning at lower levels while using direction partitioning at higher levels and setting a transition level in the middle level fails to obtain ideal scalability when the number of processes exceeds O N 0 5 [26].To solve this problem, a newly developed adaptive direction partitioning strategy (AdP) [27] is adopted in this paper which combines the spatial and direction partitioning in a seamless manner.For AdP, the number of plane-wave direction partitions varies with tree node descendants at a given level of the MLFMA tree.The directions of the tree nodes that have more descendants are usually partitioned into more portions.This dynamic partitioning strategy is adaptable for modern CPUs that have 8, 10, 12, or other number of cores.As is shown in Figure 4(a), the tree nodes from level 2 to level 4 are assigned to 6 processes, and the processes are marked P0 ~P5 shown on the tree nodes.Take level 3 in Figure 4(a) for example.
The three tree nodes in this level are assigned to one, two, and three processes, respectively.5 International Journal of Antennas and Propagation during the three main stages of MLFMA, which is able to significantly reduce the latency among processes.In this paper, Lagrange polynomial interpolation [28] is used to calculate the translation matrix.The number of sampling points and interpolation points is 6L and 3, respectively, and hence the complexity for filling the translation matrix is approximately reduced to O N .The CPU time and memory requirement for parallel MLFMA are further reduced when solving the interior of each subdomain and the coupling between subdomains, which enables IE-NDDM to solve very large problems in an ordinary computer.

Results and Discussion
In this section, numerical examples are given to study the performance of the proposed IE-NDDM, including the accuracy, the parallel efficiency, and the convergence behaviour.The computing platform used in the section is the HPC cluster from Xidian University (XD-HPC).Each node has two 12-core Intel Xeon 2690-v2 2.2 GHz CPUs and 64 GB memory connected with each other by 56 Gbps Infini-Band network.The parallel GMRES is combined with a block-diagonal preconditioner as the inner iterative solver for MLFMA.The residual error for inner iterative convergence is set to 3.0 × 10 −3 .
3.1.Performance Analysis for a PEC Sphere.First, the accuracy of the proposed parallel IE-NDDM is validated through comparison with the result of Mie series and MLFMA.A PEC sphere divided into four subdomains is simulated in this example with each colour representing one subdomain, the model of which is shown in Figure 5.The diameter of the sphere is 3 m, and the frequency of the incident plane wave is 3 GHz.The whole sphere is modelled by CFIE discretised with a total of 1030251 unknowns.The bistatic RCS results are plotted in Figure 6.From the comparison, it can be seen that the results from the IE-NDDM agree well with those from Mie series and MLFMA.Figure 7 compares the convergence behaviour of the proposed IE-NDDM by employing different boundary conditions.As can be seen, a faster convergence rate is obtained when we use the boundary condition proposed in this paper which takes 8 steps to converge while the traditional one-order Robin-type transmission condition takes 16 steps.

Performance Analysis for a Metallic Missile.
Here, we take a metallic missile as an example to study the computation of current distribution on it and the strong scalability.The incident plane wave propagates towards head (−x), and the polarization direction is +z.The frequency of the plane wave is 10 GHz.The dimension of the missile is 1.99 m × 1.2 m × 0.65 m, and it is divided into three subdomains shown in Figure 8 with each colour representing one subdomain.The outer iterative residual is set 0.001, and the convergence rate has reached 0.00084 at the third step.Currents over the missile surface calculated by MLFMA and IE-NDDM are shown in Figures 9 and 10, respectively.By comparison, it can be seen that the results agree well and it allows the accuracy of IE-NDDM to be shown.Note that the meshes on the touching faces of two adjacent subdomains is nonconformal and the continuity of current is enforced correctly.
In the strong scaling experiment, we examine the solution time required for the missile problem by increasing the number of cores.In parallel computation, six MPI processes are assigned with one computing node, in which four OpenMP threads are used in each MPI process.The computing node in use has two 12-core Intel Xeon E5 2.2 GHz processors.We still consider EM scattering from the missile at 10 GHz and increase the number of computing nodes from 1 to 16.Thus, the total computing cores increase from 24 to 384.The timings of various simulations are given in Figure 11.We see that the speedup is super-linear between 48 and 192 cores.At peak performance on 384 cores, the speedup relative to 24 cores is approximately a twelve times decrease in simulation time.The computational statistics for solving each subdomain and the entire problem are recorded in Table 1.Eight computing nodes are used, and six MPI processes are assigned with one computing node, in which four OpenMP threads are used in each MPI process.We can observe that the parallel IE-NDDM algorithm leads to over 73.7% memory reduction compared with MLFMA.The bistatic RCS results computed by the parallel IE-NDDM algorithm are plotted in Figure 13, in which the MLFMA results are also given for comparison.From the comparison, it can be seen that the results using IE-NDDM agree well with those from MLFMA.Thus, it provides a theoretical support for the solution of larger scattering objects.
3.4.Performance Analysis for a Metallic Aircraft of 2000 Wavelengths.In this example, an extremely electrically large problem, the scattering characteristics of another aircraft, has been solved by the parallel IE-NDDM algorithm.The incident plane wave is toward the bow, and the polarization direction is +z.The aircraft is 19 m long, 13.5 m wide, and 4 m high, divided into ten subdomains shown in Figure 14.For such a large case, the outer iterative residual is increased to 0.01, and 48 computing nodes are used and one MPI process is assigned with one computing node, in which twenty-four OpenMP threads are used in each MPI process.The frequency of the incident plane wave is 33 GHz, and the corresponding electric size of the aircraft is 2111 λ × 1500 λ × 444.4 λ.At this time, the number of unknowns reaches 714,321,303 for the overall model which is too large for MLFMA to obtain the overall result using current computer resources.For the parallel IE-NDDM, the number of unknowns of the largest subdomain in IE-NDDM is 92,154,327, much smaller than the number of unknowns of the overall model.The total CPU time is 65.8 hours, and the peak memory consumption is 1.48 TB when IE-NDDM takes four steps to converge.This indicates that when the computational resources are limited, the proposed method is of great significance especially for the solution of very large problems in engineering application.The bistatic RCS results are plotted in Figure 15.

Conclusions
In this paper, an IE-NDDM using the special explicit boundary condition has been established for analysing EM scattering from electrically large PEC objects whose MLFMA   8 International Journal of Antennas and models are too large for the user's computer to accommodate.The parallel multilevel fast multipole algorithm is used to accelerate the computation of the subdomains and the coupling between them, and the improved adaptive partitioning scheme enhanced its parallel efficiency.The coupling between different subdomains is computed in the manner of near field, and Lagrange polynomial interpolation is applied to the calculation of the translation matrix, which significantly reduces storage and CPU time.In the case of limited resources, the engineering problem of the scattering characteristics of thousands of wavelengths is solved.

Figure 2 :
Figure 2: The flowchart of the iterative process.

Figure 4 (Figure 3 :Figure 4 :
Figure4(a), the tree nodes from level 2 to level 4 are assigned to 6 processes, and the processes are marked P0 ~P5 shown on the tree nodes.Take level 3 in Figure4(a) for example.The three tree nodes in this level are assigned to one, two, and three processes, respectively.Figure4(b) indicates the plane-wave direction partitioning in level 2, and each spot represents a plane-wave direction.It is worth pointing out that there is no special requirement for the number of processes using AdP.Each process is involved in both computing and communication, and hence the amount of computation and communication are more balanced.By utilizing the nonblocking communication technique, communication and computation are overlapped more closely

Figure 5 :Figure 6 :
Figure 5: Model of the sphere divided into four parts.
in this paper First order Robin-type transmission condition

Figure 7 :
Figure 7: Convergence rate of IE-NDDM with respect to the boundary condition.

Figure 12 :
Figure 12: Model of the aircraft divided into five parts.

Figure 14 :
Figure 14: Model of the aircraft divided into ten parts.

Figure 15 :
Figure 15: Bistatic RSC of the aircraft at 33 GHz.

Table 1 :
Computational statistics of aircraft.International Journal of Antennas and Propagation set to 0.005, and the convergence rate has reached 0.0028 at the fourth step.