Highly Parallel Demagnetization Field Calculation Using the Fast Multipole Method on Tetrahedral Meshes with Continuous Sources

The long-range magnetic field is the most time-consuming part of micromagnetic simulations. Improvements on both the numerical and the computational side can relieve problems related to this bottleneck. This work presents an efficient implementation of the Fast Multipole Method (FMM) for the magnetic scalar potential as used in micromagnetics. We assume linearly magnetized tetrahedral sources, treat the near field directly and use analytical integration of the multipole expansion in the far field. This approach tackles important issues like the vectorial and continuous nature of the magnetic field. Using the FMM, the calculations scale linearly in time and memory.


Introduction
Micromagnetic algorithms are an important tool for the simulation of ferromagnetic materials used in electric motors, storage systems and magnetic sensors. In micromagnetic simulations the demagnetization field is the most time-consuming part. Many different algorithms for solving the problem exist, e.g. [12,5,6,18,3,15,4,16,1,11]. Direct solutions compute all pairwise interactions and hence scale with O(N^2) operations, where N denotes the number of interaction partners.
Modern Finite Element Method (FEM) implementations [7,1] achieve linear scaling for the Poisson equation only for closed boundary conditions when a multigrid preconditioner is used [20]. The demagnetization potential has open boundary conditions defined at infinity. For the computation of the potential at the boundary, the hybrid FEM/BEM approach [12] (Finite Element Method with Boundary Element Method coupling) and the shell transformation approach [5] add an additional complexity of O(M^2), where M denotes the number of surface triangles. This can be reduced to O(M log M) by the use of H-matrix approximation [6,18,15] or fast summation techniques utilizing the non-uniform fast Fourier transform (NUFFT) [10].
The Fast Multipole Method (FMM) was originally invented by Greengard and Rokhlin [14] for particle simulations. The version presented in this paper is an extension of the class of fast multipole codes [8,19,22]. It is based on the direct evaluation of volume integrals, putting it in a category with non-uniform grid methods [16], Fast Fourier Transform (FFT) methods [17,1] and tensor grid methods [11]. A useful comparison of existing codes is given in the review [2]. The basis of the presented approach is a spatial discretization of the analytical integral representation of the magnetic scalar potential, where the M subvolumes Ω_j are tetrahedra in which the magnetization M_j is assumed to be linear (i.e. affine basis functions). The FMM approach in this work approximates eq. (2) by treating the near field directly and analytically, and the far field by (exact) numerical integration of a multipole expansion of the integrated convolution kernels. Among the above-mentioned methods only the FEM/BEM approach and the non-uniform FFT approach [10] are suited for irregular grids, but the FMM is expected to scale better to many processors because of its small memory footprint. Some current micromagnetic FMM codes [25,4] only support homogeneously magnetized regular grids. The method presented here works on linearly magnetized tetrahedra, making it a suitable alternative to FEM codes on irregular grids. However, the tetrahedral nature of the sources complicates the FMM procedure by demanding more elaborate approaches for the direct interaction (see section 2.1) and the multipole expansion (see section 2.2). This work manages to preserve the optimal linear scaling of FMM codes with good performance even for small problems (see fig. 6), while using little storage compared to most other algorithms. Low storage requirements are particularly important for large problems.
The paper is organized as follows. We present the relevant mathematical and algorithmic preliminaries of the method in the next section. Implementation details are given in section 3. In the numerics section (section 4) we compare the results for the problem of a homogeneously magnetized box to the analytic solution. Details related to analytical integration over tetrahedra are stated in appendix A.

Fast Multipole Method (FMM)
The fast multipole method is a method for computing sums of the form

u_i = Σ_j D_n(t_i − s_j) Q_j,

where u_i is the potential at the i-th evaluation position, D_n(r) is a so-called kernel function, the t_i are called target points, the s_j source points and the Q_j magnetizations or charges. For a dense kernel this computation would need O(N^2) operations, compared to O(N) for the FMM. As pointed out in [3], the key features of the FMM are:
• a specified acceptable accuracy of computation (in this case a flexible Multipole Acceptance Criterion for trading speed against accuracy; see section 3.2 and [22])
• a hierarchical division of space into source clusters (octree)
• a multipole expansion of the kernel D_n(r)
• (for improved performance) a method for calculating the local expansion from the far-field expansion [21]
A sketch of the FMM algorithm is shown in fig. 1. It involves either the direct computation of the interaction of two vertices (P2P, point to point) or a sequence of five steps (source-to-multipole, multipole-to-multipole, multipole-to-local, local-to-local and local-to-target translations), where steps 2 and 4 (the multipole-to-multipole and local-to-local translations) can be repeated as many times as necessary.
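As a minimal, hypothetical illustration of the trade-off the FMM exploits, the sketch below compares a direct P2P sum of the Coulomb kernel with an order-zero (monopole) expansion about the cluster center for a distant point cloud; all names and values are invented for this example and do not come from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
src = rng.random((50, 3)) * 0.1            # small source cluster near the origin
q = rng.random(50)                          # positive "charges" (illustrative)
target = np.array([10.0, 0.0, 0.0])         # far-away target point

# Direct P2P sum of the Coulomb kernel 1/|t - s|: one term per source.
direct = sum(qi / np.linalg.norm(target - s) for qi, s in zip(q, src))

# Order-0 multipole (monopole) about the cluster center: a single term.
center = src.mean(axis=0)
monopole = q.sum() / np.linalg.norm(target - center)

# For well-separated clusters the expansion error is tiny.
assert abs(direct - monopole) / direct < 1e-2
```

Higher expansion orders shrink the error further; the FMM's octree and acceptance criterion decide where this replacement of many kernel evaluations by a few expansion terms is admissible.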

Direct Vertex Interaction
Each tetrahedron in the mesh is split into four tetrahedral hat functions. These hat functions are then identified with their vertices. The integration from one vertex v1 to another vertex v2 is then carried out by integrating over all hat functions that are non-zero at v1. This leads to the integration shown in eq. (2). The integral (solved in the sections below) results in a vector that describes the interaction from one vertex to another. For a similar approach consult [13]. In this method, targets are treated as simple vertices. A method for integrating the interaction between two polyhedra has recently been published [9].

Tetrahedron Integration Overview
Integration of the kernel function over any tetrahedral domain can be decomposed into four integrals, with the target point replacing one source vertex in each [13]. A two-dimensional analogue of the decomposition is shown in fig. 2. This step simplifies the problem to four integrations over tetrahedra with the target point at a corner vertex. Before integrating, each tetrahedron is rototranslated into a new frame of reference sketched in fig. 3: the target point v1 is placed on the z-axis, the other vertices lie in the xy-plane, and the source vertices v2 and v3 are aligned parallel to the x-axis.
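The rototranslation into the frame of fig. 3 can be sketched as follows. This is an illustrative reconstruction under the stated geometric conventions (target on the z-axis, sources in the xy-plane, edge v2-v3 parallel to the x-axis), not the paper's actual code.

```python
import numpy as np

def rototranslate(v1, v2, v3, v4):
    """Map a tetrahedron into the frame of fig. 3: target vertex v1 on
    the z-axis, source vertices v2, v3, v4 in the xy-plane, and the
    edge v2-v3 parallel to the x-axis."""
    n = np.cross(v3 - v2, v4 - v2)            # normal of the source plane
    n = n / np.linalg.norm(n)
    if np.dot(v1 - v2, n) < 0:                # orient the normal toward v1
        n = -n
    ex = (v3 - v2) / np.linalg.norm(v3 - v2)  # in-plane x-direction
    R = np.vstack([ex, np.cross(n, ex), n])   # orthonormal rotation matrix
    pts = (R @ np.array([v1, v2, v3, v4]).T).T
    # Shift so v1 sits on the z-axis and the source plane becomes z = 0.
    return pts - np.array([pts[0, 0], pts[0, 1], pts[1, 2]])

v = [np.array(p, float) for p in [(2, 3, 4), (1, 0, 1), (3, 1, 0), (0, 2, 2)]]
out = rototranslate(*v)
assert np.allclose(out[0, :2], 0)              # v1 on the z-axis
assert np.allclose(out[1:, 2], 0)              # v2..v4 in the xy-plane
assert np.allclose((out[2] - out[1])[1:], 0)   # v2-v3 parallel to the x-axis
```

Because the map is a rigid motion, all distances and hence the value of the singular integral are preserved.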

Source Magnetization
The integrals can be solved for a general linear magnetization. To this end the magnetization M(r') is split into a geometric part describing the linear interpolation and the vertex magnetizations, where M_t is the linear source field evaluated at the target vertex and M_s is the magnetization at the source vertex, which can be expressed in matrix form.

Tetrahedron Corner Integration
The rototranslated tetrahedron integrals are all singular but can be transformed into regular integrals by the following nonlinear transformation (see also [13]). The transformed integral creates a prism, which can be integrated in the λ-direction, leading to the following triangle integration. Note that a is a 3 × 4 matrix contracting with x̂, ŷ, ẑ and (1, 0, y_t, 1)^T. With the triangle integration boundaries set accordingly, the resulting triangle integrals from eq. (3) are given in appendix A.

Source Expansion
For continuous sources an integral expression (in our case eq. (2)) must be expanded. To this end the Coulomb kernel 1/|r − r'| is expanded as a Taylor polynomial of order p in terms of r' around the origin (Cartesian multipole expansion), where n = (n_x, n_y, n_z) is a multi-index, n! = n_x! n_y! n_z!, |n| = n_x + n_y + n_z and r^n = x^(n_x) y^(n_y) z^(n_z). From a source-point (or field-point) perspective the truncation error is proportional to (α/2)^(p+1), where α is the opening angle 2r/d, with r being the diameter of the source (or field) cell and d the distance between the centers [21]. Plugging the expansion into the expression for the j-th volume source (eq. (2)) yields the expansion coefficients. Because of the polynomial nature of the integrand, the expansion coefficients can be calculated exactly using a quadrature rule of order M = |n| + 1, i.e. using the volume of the tetrahedron |Ω_j| and the quadrature points p_i with associated weights w_i from [24].
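To illustrate why a quadrature rule of matching order integrates the polynomial integrand exactly, the sketch below applies the classic 4-point, degree-2 tetrahedral rule (a standard textbook rule, not necessarily the one tabulated in [24]) to the monomial x·y over the reference tetrahedron.

```python
import numpy as np

# Vertices of the reference tetrahedron {x, y, z >= 0, x + y + z <= 1}.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
volume = 1.0 / 6.0

# Classic 4-point, degree-2 tetrahedral quadrature rule (barycentric form).
a, b = 0.5854101966249685, 0.1381966011250105
bary = np.array([[a, b, b, b], [b, a, b, b], [b, b, a, b], [b, b, b, a]])
weights = np.full(4, 0.25)

pts = bary @ verts                 # Cartesian quadrature points
f = pts[:, 0] * pts[:, 1]          # integrand x * y (a degree-2 polynomial)
quad = volume * np.dot(weights, f)

# The exact integral of x*y over the reference tetrahedron is 1/120.
assert abs(quad - 1.0 / 120.0) < 1e-12
```

The same mechanism makes the expansion coefficients in the text exact: each integrand r'^n is a polynomial of known degree, so a rule of sufficient order reproduces the integral without truncation error.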

Local Expansion
In the target octree cell the potential u(r) is approximated by a power series u(r) = Σ_n L_Ω,n r^n. The local expansion coefficients L_Ω,n can be computed recursively (see [21]).

Hierarchical Space Division
The key to the good scalability of the algorithm is the hierarchical space division. In three dimensions a cubic box enclosing the investigated space is recursively subdivided into eight octants, creating smaller and smaller cubes of space and forming an octree (see fig. 4). The smallest (leaf) cells contain the vertex points of the mesh. These vertices are in turn connected to linear hat functions defined by the adjacent tetrahedra that make up the irregular mesh.
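A minimal sketch of the octant bookkeeping behind such a subdivision (hypothetical helper names, not the paper's implementation): each point maps to one of eight children via one bit per axis.

```python
import numpy as np

def child_index(point, center):
    """Octant (0..7) of `point` relative to a cell `center`: one bit per axis."""
    return sum(1 << k for k in range(3) if point[k] >= center[k])

def subdivide(center, half):
    """Centers of the 8 child cells of a cube given its center and half-width."""
    return [center + half / 2 * np.array([(i >> k & 1) * 2 - 1 for k in range(3)])
            for i in range(8)]

c = np.zeros(3)
kids = subdivide(c, 1.0)          # children have half-width 0.5
p = np.array([0.3, -0.2, 0.4])
i = child_index(p, c)
# The point falls inside the child cell it is assigned to.
assert np.all(np.abs(p - kids[i]) <= 0.5)
```

Repeating this subdivision until each leaf holds a small, bounded number of vertices yields the logarithmic tree depth the scaling argument relies on.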

Multipole Acceptance Criterion [MAC]
The MAC is used to determine whether direct integration or multipole expansion is used for two interaction partners*. The MAC is fulfilled when the combined radii are smaller than the distance between the interaction partners scaled by a factor θ (see fig. 5). In this case the expansion is used, otherwise the direct calculation. The cell radius is the distance between the cell center and the most remote vertex directly connected to the cell; it is not the size of the cell itself. A vertex is directly connected if it shares a tetrahedron with a vertex residing inside the cell.

* Interaction partners can be either a vertex or an octree cell.
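One plausible reading of this criterion as code (the exact θ convention may differ in the actual implementation; the threshold below is an assumption for illustration):

```python
import numpy as np

def mac_accepts(center_a, radius_a, center_b, radius_b, theta=0.5):
    """Multipole Acceptance Criterion: use the expansion only when the
    combined radii are smaller than theta times the center distance."""
    d = np.linalg.norm(center_a - center_b)
    return radius_a + radius_b < theta * d

# Far pair: the expansion is acceptable; near pair: fall back to direct P2P.
assert mac_accepts(np.zeros(3), 1.0, np.array([10.0, 0.0, 0.0]), 1.0)
assert not mac_accepts(np.zeros(3), 1.0, np.array([3.0, 0.0, 0.0]), 1.0)
```

Smaller θ forces more direct evaluations (higher accuracy, more work); larger θ accepts expansions earlier (faster, less accurate), which is the speed/accuracy knob mentioned in section 2.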

Dual Tree Traversal
The dual tree traversal moves down the source and target tree simultaneously, using the MAC to decide whether to perform a multipole-to-local translation (M2L) or to move down a level in the larger tree and repeat the process for the newly created cell pairs. When both trees are at the leaf level, the potential is computed directly. The method and pseudocode for the dual tree traversal are reprinted from [22] in listing 1 and listing 2.
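A sketch of such a dual tree traversal on toy cells, loosely following the scheme of [22] (simplified: cells are plain dictionaries and the recursion only records which operation would be applied):

```python
import math

def traverse(target, source, theta=0.5, ops=None):
    """Dual tree traversal sketch: if the MAC holds, approximate via M2L;
    if both cells are leaves, do P2P; otherwise split the larger cell."""
    if ops is None:
        ops = []
    d = math.dist(target['center'], source['center'])
    if target['radius'] + source['radius'] < theta * d:
        ops.append(('M2L', target['id'], source['id']))
    elif not target['children'] and not source['children']:
        ops.append(('P2P', target['id'], source['id']))
    elif source['children'] and (not target['children']
                                 or source['radius'] >= target['radius']):
        for c in source['children']:
            traverse(target, c, theta, ops)
    else:
        for c in target['children']:
            traverse(c, source, theta, ops)
    return ops

def cell(i, x, r, children=()):
    return {'id': i, 'center': (x, 0.0), 'radius': r, 'children': list(children)}

# Well-separated roots: a single M2L suffices.
far = traverse(cell('T', 0, 2), cell('S', 10, 2))
assert far == [('M2L', 'T', 'S')]

# Nearby cells: the source is split and the leaves interact directly.
near_src = cell('S', 3, 2, [cell('S0', 2, 1), cell('S1', 4, 1)])
near = traverse(cell('T', 0, 2), near_src)
assert near == [('P2P', 'T', 'S0'), ('P2P', 'T', 'S1')]
```

Descending the larger of the two cells keeps the pair sizes balanced, which is what lets the MAC succeed as early as possible.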

Storage Requirements
The algorithmic storage required is mainly composed of tree information, direct interaction coefficients, multipole coefficients, sources and targets. The source and target information obviously scales linearly with the number of sources and targets. The stored direct integration coefficients are mainly a near-field component; they also scale linearly. The number of multipole and local coefficients scales linearly as well, but the argument is a bit more nuanced. Consider a tree with N vertices. The number of octree levels equals log_8(N/a), where a is the number of vertices per leaf cell. The number of nodes on level L is 8^L, so the total number of nodes over all levels is a geometric sum of order O(N/a). Each node must store p(p+1)(p+2)/6 local and multipole coefficients, which scales cubically in the multipole order. Summarizing, the storage scales linearly in the number of sources and targets, with a cubic dependence on the multipole order p.
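The per-node coefficient count can be cross-checked by enumerating the Cartesian multi-indices directly; a small sketch:

```python
def n_coeffs(p):
    """Number of Cartesian multipole/local coefficients for order p:
    multi-indices n = (nx, ny, nz) with |n| = nx + ny + nz < p."""
    return p * (p + 1) * (p + 2) // 6

# Explicit enumeration for p = 4 gives the same count (20 coefficients).
p = 4
count = sum(1 for nx in range(p) for ny in range(p) for nz in range(p)
            if nx + ny + nz < p)
assert count == n_coeffs(p) == 20
```

Doubling the multipole order from 4 to 8 thus raises the per-node coefficient storage from 20 to 120, the cubic dependence noted above.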

Numerics
This section compares the computational results of eq. (2) in an applied physical context. The near-field (P2P) evaluation described in section 2.1 was verified against numerical integration for randomly magnetized, randomly generated tetrahedra, yielding errors on the order of the accuracy of the quadrature.

Comparison with FEM solvers
This work provides two FMM implementations for use on tetrahedral grids: one for linearly magnetized meshes (Tet FMM) and one for point-like dipole clouds (Mag FMM). The focus was placed on speed and scaling. The results show that the FMM is a fast, usable algorithm for solving the stray-field problem which can be integrated into existing micromagnetic codes. Figure 7 shows the difference of the stray-field energy to the analytically computed energy (E_analytic = 1/6 µ0 M_S^2). The energy error is not constant because the number of multipole levels increases with the problem size and the effective θ depends on the varying octree dissection of the mesh. As an estimate, the error for more than 1 million vertices is ΔE ≈ 10^-3 µ0 M_S^2 for maximum multipole order O = 4 and ΔE ≈ 3 × 10^-4 µ0 M_S^2 for O = 6, respectively. A smaller error can be achieved by tightening the MAC parameter θ and increasing the multipole order at a performance cost.
The error of the Mag FMM code is caused by the non-continuous nature of the sources; it is much lower at atomistic scales, where the sources sit at crystal lattice points and the problem genuinely is discrete. The theoretical optimum was computed using the analytically computed potential at each vertex. If the H-field is computed directly, as is the case for some other methods, lower errors can be achieved.
To show the scaling properties of the code, strong scaling on a single node is presented in fig. 8. The code scales nearly optimally up to the 16 available cores. This indicates that it is compute-bound rather than memory-bound and that it would benefit more from upcoming hardware generations than most other methods currently in use.

Conclusion
Nowadays the bottleneck of many core algorithms is transfer speed, and the trend seems to continue in that direction. The scaling properties of the FMM are linear up to many processors and even GPUs [23]. The numerical results show that FMM computation today is comparable in speed to fast FEM implementations when the potential is required for the calculation. In short, the presented FMM is a good fit for micromagnetic calculations now and in the future.