A GPU-Parallelized Interpolation-Based Fast Multipole Method for the Relativistic Space-Charge Field Calculation

The fast multipole method (FMM) has received growing attention in beam physics simulation. In this study, we formulate an interpolation-based FMM for the computation of the relativistic space-charge field. Different from the quasi-electrostatic model, our FMM is formulated in the lab-frame and can be applied without the aid of the Lorentz transformation. In particular, we derive a modified admissibility condition which can effectively control the interpolation error of the proposed FMM. The algorithms and their GPU parallelization are discussed in detail. A package containing serial and GPU-parallelized solvers is implemented in the Julia programming language. The GPU-parallelized solver can reach a speedup of more than a hundred compared to the execution on a single CPU core.


Introduction
The space-charge effect is one of the most important topics in the study of beam physics. In recent years, the fast multipole method (FMM) has attracted increasing attention in the numerical modeling of the space-charge field [1,2,3,4,5]. Compared to the particle-in-cell (PIC) method [6,7,8,9,10], which has been a standard choice in the accelerator physics community for decades, the FMM has the following advantages in the context of beam physics: 1. Through the point-to-point (P2P) operation, FMMs inherently capture point-to-point Coulomb effects, e.g., disorder-induced heating and the Boersch effect, which are non-negligible in simulations of cold and dense beams [4]. 2. The FMM is a gridless algorithm and can effectively handle charged particle beams with complex geometry [2].
There have been many efforts using the FMM for the modeling of electrostatic Coulomb interactions [11,1,4,3]. The electrostatic model is suitable for the study of non-relativistic particle beams, e.g., the simulation of ultrafast electron microscopy [1,4] and the simulation of proton dynamics in synchrotrons [11]. However, for the modeling of energetic electron beams, the consideration of the relativistic effect on the particle field may be essential. One approach to include relativistic effects is the quasi-electrostatic model [12,5], where all particles are assumed stationary in a rest-frame of the particle beam. The electrostatic field on each particle is first solved in the rest-frame, and the corresponding field in the lab-frame is calculated through the Lorentz transformation [8,13]. Because of the assumption made in the quasi-electrostatic model, some adjustments are necessary to incorporate the effect of the momentum spread. For example, to handle particle beams with larger energy spread, a technique called energy binning was proposed, where particles are binned in energy and the total space-charge field comes from the superposition of the source particle fields evaluated in the rest-frame of each bin [14,15]. This study is an extension of our previous work on treecode [16] and consists of two parts. In the first part (Sections 2-6), we formulate an FMM for the efficient computation of the relativistic space-charge field. Our formulation is based on the barycentric Lagrange dual tree traversal (BLDTT) proposed in Ref. [17].
BLDTT uses barycentric Lagrange interpolation for the kernel approximation and dual tree traversal for the construction of interaction lists. Different from the quasi-electrostatic model, the proposed FMM is formulated in the lab-frame and can be applied without the use of the Lorentz transformation. We first introduce the idea of an interpolation-based FMM. After that, we formulate an interpolation-based FMM for the computation of the relativistic space-charge field. In particular, we derive a modified admissibility condition for the cluster-cluster interaction of the relativistic kernel used in the formulated FMM. The algorithms and the implementation details associated with the proposed FMM are also provided. The second part of this study (Sections 8-11) is devoted to the GPU parallelization of the proposed FMM. Different from Ref. [17], which is based on OpenACC (a directive-based programming model) [18], our parallelization is based on the CUDA programming model. We discuss the data structures and the design issues in the implementation of the GPU parallelization. A package containing serial and GPU-parallelized solvers is implemented in the Julia programming language. The performance of the parallel solver is also demonstrated.

The Idea of FMM
In this section, we give a short overview of the interpolation-based FMM [19,17]. Consider two particle clusters S_t and S_s. The total force-field f from the source particles in the cluster S_s applied on a target particle with index i and position x_i ∈ S_t through an interaction kernel g(·,·) can be modeled as

f(x_i) = Σ_{j∈Ŝ_s} g(x_i, x_j) m_j. (2.1)

Here and in the following, Ŝ_t and Ŝ_s represent the index sets of the particles in S_t and S_s, respectively. The symbol m_j is the physical quantity of the source particle with index j.
Although the actual meaning of m_j depends on the physics problem we investigate, without loss of generality, we call it mass throughout this section. The idea of the FMM for a fast evaluation of the summation in (2.1) is based on an approximation of the kernel function by interpolating both the target variable x_i and the source variable x_j:

g(x_i, x_j) ≈ Σ_μ Σ_ν ℓ_{S_t,μ}(x_i) g(ξ_{S_t,μ}, ξ_{S_s,ν}) ℓ_{S_s,ν}(x_j). (2.2)

Here, we define the bounding box Q of a cluster S as Q := [a_x, b_x] × [a_y, b_y] × [a_z, b_z] with a_g = min_{i∈Ŝ}{g_i}, b_g = max_{i∈Ŝ}{g_i} and g ∈ {x, y, z}. The Lagrange basis polynomials over the bounding boxes of S_t and S_s are defined as

ℓ_{S_t,μ}(x_i) := ℓ^x_{S_t,μ_1}(x_i) · ℓ^y_{S_t,μ_2}(y_i) · ℓ^z_{S_t,μ_3}(z_i),
ℓ_{S_s,ν}(x_j) := ℓ^x_{S_s,ν_1}(x_j) · ℓ^y_{S_s,ν_2}(y_j) · ℓ^z_{S_s,ν_3}(z_j),

with the corresponding interpolation points ξ_{S_t,μ} := (ξ^x_{S_t,μ_1}, ξ^y_{S_t,μ_2}, ξ^z_{S_t,μ_3}) and ξ_{S_s,ν} := (ξ^x_{S_s,ν_1}, ξ^y_{S_s,ν_2}, ξ^z_{S_s,ν_3}). Substituting (2.2) into (2.1), we have

f(x_i) ≈ Σ_μ ℓ_{S_t,μ}(x_i) Σ_ν g(ξ_{S_t,μ}, ξ_{S_s,ν}) Σ_{j∈Ŝ_s} ℓ_{S_s,ν}(x_j) m_j. (2.3)

By observing (2.3), we identify four of the FMM kernels:

• P2M (point to multipole): the micro particles in the cluster S_s are aggregated into a few macro particles, and the mass of each macro particle (M_{S_s,ν}, also called multipole) can be computed by

M_{S_s,ν} := Σ_{j∈Ŝ_s} ℓ_{S_s,ν}(x_j) m_j. (2.4)

• M2L (multipole to local): the multipoles of the source cluster are used to evaluate the force-fields acting on the macro particles (L_{S_t,μ}, also called local fields) in the target cluster S_t:

L_{S_t,μ} := Σ_ν g(ξ_{S_t,μ}, ξ_{S_s,ν}) M_{S_s,ν}. (2.5)

• L2P (local to point): in the target cluster, the effective force-fields acting on the macro particles are transferred to the micro particle at x_i by

f(x_i) ≈ Σ_μ ℓ_{S_t,μ}(x_i) L_{S_t,μ}. (2.6)

• P2P (point to point): if S_t and S_s do not fulfill an admissibility condition (ADMC, also called multipole acceptance criterion, MAC, in some literature [20,21]), so that (2.2) is not applicable, the force-field is calculated directly by

f(x_i) = Σ_{j∈Ŝ_s} g(x_i, x_j) m_j. (2.7)

This also applies for the case S_t = S_s = S, where i, j ∈ Ŝ and i ≠ j.

One main feature of the FMM is the treatment of the cluster-cluster interaction (M2L) through macro particles; hence, the total number of operations for the evaluation of the force-fields can be drastically reduced. In the FMM, we first partition all particles in the system into a hierarchy of clusters (cluster tree). If we directly use (2.4) to compute the multipole of each cluster, the number of operations for computing the multipoles of the whole cluster tree is of order O(κ N N_r), under the assumption that the number of macro particles used for the approximation and the number of micro particles contained in a leaf cluster are both N_r, with κ the depth of the tree and N the total number of particles. To reduce the total number of operations for computing the multipoles, we can make use of the following property of polynomial interpolation, stated in Theorem 1.
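As a concrete illustration of the pipeline (2.1)-(2.6), the following sketch (Python with NumPy; all names are ours, not from the paper's Julia package, and the kernel is reduced to one dimension) approximates a Coulomb-like kernel between two well-separated clusters by interpolating both variables at Chebyshev points, and compares the P2M-M2L-L2P evaluation against direct summation:

```python
import numpy as np

def cheb_points(a, b, n):
    """Chebyshev points of the second kind on [a, b]; n + 1 points."""
    k = np.arange(n + 1)
    return 0.5 * (a + b) + 0.5 * (b - a) * np.cos(np.pi * k / n)

def lagrange_basis(xi, x):
    """Matrix L with L[p, m] = ell_m(x[p]) for nodes xi."""
    L = np.ones((len(x), len(xi)))
    for m in range(len(xi)):
        for k in range(len(xi)):
            if k != m:
                L[:, m] *= (x - xi[k]) / (xi[m] - xi[k])
    return L

rng = np.random.default_rng(0)
xt = rng.uniform(0.0, 1.0, 50)       # target cluster on [0, 1]
xs = rng.uniform(3.0, 4.0, 60)       # well-separated source cluster on [3, 4]
m = rng.uniform(size=60)             # source "masses"

g = lambda x, y: 1.0 / np.abs(x[:, None] - y[None, :])  # 1D Coulomb-like kernel

f_direct = g(xt, xs) @ m             # direct evaluation of (2.1)

n = 8                                # interpolation degree
xi_t, xi_s = cheb_points(0.0, 1.0, n), cheb_points(3.0, 4.0, n)
Lt, Ls = lagrange_basis(xi_t, xt), lagrange_basis(xi_s, xs)
M = Ls.T @ m                         # P2M: multipoles, cf. (2.4)
Lloc = g(xi_t, xi_s) @ M             # M2L: local fields, cf. (2.5)
f_fmm = Lt @ Lloc                    # L2P, cf. (2.6)

err = np.max(np.abs(f_fmm - f_direct) / np.abs(f_direct))
```

Because the clusters are well separated, a degree-8 interpolation already reproduces the direct sum to high relative accuracy; the cost of the interaction drops from 50 × 60 kernel evaluations to 9 × 9.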
Theorem 1. If P(x) is a polynomial function of degree n, we have

P(x) = Σ_{k=0}^{n} ℓ_{S,k}(x) P(ξ_{S,k}), (2.8)

with ℓ_{S,k}(x) denoting the Lagrange basis for polynomials of degree ≤ n for the interpolation point ξ_{S,k} ∈ S.
This equality follows from the fundamental theorem of algebra, since both the LHS and the RHS have the same values at the n + 1 points {ξ_{S,k} | k = 0, …, n} and the RHS is a polynomial of degree n. By using Theorem 1, we can introduce two further procedures and two kernels of the FMM:

• Upward Pass: a source particle cluster S is subdivided into a hierarchy of clusters of depth κ (called a cluster tree [22,23]). The multipoles of each cluster can be computed from the multipoles of its children clusters in view of Theorem 1:

M_{S,ν} = Σ_{S′∈children(S)} Σ_{ν′} ℓ_{S,ν}(ξ_{S′,ν′}) M_{S′,ν′}. (2.9)

Equation (2.9) is the formula of the M2M (multipole to multipole) kernel. In the FMM, the multipoles of the source cluster tree are updated by a procedure called the upward pass. In the upward pass, the multipoles of the leaf clusters are first evaluated with P2M (2.4); then, starting from the second deepest level of the cluster tree (i.e., level κ − 1), we apply M2M (2.9) to compute the multipoles of each cluster at each level (level by level).

• Downward Pass: a target cluster S is subdivided into a cluster tree of depth κ, and each target particle (say particle i) is contained in a sequence of clusters {S_l | l = 0, …, κ}, one from each level l, with S_{l+1} ⊂ S_l and S_0 = S. The force-field on the target particle i can be calculated by

f(x_i) ≈ Σ_μ ℓ_{S_κ,μ}(x_i) L̃_{S_κ,μ}. (2.10)

Here, we define the "cumulative local field" L̃_{S_l,μ}, which follows the recursive relation

L̃_{S_{l+1},μ} = Σ_{μ′} ℓ_{S_l,μ′}(ξ_{S_{l+1},μ}) L̃_{S_l,μ′} + L_{S_{l+1},μ}, with L̃_{S_0,μ} = L_{S_0,μ}. (2.11)

Equation (2.11) is the formula of the L2L (local to local) kernel. Equation (2.10) can be proved by using (2.8), (2.11) and mathematical induction (cf. Appendix A). Therefore, during the downward pass of the FMM, we first perform L2L (2.11) to calculate the cumulative local fields down to the deepest-level clusters S_κ; afterward, we transfer L̃_{S_κ,μ} to the target particles contained in S_κ via L2P (2.10).
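Theorem 1 can be checked numerically: interpolating a degree-n polynomial at any n + 1 distinct nodes reproduces the polynomial exactly (even outside the node interval), which is the property that M2M and L2L rely on. A minimal Python sketch, with illustrative names:

```python
import numpy as np

def lagrange_eval(xi, yi, x):
    """Evaluate the Lagrange interpolant through (xi, yi) at points x."""
    out = np.zeros_like(x, dtype=float)
    for m in range(len(xi)):
        basis = np.ones_like(x, dtype=float)
        for k in range(len(xi)):
            if k != m:
                basis *= (x - xi[k]) / (xi[m] - xi[k])
        out += yi[m] * basis
    return out

P = lambda x: 2.0 * x**3 - x + 0.5     # polynomial of degree n = 3
xi = np.linspace(-1.0, 1.0, 4)         # any 4 distinct interpolation nodes
x = np.linspace(-2.0, 2.0, 11)         # evaluation points, also outside [-1, 1]

# Theorem 1: the interpolant of a degree-3 polynomial at 4 nodes is the polynomial
err = np.max(np.abs(lagrange_eval(xi, P(xi), x) - P(x)))
```

The residual `err` is at the level of floating-point round-off; the identity holds exactly in exact arithmetic.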

FMM Formulation for the Efficient Computation of Relativistic Space-Charge Field
Consider a relativistic charged particle beam moving in the z-direction. Inside the particle beam, the space-charge field from a source particle with position x_j exerted on a target particle with position x_i can be approximately written as [16]

E(x_i, x_j) ≈ q/(4πε_0) γ_j g(x_i, x_j) and B(x_i, x_j) ≈ q/(4πε_0 c_0) p_j × g(x_i, x_j), (3.1)

with the kernel function, called the "relativistic kernel",

g(x_i, x_j) := γ̄ (x_i − x_j) / ((x_i − x_j)² + (y_i − y_j)² + γ̄² (z_i − z_j)²)^{3/2}, (3.2)

where x_i − x_j in the numerator denotes the position difference vector, and γ_j = 1/√(1 − |β_j|²) and p_j = γ_j β_j are the Lorentz factor and the normalized momentum of the particle, respectively, with β_j the particle velocity normalized to the speed of light c_0. Here, γ̄ is the average Lorentz factor and can be computed by γ̄² = 1 + p̄ · p̄ with the average momentum of the particle beam p̄. Throughout this study, we assume that all particles in the beam are of the same type with charge q.

Given a target particle with position x_i contained in a target cluster S_t, the space-charge field from all the particles in the cluster S_s experienced by this target particle can be computed approximately by applying (2.2) to (3.1):

E(x_i) ≈ q/(4πε_0) Σ_μ ℓ_{S_t,μ}(x_i) Σ_ν g(ξ_{S_t,μ}, ξ_{S_s,ν}) γ_{S_s,ν},
B(x_i) ≈ q/(4πε_0 c_0) Σ_μ ℓ_{S_t,μ}(x_i) Σ_ν p_{S_s,ν} × g(ξ_{S_t,μ}, ξ_{S_s,ν}).

Here, we introduce the effective Lorentz factor and the effective momentum of a macro particle with position ξ_{S_s,ν} and index ν in the cluster S_s as

γ_{S_s,ν} := Σ_{j∈Ŝ_s} ℓ_{S_s,ν}(x_j) γ_j and p_{S_s,ν} := Σ_{j∈Ŝ_s} ℓ_{S_s,ν}(x_j) p_j.

Similarly, the effective electric and magnetic fields experienced by a macro particle with position ξ_{S_t,μ} and index μ in the target cluster S_t are defined as

E_{S_t,μ} := q/(4πε_0) Σ_ν g(ξ_{S_t,μ}, ξ_{S_s,ν}) γ_{S_s,ν} and B_{S_t,μ} := q/(4πε_0 c_0) Σ_ν p_{S_s,ν} × g(ξ_{S_t,μ}, ξ_{S_s,ν}).
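A direct (P2P-style) evaluation of (3.1) can be sketched as follows. The explicit kernel form used here, g(x_i, x_j) = γ̄ (x_i − x_j)/‖s ∘ (x_i − x_j)‖³ with s = (1, 1, γ̄), is our reading of the relativistic kernel; all function names are illustrative:

```python
import numpy as np

EPS0, C0 = 8.8541878128e-12, 2.99792458e8   # vacuum permittivity, speed of light
Q = -1.602176634e-19                        # electron charge [C]

def rel_kernel(xi, xj, gbar):
    """Assumed stretched-norm form of the relativistic kernel:
    g(xi, xj) = gbar * (xi - xj) / ||s o (xi - xj)||^3 with s = (1, 1, gbar)."""
    s = np.array([1.0, 1.0, gbar])
    r = xi - xj
    return gbar * r / np.linalg.norm(s * r) ** 3

def space_charge_field(xi, xs, ps):
    """Direct sum of E and B at a target xi from sources xs with momenta ps."""
    gammas = np.sqrt(1.0 + np.sum(ps**2, axis=1))      # per-particle gamma_j
    gbar = np.sqrt(1.0 + np.sum(ps.mean(axis=0)**2))   # average Lorentz factor
    E, B = np.zeros(3), np.zeros(3)
    for xj, pj, gj in zip(xs, ps, gammas):
        gvec = rel_kernel(xi, xj, gbar)
        E += Q / (4 * np.pi * EPS0) * gj * gvec
        B += Q / (4 * np.pi * EPS0 * C0) * np.cross(pj, gvec)
    return E, B

rng = np.random.default_rng(1)
xs = rng.normal(scale=1e-3, size=(100, 3))   # 100 sources, mm-scale bunch
ps = np.tile([0.0, 0.0, 50.0], (100, 1))     # cold beam with p_z = 50
E, B = space_charge_field(np.array([0.0, 0.0, 2e-3]), xs, ps)
```

For a cold beam with purely longitudinal momentum, p_j × g has no z-component, so the resulting magnetic field is purely transverse, as expected physically.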

Admissibility Condition for Cluster-Cluster Interaction of the Relativistic Kernel
In the previous section, we used Lagrange interpolation to approximate the space-charge field on a target particle in a relativistic particle beam. It is also important to know when this approximation can be applied. To answer this question, we investigate the interpolation error of the relativistic kernel,

|g(x_i, x_j) − Σ_μ Σ_ν ℓ_{S_t,μ}(x_i) g(ξ_{S_t,μ}, ξ_{S_s,ν}) ℓ_{S_s,ν}(x_j)| ≤ B_t + B_s. (4.1)

Here, B_t and B_s are the interpolation error bounds with respect to the target variable x_i := (x_i, y_i, z_i) and the source variable x_j := (x_j, y_j, z_j), where the bounding boxes of S_t and S_s are Q_t and Q_s, respectively. Following an analysis similar to that in Ref. [16], the error bounds B_t and B_s decay with the ratio between the stretched diameter of the respective cluster and the stretched distance between the two clusters, where we define a stretched vector s := (s_x, s_y, s_z) = (1, 1, γ̄) with the average Lorentz factor γ̄ in (3.2). The symbol ∘ : R^n × R^n → R^n denotes the component-wise product of two vectors. Here, the stretched diameter of a cluster and the stretched distance between two clusters S_t and S_s are

diam_{(s_x,s_y,s_z)}(S) := max_{x,y∈Q} ‖s ∘ (x − y)‖ and dist_{(s_x,s_y,s_z)}(S_t, S_s) := min_{x∈Q_t, y∈Q_s} ‖s ∘ (x − y)‖. (4.2)

Therefore, the interpolation error (4.1) can be bounded in terms of the ratio of the stretched diameter to the stretched distance, and we can define an admissibility condition for the cluster-cluster interaction of the relativistic kernel by

max{diam_s(S_t), diam_s(S_s)} ≤ η dist_s(S_t, S_s), (4.3)

with some admissibility parameter η ∈ R_{>0} which can be chosen to control the interpolation error bound.

Besides deriving the stretched admissibility condition for the relativistic kernel, it is possible to bypass the mathematical analysis by using the Lorentz transformation. In the rest-frame of a particle beam with p̄ = 0, the relativistic kernel g is approximately equal to the electrostatic kernel, and the conventional admissibility condition can be used to control the interpolation error. However, as already discussed in our previous work [16], this approach can result in a larger error when a particle beam with larger momentum spread is considered, because the distance of each target-source pair in the rest-frame of the particle beam is not correctly evaluated. Therefore, we will not discuss this approach further in this study.
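The stretched admissibility condition (4.3) can be sketched as follows. This is an illustrative simplification (the paper's Algorithms 5.6-5.7 use a different practical definition of the diameter and distance); the example shows how the stretch s = (1, 1, γ̄) rejects a cluster pair that the unstretched condition would accept once the longitudinal extent matters:

```python
import numpy as np

def stretched_diam(lo, hi, s):
    """Stretched diameter of a box [lo, hi]: the diagonal length after
    scaling each axis by s (illustrative reading of (4.2))."""
    return np.linalg.norm(s * (hi - lo))

def stretched_dist(lo_t, hi_t, lo_s, hi_s, s):
    """Minimal stretched distance between two axis-aligned boxes:
    the per-axis gap (zero where the intervals overlap), scaled by s."""
    gap = np.maximum(0.0, np.maximum(lo_t - hi_s, lo_s - hi_t))
    return np.linalg.norm(s * gap)

def admissible(lo_t, hi_t, lo_s, hi_s, s, eta):
    """ADMC (4.3): max(diam(St), diam(Ss)) <= eta * dist(St, Ss)."""
    d = max(stretched_diam(lo_t, hi_t, s), stretched_diam(lo_s, hi_s, s))
    return d <= eta * stretched_dist(lo_t, hi_t, lo_s, hi_s, s)

gbar = 50.0
s = np.array([1.0, 1.0, gbar])
# two flat clusters, z-extent 0.1, separated by 4 units along x
lo_t, hi_t = np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.1])
lo_s, hi_s = np.array([5.0, 0.0, 0.0]), np.array([6.0, 1.0, 0.1])

ok_stretched = admissible(lo_t, hi_t, lo_s, hi_s, s, eta=0.5)
ok_plain = admissible(lo_t, hi_t, lo_s, hi_s, np.ones(3), eta=0.5)
```

With γ̄ = 50 the z-extent of 0.1 is stretched to 5, so the stretched condition declares the pair non-admissible, while the unstretched condition would have accepted it.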

The Procedure of FMM
With the FMM kernels introduced in Section 2 and the stretched admissibility condition for the cluster-cluster interaction of the relativistic kernel derived in Section 4, we can formulate an FMM for the calculation of the relativistic space-charge field. The proposed FMM (Algorithm 5.12) can be summarized in the following procedures:

1. A cluster tree is constructed from the particles in the system using a k-d tree with a cardinality-balanced subdivision scheme [16, Section 3]. To control the interpolation error subject to the relativistic kernel (3.2) with the stretched admissibility condition (4.3), the effect of the stretch also needs to be considered in the particle cluster subdivision [16]. Hence, the cluster tree is constructed through Algorithm 5.2 with s = (1, 1, γ̄).

2. In the upward pass (Algorithm 5.5), the multipoles of each leaf cluster are computed by P2M (Algorithm 5.3) and are transferred to the multipoles of its ascendants by M2M (Algorithm 5.4).

3. A list of interaction pairs is determined dynamically by the dual tree traversal [24,25]. The corresponding pseudocode is Algorithm 5.11. In the simulation of the space-charge effect for a particle beam, the roots of the target tree S_t and the source tree S_s are an identical cluster S, i.e., S_t = S_s = S. For a pair of clusters fulfilling the admissibility condition (Algorithm 5.8 with s = (1, 1, γ̄)), the local field of the target cluster is computed by M2L (Algorithm 5.9). For an interaction pair of leaf clusters, the force-fields on the particles in the target cluster are computed by P2P (Algorithm 5.10). In our implementation, instead of using (4.2), we adopt a different definition to compute the stretched diameter and the stretched distance, as illustrated in Algorithm 5.6 and Algorithm 5.7 respectively, because of its simplicity in the practical implementation.

4. In the downward pass (Algorithm 5.15), the cumulative local fields of each cluster are transferred to its descendants by L2L (Algorithm 5.13), and the cumulative local fields of each leaf cluster are transferred to the force-fields on its member particles by L2P (Algorithm 5.14).

A schematic comparison of treecode [20] and FMM is illustrated in Figure 1. In the treecode (Figure 1a), we interpolate the source variable x_j of the kernel function so that we can cluster the source particles and build up the effective masses of each cluster. The force-field on each particle in the target cluster is evaluated by an independent traversal of the source cluster tree, investigating particle-cluster interactions. In the FMM (Figure 1b), we interpolate both the target variable x_i and the source variable x_j of the kernel function so that both the target particles and the source particles can be clustered and the corresponding effective force-fields and masses of each cluster can be built up. The force-field on a target particle is transferred from the effective force-fields of the clusters, which are computed by cluster-cluster interactions through a simultaneous traversal of the target cluster tree and the source cluster tree.
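The cardinality-balanced k-d subdivision of step 1 can be sketched as follows (an illustrative Python version; the package's Algorithm 5.2 is the authoritative one). Each cluster is split at the median index along the axis with the largest stretched extent, so the resulting binary tree is balanced by construction:

```python
import numpy as np

def subdivide(points, idx, s, n0):
    """Cardinality-balanced k-d subdivision (sketch): split the index set at
    the median along the axis with the largest *stretched* extent, stopping
    when a cluster holds at most n0 particles.  Returns a nested (left, right)
    tuple of index arrays."""
    if len(idx) <= n0:
        return idx
    ext = s * (points[idx].max(axis=0) - points[idx].min(axis=0))
    axis = int(np.argmax(ext))                  # stretched-longest axis
    order = idx[np.argsort(points[idx, axis])]
    half = len(order) // 2                      # cardinality-balanced split
    return (subdivide(points, order[:half], s, n0),
            subdivide(points, order[half:], s, n0))

def leaves(tree):
    """Collect the leaf index arrays of the nested tuple tree."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]

rng = np.random.default_rng(2)
pts = rng.uniform(size=(64, 3))
tree = subdivide(pts, np.arange(64), s=np.array([1.0, 1.0, 50.0]), n0=8)
leaf_sizes = [len(leaf) for leaf in leaves(tree)]
```

With 64 particles and N_0 = 8, the subdivision yields a perfectly balanced tree of depth 3 with eight leaves of eight particles each; the stretch s biases the splits toward the z-axis, matching the stretched admissibility condition.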
The pseudocodes for the algorithms are presented with the following global variables:

E_i: electric field experienced by the i-th particle,
B_i: magnetic field experienced by the i-th particle,
γ_{S,ν}: effective Lorentz factor of the macro particle with index ν in cluster S,
p_{S,ν}: effective momentum of the macro particle with index ν in cluster S,
E_{S,ν}: total electric field acting on the macro particle with index ν in cluster S,
B_{S,ν}: total magnetic field acting on the macro particle with index ν in cluster S.
The algorithms described above are implemented as a solver in the Julia programming language [26].The cluster tree constructed with Algorithm 5.2 is implemented using a pointer-based data structure.


Results
To understand the performance of the proposed FMM, we first demonstrate a plot of elapsed time against error for simulations with different FMM parameters in Figure 2. In each simulation, 1.28 × 10^6 particles are randomly uniformly distributed in the unit cube [0, 1]³ and each particle has the same momentum p = (0, 0, p_0) with p_0 = (γ² − 1)^{1/2} and γ = 50. The measured error is the maximal relative error in the electric and magnetic fields,

ε := max_{1≤i≤N} max_{f∈{E,B}} ‖f^t_i − f^b_i‖ / ‖f^b_i‖, (6.1)

where N is the number of particles in the system. The space-charge fields f^t_i and f^b_i experienced by the i-th particle are computed by the FMM and by a brute-force method (Algorithm 5.10 with S_t = S_s = S and i ≠ j), respectively.

Algorithm 5.10: P2P(S_t, S_s): for each i ∈ Ŝ_t and each j ∈ Ŝ_s, accumulate the direct field of source particle j on target particle i.

We can observe that a smaller admissibility parameter η (4.3) leads to higher accuracy (smaller error) but costs more elapsed time. This is because fewer M2Ls at the coarse levels are performed, and each non-admissible pair of clusters at a coarse level can result in many M2Ls at finer levels or P2Ps at the leaf level.
The usage of a higher interpolation degree n leads to higher accuracy but also a higher elapsed time, because more macro particles are used in the calculation of M2L. In Figure 3, the elapsed time of the FMM against the number of particles N is presented. We can see that our FMM approaches the theoretical complexity O(N) as the number of particles N becomes large enough (Figure 3b).
We also perform code profiling on our solver and show the cumulative elapsed time of the six FMM kernels in Figure 4. One can observe that the total elapsed time is dominated by P2P and M2L; this indicates that the routines of these two kernels should be the focus of any further optimization of the solver. Besides, one can also observe a sudden jump in the elapsed times of P2P and M2L at specific numbers of particles N. To understand this phenomenon, we consider a case where the total particle number is equal to a transition value N = N^κ_t := 2^κ N_0, with κ the depth of the cluster tree and N_0 the maximum number of particles in a leaf cluster. If each leaf cluster interacts with at most a constant number of clusters via P2P, the total number of operations to perform P2P can be written as

W_P2P(N^κ_t) = c_p 2^κ N_0², (6.2)

with some constant c_p. When the number of particles N slightly increases with δN → 0 so that N > N^κ_t, the number of particles in each leaf cluster N_f would slightly increase with δN_f → 0 so that N_f > N_0. In this case, each leaf cluster is subdivided into two clusters and the cluster tree gains one more level κ + 1. Therefore, the number of leaf clusters increases from 2^κ to 2^{κ+1} and the value of N_f reduces from N_0 to N_0/2. Thus, the total number of operations for P2P can be written as

W_P2P(N) = c_p 2^{κ+1} (N_0/2 + δN_f(N, N_0))², (6.3)

with N_0/2 ≥ δN_f(N, N_0) > 0. The ratio between (6.3) and (6.2) for different δN_f is

W_P2P(N) / W_P2P(N^κ_t) = 2 (1/2 + δN_f/N_0)² → 1/2 for δN_f → 0, and = 2 for δN_f = N_0/2.

Hence, W_P2P suddenly decreases to one half of W_P2P(N^κ_t) as N increases from N = N^κ_t, and then grows quadratically until it is two times bigger than W_P2P(N^κ_t). This performance model can describe the trend of the elapsed time for P2P. Likewise, we can apply a similar analysis to M2L and write down the corresponding performance model as

W_M2L(N) ≈ c_m (2^{κ+1} − 1) N_r², (6.4)

with some constant c_m. Here, we use the fact that a balanced cluster tree of depth κ contains 2^{κ+1} − 1 clusters in total and the assumption that each cluster interacts with at most a constant number of clusters through M2L. Equation (6.4) shows that W_M2L suddenly increases to two times W_M2L(N^κ_t) as N slightly increases with δN → 0 from N = N^κ_t, and then remains essentially constant until the next transition value N^{κ+1}_t. This performance model successfully explains the behavior of the elapsed time for M2L.

Figure 3: The performance of the FMM. Figure 3a shows the elapsed time used by the FMM to evaluate the space-charge field for increasing numbers of particles N with interpolation degree n = 4, the maximum number of particles in a leaf cluster N_0 = (n + 1)³, and admissibility parameter η = 0.5. Figure 3b shows the elapsed time normalized to N.
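The transition behavior of (6.2)-(6.4) can be reproduced with a few lines (the constants c_p and c_m are set to 1, and the names are ours):

```python
def tree_depth(N, n0):
    """Depth kappa of the balanced tree: subdivide until leaves hold <= n0."""
    kappa = 0
    while N > n0 * 2**kappa:
        kappa += 1
    return kappa

def p2p_cost(N, n0):
    """P2P work model: W ~ (number of leaves) * Nf^2, with Nf = N / leaves."""
    leaves = 2 ** tree_depth(N, n0)
    nf = N / leaves
    return leaves * nf**2

def m2l_cost(N, n0, nr):
    """M2L work model: W ~ (total clusters) * nr^2, clusters = 2^(kappa+1) - 1."""
    kappa = tree_depth(N, n0)
    return (2 ** (kappa + 1) - 1) * nr**2

n0, nr = 128, 125
Nt = 8 * n0                       # a transition value: kappa jumps beyond it
ratio_p2p = p2p_cost(Nt + 1, n0) / p2p_cost(Nt, n0)   # drops to about 1/2
ratio_m2l = m2l_cost(Nt + 1, n0, nr) / m2l_cost(Nt, n0, nr)  # roughly doubles
```

Crossing the transition value halves the P2P work (the quadratic leaf cost wins over the doubled leaf count) and roughly doubles the M2L work (one more tree level of clusters), matching the jumps seen in the profiling.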

GPU Parallelization
As illustrated in Algorithm 5.11, our FMM is based on the dual tree traversal. The dual tree traversal cannot be naively parallelized with data parallelism and might not benefit from GPUs. For one thing, the power of GPUs comes from executing many simple tasks through many threads in SIMD (single instruction, multiple data) fashion; for another, a single GPU core usually has weaker computing power than a single CPU core. Therefore, hybrid CPU-GPU approaches based on the creation of the interaction lists on the CPU were investigated in former works [17,27]. In this approach, the CPU first performs a dual tree traversal to generate interaction lists; then, the GPU handles the interaction of each pair of clusters in the interaction lists. In this study, we follow the work of Wilson et al. [17,21] and discuss a GPU parallelization of our proposed FMM. The CPU-GPU execution of the proposed FMM can be summarized in the 10 steps listed in Algorithm 7.1. H2D and D2H denote the data transfers "host to device" and "device to host", respectively. As shown in Figure 4, the execution of the FMM spends most of the time in the interaction phases (P2P and M2L). Although the parallelization of each FMM kernel is implemented in our application, we will only focus on the implementations of P2P and M2L (i.e., step 6 and step 8 in Algorithm 7.1) in the later discussions.
Algorithm 7.1: The CPU-GPU Execution of the Proposed Parallelized FMM
1 CPU: generate the particle information x_i, p_i and allocate E_i, B_i
2 CPU: create the cluster tree S with x_i, p_i
3 H2D: copy x_i, p_i and S to the device
4 GPU: allocate γ_{S,ν}, p_{S,ν}, E_{S,ν}, B_{S,ν}, E_i, B_i
5 GPU: perform the upward pass with S, x_i, p_i to compute γ_{S,ν}, p_{S,ν}
6 CPU: perform the dual tree traversal on S to build the interaction lists (ITLs)
7 H2D: copy the ITLs to the device
8 GPU: perform P2P, M2L with the ITLs to compute E_{S,ν}, B_{S,ν}
9 GPU: perform the downward pass with E_{S,ν}, B_{S,ν} to compute E_i, B_i
10 D2H: copy E_i, B_i to the host

Array-based Tree Data Structure
In an implementation, it might be straightforward to express the cluster tree with a pointer-based data structure; that is, each node object (a particle cluster in our case) contains data fields and pointers, and the pointers are used to locate the objects of its children nodes. One major disadvantage of a pointer-based tree is that the node objects are not stored in contiguous memory locations, which makes the data transfer between host and device difficult. Hence, it is beneficial to consider an array-based tree in a GPU application. Following the approach in Ref. [28], we use multiple arrays to store node objects with multiple members, one array per member. A member of a node object with index i is located in the i-th element of the corresponding array. Besides, two additional arrays are used to specify the parent index and the children indices of the nodes. Because our cluster tree is constructed through a k-d tree with cardinality-balanced subdivision of particles, it is a balanced binary tree. Hence, we narrow the following discussions to balanced binary trees.
Although several index assignment schemes are possible, we adopt the breadth-first scheme to assign the node indices of the tree. With this scheme, the nodes at level l have the indices {2^l, …, 2^{l+1} − 1}; conversely, a node with index i belongs to the level ⌊log i / log 2⌋. Here, we define that the levels of the cluster tree start from 0 and the node indices start from 1. The breadth-first scheme ensures that the member data of nodes from the same level stay contiguous in an array. This data arrangement is cache-friendly for both the upward pass (P2M and M2M) and the downward pass (L2L and L2P), where the whole member data of the nodes of a level are accessed for the calculation. With this scheme, the parent index and the children pair of indices for a node with index i are

parent(i) = ⌊i/2⌋ and children(i) = (2i, 2i + 1).

A schematic representation of our array-based tree is provided in Figure 5. Since our tree is balanced (due to the cardinality-balanced subdivision scheme), we can preallocate a fixed-size array by knowing that the total number of nodes is 2^{κ+1} − 1 with the tree-depth κ. It is worth noting that the resulting tree might not be balanced if other space-subdivision schemes are adopted. In this case, one may preallocate an array big enough that each node can hold the maximum possible number of children nodes. However, this wastes a large amount of memory on unused nodes and leads to poor load-balancing across multiple ranks when MPI parallelization is considered [29]. One possible workaround is to first create a pointer-based tree, from whose information an array-based tree can then be allocated. This approach is adopted by some solvers, e.g., BaryTree [30].
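The breadth-first index arithmetic can be written down directly (1-based root, levels starting from 0, as in the text; a small Python sketch with illustrative names):

```python
import math

def parent(i):
    """Parent index of node i in the breadth-first scheme (root has index 1)."""
    return i // 2

def children(i):
    """Children pair of node i."""
    return 2 * i, 2 * i + 1

def level(i):
    """Level of node i, with the root at level 0."""
    return int(math.log2(i))

def total_nodes(kappa):
    """Node count of a balanced binary tree of depth kappa: 2^(kappa+1) - 1."""
    return 2 ** (kappa + 1) - 1

# nodes of level l occupy the contiguous index range [2^l, 2^(l+1) - 1]
level3 = [i for i in range(1, total_nodes(4) + 1) if level(i) == 3]
```

The contiguous per-level ranges are exactly what makes the upward and downward passes cache-friendly: a level-by-level sweep touches one contiguous slice of each member array.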

Parallelization of P2P and M2L Kernels
Two lists of interaction pairs, for P2P and M2L respectively, called interaction lists (ITLs), are generated by the execution of the dual tree traversal (Algorithm 9.1) on the CPU and copied to the GPU. The GPU kernels of P2P and M2L are launched in such a way that each interaction pair is handled by one thread block. As P2P and M2L are both similar to a direct summation algorithm, their GPU implementations are straightforward: one thread in the thread block handles the evaluation of the force-field of one target micro/macro particle (P2P/M2L). The GPU parallelization of P2P is illustrated in Figure 7.
In our implementation, we use an additional array to store the indices of all particles in the system, and the indices of the particles of a cluster always stay in a contiguous memory block of this array during the subdivision (cf. Appendix B). However, the member data of the particles (e.g., positions and momenta) of a cluster accessed through this particle-indices array do not necessarily stay contiguous in their arrays (Figure 7a). Thus, the member data of the source particles accessed by the threads are non-contiguously distributed in an array. This can slow down the application because the data access is not cache-friendly and requires frequent accesses to the global memory. One way to remedy this problem is to use the shared memory (Figure 6) provided by CUDA-capable GPUs: we first load each member datum of the particles of a source cluster into shared memory so that the data can be accessed much faster by the threads (Figure 7b). For one thing, each member datum of the source particles then stays in a contiguous block in the shared memory. For another, the shared memory is on-chip and has much lower latency than the global memory. For the M2L implementation, it is not necessary to use shared memory, because the member data of the macro particles of a cluster are already stored in a contiguous memory block. The data access is already cache-friendly, so that the L1 cache of each streaming multiprocessor can be effectively used. A schematic of the M2L implementation is provided in Figure 8. Because the implementations of the other FMM kernels share large similarities with P2P or M2L, we will not go through their details.

Figure 5: An array representation of a balanced binary tree. The node indices are assigned with the breadth-first scheme.
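The gather-then-compute pattern behind the shared-memory staging can be illustrated in NumPy terms (a host-side sketch, not the CUDA kernel itself; the names and the 1/r stand-in kernel are ours). The cluster's sources are addressed through a particle-index array, gathered once into a contiguous buffer, and then each "thread" (here: row) accumulates its sum:

```python
import numpy as np

def p2p_tile(x_t, x_global, idx_s, m_global):
    """Sketch of one P2P 'tile' (the work of one thread block): the source
    particles of a cluster are addressed through a particle-index array, so
    their data are scattered in the global arrays.  We gather them once into a
    contiguous staging buffer -- the role shared memory plays on the GPU --
    and then accumulate a 1/r potential per target (one 'thread' each)."""
    xs = x_global[idx_s]              # staged, contiguous copy of the sources
    ms = m_global[idx_s]
    r = x_t[:, None, :] - xs[None, :, :]      # pairwise separations
    d = np.linalg.norm(r, axis=2)
    return (ms / d).sum(axis=1)               # 1/r potential per target

rng = np.random.default_rng(3)
x_global = rng.uniform(size=(1000, 3))        # global position array
m_global = rng.uniform(size=1000)             # global mass array
idx_s = rng.permutation(1000)[:64]            # scattered cluster member indices
x_t = rng.uniform(low=2.0, high=3.0, size=(32, 3))   # well-separated targets
phi = p2p_tile(x_t, x_global, idx_s, m_global)
```

On the GPU, the one-time gather is done cooperatively by the threads of the block into shared memory, after which every subsequent read of the source data is contiguous and low-latency.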
Figure 8: An illustration of the implementation of the M2L kernel.The data associated with the macro particle of the target cluster and the source cluster are colored with blue and red, respectively.The thread is denoted by a shorthand "t".

Race Conditions in P2P and M2L Kernels
In CUDA applications, a GPU kernel is launched with a grid of thread blocks, and several thread blocks can be executed by the streaming multiprocessors concurrently. In the execution of M2L or P2P, each interaction pair is handled by one thread block, and it is possible that several interaction pairs with the same target index are handled by different thread blocks simultaneously. This can cause a race condition and produce an unexpected result, because the memory associated with a target cluster can be updated by the threads of different thread blocks at the same time (Figure 9a). One common remedy for race conditions is using CUDA's atomic operations [31], which lock a memory location so that only one exclusive thread is allowed to update the value at a time. However, the atomic operations in CUDA only support some primitive types (e.g., Int32 and Float32) and cannot be used in our implementation, because each three-dimensional vector in the physical system (e.g., position, momentum and vector field) is represented by a non-primitive and immutable type SVector{3,T} [32] with three elements of a parametric type T. Due to this immutability, we cannot apply atomic operations to change any element of an SVector{3,T} object even though T is a primitive type (if the type were mutable, we could apply atomic operations to update each element of an SVector{3,T} object). Therefore, in our implementation, we divide the interaction pairs into groups such that each group only contains the pairs with the same target index; during the kernel execution, each group is handled by one thread block. To implement this, we first sort the pairs by the value of the target index, which can be done efficiently with Quicksort. After that, we generate an additional array to indicate the start position of each group of pairs in the sorted ITL, so that this array can be used to dispatch the thread blocks to the groups during the kernel execution (Figure 9b).

Algorithm 9.1: Generation of Interaction Lists by Dual Tree Traversal. Inputs: S_i (particle cluster with node index i), s (stretch factor), η (admissibility parameter); p2p_itl and m2l_itl are initially empty global lists of interaction pairs for P2P and M2L, respectively.
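The sort-and-group workaround can be sketched in a few lines (a host-side Python illustration with our own names; the actual solver operates on the ITL arrays described above):

```python
def group_by_target(pairs):
    """Sketch of the race-condition workaround: sort the interaction list by
    target index and record where each target's group starts, so that each
    group can be dispatched to its own thread block and no two blocks ever
    update the same target concurrently."""
    order = sorted(range(len(pairs)), key=lambda k: pairs[k][0])
    sorted_pairs = [pairs[k] for k in order]
    starts = [0]
    for k in range(1, len(sorted_pairs)):
        if sorted_pairs[k][0] != sorted_pairs[k - 1][0]:
            starts.append(k)                  # a new target group begins here
    starts.append(len(sorted_pairs))          # sentinel: end of the last group
    return sorted_pairs, starts

itl = [(3, 7), (1, 4), (3, 2), (1, 9), (2, 5)]   # (target, source) pairs
sorted_itl, starts = group_by_target(itl)
```

Block b then processes the pairs `sorted_itl[starts[b]:starts[b+1]]`, all of which share one target index, so the target's accumulators are written by exactly one thread block and no atomics are needed.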

Performance
In this study, the package FMM4RBGPU.jl is implemented in the Julia programming language [26] with CUDA.jl [33,34]. The package provides CPU (serial) and GPU solvers for the FMM proposed in this study. The cluster tree used in the package is implemented with the array-based data structure discussed in Section 8.
To understand the performance of our GPU parallelization, we consider a simulation with N = 2.56 × 10^7 particles on the different CPUs and GPUs listed in Table 1. The elapsed times of the simulations (i.e., the execution of Algorithm 5.12) are shown in Figure 10. Our GPU-based solver achieves a speedup between 57 and 197 relative to the result on a single CPU core.

Summary
In this study, we propose an interpolation-based FMM for the computation of the relativistic space-charge field. With the proposed modified admissibility condition, the FMM can be evaluated directly in the lab-frame without the need for a Lorentz transformation. We also present a GPU parallelization of the proposed FMM. Pseudocode for the algorithms is provided, and a corresponding package is developed in the Julia programming language. The proposed algorithms and package can be used to model the space-charge effect in beam dynamics simulations of relativistic beams.

Appendix A. Definition of Cumulative Local Field
Lemma 1. Assume a target point x_i is contained in a sequence of clusters {S^l | l = 0, ..., k}, one per level l, with S^{l+1} ⊂ S^l and S^0 = S. The total force-field of the macro particles from this sequence of clusters transferred to this target point can be calculated by
$$\sum_{j \in S_s} g(x_i, x_j)\, m_j \approx \sum_{\mu} \ell_{S_t,\mu}(x_i) \sum_{\nu} g(\xi_{S_t,\mu}, \xi_{S_s,\nu}) \underbrace{\sum_{j \in S_s} \ell_{S_s,\nu}(x_j)\, m_j}_{=:\,M_{S_s,\nu}} = \sum_{\mu} \ell_{S_t,\mu}(x_i) \underbrace{\sum_{\nu} g(\xi_{S_t,\mu}, \xi_{S_s,\nu})\, M_{S_s,\nu}}_{=:\,L_{S_t,\mu}} = \sum_{\mu} \ell_{S_t,\mu}(x_i)\, L_{S_t,\mu}.$$
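The chain of multipole coefficients M, local coefficients L, and final evaluation appearing above can be checked numerically. The following is an illustrative 1D sketch, not the paper's relativistic kernel or Julia code: it uses a toy kernel g(x, y) = 1/|x - y|, Chebyshev interpolation points, and Lagrange bases, with cluster intervals and degree chosen for demonstration.

```python
# Illustrative P2M -> M2L -> L2P chain of an interpolation-based FMM in 1D.
import math

def cheb_points(a, b, n):
    """n + 1 Chebyshev points of the second kind mapped to [a, b]."""
    return [0.5 * (a + b) + 0.5 * (b - a) * math.cos(math.pi * k / n)
            for k in range(n + 1)]

def lagrange_basis(nodes, x):
    """Values of all Lagrange basis polynomials ell_mu(x) for the given nodes."""
    vals = []
    for mu, xm in enumerate(nodes):
        prod = 1.0
        for nu, xn in enumerate(nodes):
            if nu != mu:
                prod *= (x - xn) / (xm - xn)
        vals.append(prod)
    return vals

def g(x, y):
    return 1.0 / abs(x - y)  # toy interaction kernel

n = 6                                     # interpolation degree
src = [2.0 + 0.1 * j for j in range(10)]  # source particles in [2.0, 2.9]
m = [1.0] * len(src)                      # source weights ("charges")
tgt = [0.05 * i for i in range(5)]        # target points in [0.0, 0.2]
xi_s = cheb_points(2.0, 2.9, n)           # source interpolation points xi_{S_s,nu}
xi_t = cheb_points(0.0, 0.2, n)           # target interpolation points xi_{S_t,mu}

# P2M: multipole coefficients M_nu = sum_j ell_nu(x_j) m_j
M = [0.0] * (n + 1)
for xj, mj in zip(src, m):
    basis = lagrange_basis(xi_s, xj)
    for nu in range(n + 1):
        M[nu] += basis[nu] * mj

# M2L: local coefficients L_mu = sum_nu g(xi_mu, xi_nu) M_nu
L = [sum(g(xi_t[mu], xi_s[nu]) * M[nu] for nu in range(n + 1))
     for mu in range(n + 1)]

# L2P: f(x_i) ~= sum_mu ell_mu(x_i) L_mu, compared against direct summation
for xi in tgt:
    direct = sum(g(xi, xj) * mj for xj, mj in zip(src, m))
    approx = sum(b * l for b, l in zip(lagrange_basis(xi_t, xi), L))
    assert abs(approx - direct) < 1e-3 * direct
```

Because the two clusters are well separated, the degree-6 interpolation already reproduces the directly summed field to well below the 0.1% tolerance asserted above.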
[Figure 1 annotations: the treecode approximates each interaction as g(x_i, x_j) ≈ Σ_ν g(x_i, ξ_{S_s,ν}) ℓ_{S_s,ν}(x_j), with O(N log N) complexity; the FMM approximates it as g(x_i, x_j) ≈ Σ_µ Σ_ν ℓ_{S_t,µ}(x_i) g(ξ_{S_t,µ}, ξ_{S_s,ν}) ℓ_{S_s,ν}(x_j), with O(N) complexity, via the M2L translation L_{S_t,µ} = Σ_ν g(ξ_{S_t,µ}, ξ_{S_s,ν}) M_{S_s,ν}, the L2L translation L_{S,µ} = Σ_{µ'} L_{S',µ'} ℓ_{S',µ'}(ξ_{S,µ}), and the evaluation f(x_i) = Σ_µ ℓ_{S,µ}(x_i) L_{S,µ}.]

Figure 1: A schematic comparison of treecode and FMM. The blue dots and red dots indicate the target and source particles, respectively.

Figure 2: A plot of elapsed time against the error (6.1) for the proposed FMM. Each line represents the result computed with an admissibility parameter η = 0.3, 0.4, or 0.5. Each point on a line represents a simulation with an interpolation degree n = 2, 4, 6, 8, or 10 and the maximum number of particles in a leaf cluster N_0 = (n + 1)^3. The point size is associated with the value of n: data with larger n are drawn with larger points.

Figure 3: The performance of the FMM method. Figure 3a shows the elapsed time used by the FMM to evaluate the space-charge field for increasing numbers of particles N with interpolation degree n = 4, the maximum number of particles in a leaf cluster N_0 = (n + 1)^3, and admissibility parameter η = 0.5. Figure 3b shows the elapsed time normalized to N.

Figure 4: Elapsed time of the six FMM kernels against the increasing number of particles N with interpolation degree n = 4, the maximum number of particles in a leaf cluster N_0 = (n + 1)^3, and admissibility parameter η = 0.5.

Figure 6: Memory hierarchy of CUDA-capable GPUs. SP and SM denote streaming processor and streaming multiprocessor, respectively.
Figure 7: Implementations of P2P kernels (a) without shared memory and (b) with shared memory. The data associated with the target and source particles are colored in blue and red, respectively. A thread is denoted by the shorthand "t".

Figure 9: A schematic of the race condition problem in a naive parallelization of the P2P kernel (the same applies to the M2L kernel) with the ITL. Figure 9a illustrates the original problem. Figure 9b illustrates a solution that divides the interaction pairs into groups with the same target index. A thread block is denoted by the shorthand "tb".

Figure 10: Elapsed times of simulations with N = 2.56 × 10^7 particles on different (a) CPUs and (b) GPUs. The simulations are performed with η = 0.5, n = 4, and N_0 = (n + 1)^3. Each data point is the statistical result of 100 samples.
Algorithm 5.11: Cluster-Cluster Interaction by Dual Tree Traversal
S_t: target particle cluster
S_s: source particle cluster
s: stretch factor
η: admissibility parameter
Function dualtraverseinteract(S_t, S_s, s, η)
    if children(S_t) == ∅ ∧ children(S_s) == ∅ then
        P2P(S_t, S_s)   (Algorithm 5.10)
    else
        isAdmissible = admissible(S_t, S_s, s, η)   (Algorithm 5.8)
        if isAdmissible then
            M2L(S_t, S_s)   (Algorithm 5.9)
        else if children(S_t) == ∅ then
            for S_s' ∈ children(S_s) do dualtraverseinteract(S_t, S_s', s, η) end
        else if children(S_s) == ∅ then
            for S_t' ∈ children(S_t) do dualtraverseinteract(S_t', S_s, s, η) end
        else if diam(S_t, s) > diam(S_s, s) then
            for S_t' ∈ children(S_t) do dualtraverseinteract(S_t', S_s, s, η) end
        else
            for S_s' ∈ children(S_s) do dualtraverseinteract(S_t, S_s', s, η) end
        end
    end
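The control flow of the dual tree traversal can be sketched as follows. This is an illustrative Python sketch on a uniform 1D cluster tree: the admissibility test below is the generic diam/dist criterion and merely stands in for the paper's modified condition (Algorithm 5.8), and the class and function names are hypothetical.

```python
# Sketch of the dual tree traversal logic (cf. Algorithm 5.11) on a 1D tree.

class Cluster:
    def __init__(self, lo, hi, children=()):
        self.lo, self.hi = lo, hi
        self.children = list(children)

def diam(c):
    return c.hi - c.lo

def dist(a, b):
    """Distance between two intervals (0 if they touch or overlap)."""
    return max(a.lo - b.hi, b.lo - a.hi, 0.0)

def admissible(t, s, eta):
    # Stand-in criterion: clusters are well separated relative to their size.
    d = dist(t, s)
    return d > 0.0 and max(diam(t), diam(s)) <= eta * d

def dual_traverse(t, s, eta, p2p_itl, m2l_itl):
    if not t.children and not s.children:
        p2p_itl.append((t, s))          # leaf-leaf pair: direct P2P
    elif admissible(t, s, eta):
        m2l_itl.append((t, s))          # well-separated pair: M2L
    elif not t.children:                # only the source can be refined
        for sc in s.children:
            dual_traverse(t, sc, eta, p2p_itl, m2l_itl)
    elif not s.children:                # only the target can be refined
        for tc in t.children:
            dual_traverse(tc, s, eta, p2p_itl, m2l_itl)
    elif diam(t) > diam(s):             # refine the larger cluster first
        for tc in t.children:
            dual_traverse(tc, s, eta, p2p_itl, m2l_itl)
    else:
        for sc in s.children:
            dual_traverse(t, sc, eta, p2p_itl, m2l_itl)

def build(lo, hi, depth):
    """Uniform binary cluster tree for demonstration."""
    if depth == 0:
        return Cluster(lo, hi)
    mid = 0.5 * (lo + hi)
    return Cluster(lo, hi, [build(lo, mid, depth - 1), build(mid, hi, depth - 1)])

root = build(0.0, 1.0, 3)
p2p_itl, m2l_itl = [], []
dual_traverse(root, root, 0.5, p2p_itl, m2l_itl)
```

By construction, each recursion step refines exactly one of the two clusters, so every leaf-leaf interaction ends up covered exactly once, either directly in the P2P list or through an ancestor pair in the M2L list.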

Table 1: CPUs and GPUs used in the simulations for the performance benchmark.