On Architecture and Performance of Adaptive Mesh Refinement in an Electrostatics Particle-In-Cell Code

This article presents a hardware architecture independent implementation of an adaptive mesh refinement Poisson solver that is integrated into the electrostatic Particle-In-Cell beam dynamics code OPAL. The Poisson solver is solely based on second generation Trilinos packages to ensure the desired hardware portability. Based on the massively parallel framework AMReX, formerly known as BoxLib, the new adaptive mesh refinement interface provides several refinement policies in order to enable precise large-scale neighbouring bunch simulations in high intensity cyclotrons. The solver is validated against a built-in multigrid solver of AMReX and a test problem with an analytical solution. The parallel scalability is presented, as well as an example of a neighbouring bunch simulation that covers the scale of the later anticipated physics simulation.


Introduction
In today's state-of-the-art beam dynamics codes the well-known Particle-In-Cell (PIC) [1] technique has become indispensable. In contrast to direct summation, where the force on a macro particle is obtained by the superposition of the forces due to all others, PIC models discretise a domain and deposit the charge of each macro particle onto a mesh in order to evaluate Coulomb's repulsion. In combination with the efficient parallelisation of such space-charge solvers using MPI (Message Passing Interface) or accelerators such as GPUs (Graphics Processing Units) and the MIC (Many Integrated Core) architecture, e.g. in [2], more realistic large-scale simulations have been enabled. Nevertheless, multi-bunch simulations of high intensity accelerators such as cyclotrons require fine meshes in order to resolve the non-linear effects in the evolution of the beams due to space-charge. A remedy that increases the resolution while reducing the computational effort and memory consumption is adaptive mesh refinement (AMR) [3,4]. In the context of Vlasov-Poisson problems, AMR was applied by [5] using the Eulerian description for the coordinate and velocity space. Examples for a Lagrangian formulation are the Unified Flow Solver (UFS) framework [6] and WarpX [7].
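To make the deposit-to-mesh step concrete, the following is a minimal one-dimensional cloud-in-cell deposition sketch; it is purely illustrative, and the names and data layout are assumptions for this example rather than OPAL's actual deposition kernels.

```cpp
#include <cmath>
#include <vector>

// Minimal 1D cloud-in-cell (CIC) charge deposition sketch.
// Each macro particle of charge q spreads its charge linearly
// onto the two nearest grid nodes of a mesh with spacing h.
void deposit_cic(const std::vector<double>& x,  // particle positions
                 double q,                      // charge per macro particle
                 double h,                      // mesh spacing
                 std::vector<double>& rho)      // charge density on the mesh
{
    for (double xp : x) {
        double s = xp / h;                      // position in grid units
        int    i = static_cast<int>(std::floor(s));
        double w = s - i;                       // fractional distance to node i
        if (i >= 0 && i + 1 < static_cast<int>(rho.size())) {
            rho[i]     += q * (1.0 - w) / h;    // weight to left node
            rho[i + 1] += q * w / h;            // weight to right node
        }
    }
}
```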
The diversity of today's computer architectures and the rapid emergence of new high performance computing technologies have shown that it is becoming increasingly infeasible to design scientific software for one specific hardware architecture only. Recent source code developments therefore reveal a trend towards architecture independent programming, where backend kernels provide the hardware-specific implementation. An example is the family of second generation Trilinos packages that are built on top of the Kokkos library [8,9].
In this article the new AMR capability of the particle accelerator library OPAL (Object-Oriented Parallel Accelerator Library) [10,11] using AMReX [12] is presented, together with the new adaptive multigrid solver based on the algorithm in [13] and the second generation Trilinos packages Tpetra [14], Amesos2 and Belos [15], MueLu [16,17] and Ifpack2 [18]. The new implementation was benchmarked against the Poisson multigrid solver of AMReX and the analytical example of a uniformly charged sphere.
The new AMR feature of OPAL will enable a more detailed study of neighbouring bunch effects, which occur in high intensity cyclotrons due to the low turn separation. Previous investigations such as [19] for the PSI (Paul Scherrer Institut) Ring cyclotron have already shown their existence, but the PIC model was limited in resolution due to the high memory needs. It is hoped that the use of AMR will reduce the memory consumption of the mesh by decreasing the resolution in regions of void while maintaining or even increasing the grid point density at locations of interest, in order to resolve the neighbouring bunch interactions more precisely. In [19] it was shown that the interaction of neighbouring bunches leads to an increase of the tails of the particle distribution (i.e. an increase of the number of halo particles), which usually causes particle losses and therefore activation of the machine. Thus, it is essential to quantify this effect more precisely in order to make predictions for further machine developments with higher beam current.
Besides a short introduction to OPAL in section 2 and AMReX in section 3, section 4 discusses the AMR interface in OPAL. Section 5 explains the multigrid algorithm and its implementation using Trilinos, with a validation in section 6. A comparison of neighbouring bunch simulations with AMR turned either on or off is shown in section 7. The performance of the Poisson solver is discussed in section 8. The last section gives conclusions and an outlook.

The OPAL Library
The Object-Oriented Parallel Accelerator Library (OPAL) [10,11] is an electrostatic PIC (ES-PIC) beam dynamics code for large-scale particle accelerator simulations. Due to its general design, its applications range from high intensity cyclotrons to low intensity proton therapy beamlines [20] with negligible space-charge. Besides the default FFT (Fast Fourier Transform) Poisson solver for periodic and open boundary problems, the built-in SAAMG (Smoothed Aggregation Algebraic Multigrid) solver enables the simulation of accelerators with arbitrary geometries [21]. The time integration relies on the second order Leapfrog, the fourth order Runge-Kutta (RK-4) or a multiple stepping Boris-Buneman method [22].
In beam dynamics the evolution in time t of the density function f(x, p, t) of the charged particle distribution in phase space (x, p) ∈ ℝ⁶ due to electromagnetic fields E(x, t) and B(x, t) is described by the Vlasov (or collisionless Boltzmann) equation

∂f/∂t + v · ∇_x f + q (E + v × B) · ∇_p f = 0,   (1)

with particle charge q and rest mass m_0. The relativistic momentum p = γ m_0 v with Lorentz factor γ and particle velocity v is used together with the coordinate x to specify the state of a particle in the 6D phase space. Both the electric and the magnetic field in Eq. (1) are a sum of an external and an internal, i.e. space-charge, contribution. The external fields are given by RF-cavities and by the magnetic field of the machine. In order to evaluate the electric self-field, the beam is Lorentz transformed into its rest frame, where the magnetic field induced by the motion of the particles is negligible. Thus, the electric self-field is fully described by the electrostatic potential φ(x, t), i.e. E_self = −∇φ, which is computed from Poisson's equation

∇²φ = −ρ/ε_0,

with charge density ρ and vacuum permittivity ε_0. The magnetic self-field is afterwards restored by the inverse Lorentz transform. This quasi-static approximation is known as the Vlasov-Poisson equation.

The AMReX Library
The AMReX library [12] is a descendant of the parallel block-structured adaptive mesh refinement code BoxLib. It is C++ based with an optional Fortran90 interface. Each level is distributed independently among MPI-processes in order to ensure load balancing. The owned data is located at nodes, faces, edges or centres of cells; the cell-centred description is used in the OPAL-AMR implementation.
In order to generate a level l + 1 each cell of the underlying coarser level l has to be marked to get refined or not according to a user-defined criterion. In electrostatic problems natural choices are for example the charge density, the potential strength or the electric field (cf. Sec. 4.2).

Subsequent AMR levels satisfy the relation

h^{l+1}_w = h^l_w / r_w,   (2)

where r_w ∈ ℕ \ {0} is called the refinement ratio and h^l_w specifies the mesh spacing of level l in direction w. A sketch of a refined mesh is given in Fig. 1. By definition, the coarsest level (l = 0) covers the full domain Ω = Ω^0, whereas a fine level is defined by patches that may overlap several coarser grids. In general, for a level l > 0 with n grids g_i the following holds:

Ω^l = ⋃_{i=1}^{n} g_i ⊆ Ω^{l−1}.

Although neighbouring grids are not allowed to overlap, they exchange data at interfaces via ghost cells.

Figure 1: Sketch of a block-structured mesh refinement of a Cartesian grid Ω^0 in 2D with AMReX. Fine levels denoted by Ω^1 and Ω^2 may span multiple coarser grids as indicated. At interfaces among grids of the same level, ghost cells allow exchanging data.

Adaptive Mesh Refinement in the OPAL Library
In order to support both AMR and uniform mesh PIC algorithms, the interface in OPAL is implemented in a lightweight fashion where the AMR library is used as a black box. The AMR functionality is provided by concrete implementations of an abstract base class that defines common requirements on AMR libraries, such as refinement strategies and mesh update functions. The actual AMR implementation is therefore hidden, which allows multiple AMR library dependencies.
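A minimal sketch of what such an abstract base class could look like is given below; the class and method names (AmrBackend, regrid, solvePoisson, and so on) are hypothetical and do not reproduce OPAL's actual interface.

```cpp
#include <memory>

// Hypothetical abstract interface hiding the concrete AMR library.
// A concrete implementation (e.g. one wrapping AMReX) would provide
// the mesh refinement, particle redistribution and field solve.
class AmrBackend {
public:
    virtual ~AmrBackend() = default;

    // Rebuild the level hierarchy according to the chosen refinement policy.
    virtual void regrid(int finestLevel) = 0;

    // Deposit particle charges onto the mesh of every level.
    virtual void depositCharge() = 0;

    // Redistribute particles to the MPI ranks owning their grids.
    virtual void redistributeParticles() = 0;

    // Solve Poisson's equation on the AMR hierarchy.
    virtual void solvePoisson() = 0;
};

// A driver only ever sees the abstract interface:
void computeSpaceCharge(AmrBackend& amr) {
    amr.regrid(/*finestLevel=*/2);
    amr.depositCharge();
    amr.redistributeParticles();
    amr.solvePoisson();
}
```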
In AMR mode the allocation of work among MPI-processes is controlled by AMReX. In contrast to OPAL, where load balancing is optimised w.r.t. the macro particles, AMReX aims to achieve a uniform workload of grid operations. These two parallelisation paradigms are contradictory and cause additional MPI-communication for every PIC operation if both grids and particles are kept evenly distributed among the MPI-processes. In order to reduce the communication effort at the expense of possible particle load imbalances, the developed AMR interface distributes the particles according to their grids. For this purpose a new particle layout manager was created that stores additional AMR-specific attributes, i.e. the level and the grid a particle lives on.
A peculiarity of the PIC model in OPAL is the adjustment of the grid Ω^0 (cf. Fig. 1) to the particle bunch. The mesh that is co-moving with the macro particles adapts dynamically to the dimension of the bunch in the rest frame, keeping the number of grid points per dimension constant, with the consequence of a constantly changing grid spacing. In the longitudinal direction, i.e. the direction of travel, this change includes the correction of the relativistic length contraction in the laboratory frame. In AMR mode, the macro particles are instead mapped to a fixed domain, since the problem geometry has to be predefined in AMReX. This linear transformation includes the Lorentz transform of the particles. Adaptive mesh refinement, particle partitioning and the calculation of the electrostatic potential (cf. Sec. 4.1) are carried out there.
Spurious self-forces on particles close by coarse-fine grid interfaces that occur in AMR due to image charges are corrected by buffer cells as described in [23]. Another solution as depicted in [24] would be the modification of the charge deposition algorithm using a convolution of Green's function for particles near a refinement boundary.

Domain Transform
In order to prevent particles from leaving the predefined domain of the mesh on which the AMR hierarchy is built, they are mapped into a computation space denoted by S_c for the evaluation of Poisson's equation, the repartitioning of the particles among MPI-processes and the mesh refinement. The geometry can therefore be kept at δS_c, where δ specifies a constant box increment in percent that increases the margin of the mesh. In the co-moving frame the natural choice of the computation space is S_c = [−1, 1]³, since the bunch is located around the design trajectory with the reference particle at (x, y, z) = (0, 0, 0). In order to account for an inhomogeneous problem domain, the box dimension of S_c can be adjusted by the user at the beginning. After solving Poisson's equation the electrostatic potential and the electric field have to be rescaled properly. Instead of rescaling the fields at the locations of the particles, this is done directly on the grid, as depicted in Fig. 2. The mapping of the particle coordinates from the co-moving space S_p to the computation space S_c also includes the Lorentz transform.

Particle Coordinate
Let x = (x_0, x_1, x_2) ∈ S_p be the coordinate of a particle in the particle space S_p and let l = (l_0, l_1, l_2) > 0 be defined over all N particles, where N is the number of particles. The transform of an individual particle at position x ∈ S_p into computation space is then given componentwise in terms of l. The mapping of the particle coordinates from S_p to S_c also involves the Lorentz transform.
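As an illustration of such a componentwise mapping, the sketch below assumes that l_w is taken as the maximum absolute particle coordinate in direction w and that a relative margin δ is applied; this particular choice of l and all names are assumptions made for the example, not OPAL's actual definition.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Sketch: map particle coordinates from the co-moving space S_p into the
// computation space S_c = [-1, 1]^3, assuming l_w = max_i |x_w^(i)| and a
// relative margin delta (e.g. 0.05 for 5 %).
std::vector<Vec3> toComputationSpace(const std::vector<Vec3>& particles,
                                     double delta)
{
    Vec3 l = {0.0, 0.0, 0.0};                  // per-direction extent
    for (const Vec3& x : particles)
        for (int w = 0; w < 3; ++w)
            l[w] = std::max(l[w], std::abs(x[w]));

    std::vector<Vec3> mapped(particles.size());
    for (std::size_t i = 0; i < particles.size(); ++i)
        for (int w = 0; w < 3; ++w)
            mapped[i][w] = particles[i][w] / ((1.0 + delta) * l[w]);

    return mapped;
}
```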

Electrostatic Potential
Let φ ∈ S_p be the electrostatic potential in particle space S_p and φ* ∈ S_c the corresponding potential value in computation space S_c; then they are related by the rescaling stated in Eq. (3). Proof. Let the discrete charge density of N particles be described by [25, eq. 1.6] in d dimensions and let the coordinates be transformed as denoted above; the potential can then be written in terms of the scaled coordinates. Therefore, the potential transforms in 3 dimensions as denoted in Eq. (3). In 2 dimensions the electrostatic potential remains unchanged.
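As an orientation for this scaling, the following sketch assumes point charges and a single isotropic scale factor s with x* = x/s; the per-direction scaling actually used in OPAL may differ. For the point-charge potential,

φ*(x*) = (1/(4πε_0)) Σ_i q_i/|x* − x*_i| = (1/(4πε_0)) Σ_i s q_i/|x − x_i| = s φ(x),

i.e. the potential evaluated in the scaled coordinates is s times the physical potential. In 2 dimensions the logarithmic kernel only acquires an additive constant, ln|x* − x*_i| = ln|x − x_i| − ln s, which does not affect the field.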

Electric Field
Let E ∈ S_p be the electric field in particle space S_p and E* ∈ S_c the corresponding electric field vector in computation space S_c; then they are related by the rescaling stated in Eq. (4). Proof. The electric field is the negative gradient of the electrostatic potential. Thus, an additional factor of s⁻¹ contributes to the transformation, which coincides with Eq. (4) in 3 dimensions.
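Continuing the same isotropic sketch (an assumption made only for illustration), the chain rule with x = s x* gives

E*(x*) = −∇_{x*} φ*(x*) = −s² (∇_x φ)(x) = s² E(x)

in 3 dimensions, while in 2 dimensions only the single derivative factor remains, E*(x*) = s E(x), since there the potential only shifts by a constant.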

Adaptive Mesh Refinement Policies
Besides the regrid function, each AMR module implements the charge deposition, the particle-to-core (re-)distribution and various refinement strategies. There are currently six refinement policies available. Most refinement strategies are directly connected to particle properties, since it is desirable to increase the spatial resolution at their locations. Natural choices of refinement criteria are the charge density per cell, the electrostatic potential and the electric field; they are explained in more detail below. Other methods limit the minimum or maximum number of particles within a cell. The last tagging option refines cells based on the momentum of the particles.
While the first three methods refine the mesh based on grid data, the latter methods use particle information directly. All methods apply a user-defined threshold λ in order to control the mesh refinement. This threshold denotes either the minimum charge density per cell, i.e. a cell (i, j, k) on level l is refined if

ρ^l_{i,j,k} ≥ λ,   (5)

or a scale factor λ ∈ [0, 1], in which case every grid cell (i, j, k) on a level l is refined that satisfies

|φ^l_{i,j,k}| ≥ λ max_{i,j,k} |φ^l_{i,j,k}|

in case of the electrostatic potential φ, or the analogous criterion for the electric field components E_w with w ∈ {x, y, z}, respectively. The charge density in Eq. (5) is scaled in order to account for the domain transformation, as previously mentioned and explained in detail in Sec. 4.1. Examples of AMR based on the charge density, potential and electric field with various thresholds are shown in Fig. 3, Fig. 4 and Fig. 5, respectively.
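As an illustration of the scale-factor style of tagging, the sketch below flags every cell whose potential magnitude exceeds the fraction λ of the level maximum; the flat array layout and the function name are assumptions for this example and are unrelated to AMReX's actual tagging interface.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: tag cells of one AMR level for refinement if
// |phi| >= lambda * max|phi| on that level (lambda in [0, 1]).
// 'phi' holds the cell-centred potential in a flat array; 'tags'
// receives true for every cell that should be refined.
void tagByPotential(const std::vector<double>& phi,
                    double lambda,
                    std::vector<bool>& tags)
{
    double maxAbs = 0.0;
    for (double v : phi)
        maxAbs = std::max(maxAbs, std::abs(v));

    tags.assign(phi.size(), false);
    for (std::size_t c = 0; c < phi.size(); ++c)
        tags[c] = std::abs(phi[c]) >= lambda * maxAbs;
}
```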

Adaptive Geometric Multigrid
This section describes the algorithm of the adaptive geometric multigrid (AGMG) according to [27,13] and its implementation with the second generation packages of Trilinos, i.e. Tpetra [14], Amesos2 and Belos [15], MueLu [16,17] and Ifpack2 [18]. A cell-centred implementation is also presented in [28]. In contrast to previous implementations, the one presented here is hardware independent thanks to the aforementioned Trilinos packages that have the Kokkos library [8,9] as backend. Another benefit is the convenient exchange of kernels such as smoothers (e.g. Gauss-Seidel or Jacobi) provided by Ifpack2, or linear solvers of Belos, Amesos2 and MueLu. The sparse matrices and vectors are instances of Tpetra classes.

Coarse-Fine Interface
AGMG is a special variant of the classical geometric multigrid, since not all levels cover the full domain Ω = Ω^0 (cf. Fig. 1). At interfaces between subsequent levels ∂Ω^{l,l+1} the elliptic matching condition (i.e. Neumann and Dirichlet boundary condition) must be satisfied in order to ensure continuity of the solution. This condition is met by flux differencing with mesh spacing h_d and unit vector e_d, where the flux across a cell face is evaluated either on Ω^l or, with d_+, d_− ∈ {1, 2, 3} \ {d} and d_+ ≠ d_−, as the average flux across the boundary at the interface ∂Ω^{l,l+1}; a mesh refinement ratio (cf. Eq. (2)) of r_d = 2 ∀d ∈ {1, 2, 3} is assumed. In case of a cell without adjacent finer cells the flux differencing reduces to the usual second order Laplacian discretisation. An illustration of the stencil of Eq. (6), with fluxes computed either by Eq. (7) or Eq. (8), is shown in Fig. 6. In order to simplify the representation, the example is in 2D with only one coarse-fine interface on the left side. Hence, the corresponding finite difference stencil is given by Eq. (9), where the red nodes indicate ghost cells that need to be interpolated.
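Since Eqs. (6) to (9) themselves are not reproduced above, the following generic sketch of a flux-differenced Laplacian is given for orientation only, assuming a refinement ratio of 2; the precise expressions in [27,13] may differ in sign conventions and indexing. On a face without adjacent finer cells the flux in direction d is

F_{d,±} = ±(φ_{cell ± e_d} − φ_cell)/h_d,

and the Laplacian of a cell is the sum over the directions of (F_{d,+} − F_{d,−})/h_d, which reproduces the standard second order stencil. On a coarse-fine face, the coarse flux is instead replaced by the arithmetic average of the r_{d_+} · r_{d_−} = 4 fine-face fluxes (in 3D), each evaluated with the fine mesh spacing, so that the flux leaving the coarse cell equals the flux entering the fine cells.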
In 3D, ghost cells are expressed in terms of valid coarse and fine cells using a two-step second order Lagrange interpolation in 2D (interpolation cases 0 to 3, cf. Fig. 8). In the corresponding figures, the red crosses specify the intermediate interpolation points using coarse cells, and the first row highlighted in red indicates the example pattern on the right side.

Boundary Conditions
Assuming the beam to be in vacuum and neglecting any beam pipes, the electrostatic potential converges to zero at infinity. In order to resemble this behaviour in finite differences, a common approximation is the Asymptotic Boundary Condition (ABC) presented in [29,30], also denoted as radiative or open boundary condition (BC). The first order approximation ABC-1 is given by

∂φ(r)/∂r + φ(r)/r = 0.

Instead of spherical coordinates, a formulation in Cartesian coordinates is applied for example in [31,32,33]. In spherical coordinates the n-th order approximation (ABC-n) is easily evaluated,

[ ∏_{k=n}^{1} ( ∂/∂r + (2k − 1)/r ) ] φ(r) = 0,

where the product is computed in decreasing order and n ∈ ℕ. At the mesh boundary the condition is imposed in the form

∂φ/∂n + φ/d = 0,   (12)

where d > 0 is an artificial distance. The condition is discretised using central differences. In addition to open BCs according to Eq. (12), the solver presented here allows imposing homogeneous Dirichlet and periodic BCs at the mesh (or physical) boundaries.
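As an illustrative sketch (not taken from the article) of how a condition of the form of Eq. (12) can be discretised with central differences on a cell-centred grid, consider one ghost cell value φ_g outside the boundary and the first interior cell value φ_i, with spacing h normal to the boundary and the boundary value approximated by their average:

(φ_g − φ_i)/h + (φ_g + φ_i)/(2d) = 0  ⇒  φ_g = φ_i (2d − h)/(2d + h),

so the ghost value can be eliminated from the boundary row of the system matrix.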

Algorithm and Implementation Details
Following the notation of [27,13], the full domain Ω is given by

Ω = ⋃_{l=0}^{l_max} Ω^l \ P(Ω^{l+1}),   with Ω^{l_max+1} := ∅,

where the projection P from level l + 1 to level l satisfies P(Ω^{l+1}) ⊂ Ω^l. Due to the properties of the refinement, Poisson's equation is described by

L^comp φ = −ρ/ε_0

on Ω, with the composite Laplacian operator L^comp that considers only non-refined regions of each level.
The full algorithm is illustrated in matrix notation in Alg. 2 to Alg. 3. It performs a V-cycle in the residual correction formulation with pre- and post-smoothing of the error. The iterative procedure stops when the l_p-norm of the residual of all levels, with p ∈ {1, 2, ∞}, is smaller than the corresponding right-hand side norm. Since AMReX assigns the grids to cores independently of the underlying level distribution, the implementation provides special matrices, i.e. B^l_crse and B^l_fine, to handle the coarse-fine interfaces. Thus, each AMR level stores up to ten matrices and four vectors represented by Tpetra objects. These are the composite Laplacian matrix A^l_comp, the Laplacian matrix assuming no finer grids A^l_nf, the coarse boundary matrix B^l_crse and the fine boundary matrix B^l_fine, the restriction and interpolation matrices R^l and I^l, respectively, the gradient matrices G^l_w and the matrix U^l that selects all uncovered cells. The vectors per level are the charge density ρ^l, the electrostatic potential φ^l, the residual r^l and the error e^l. Whereas the vectors span the whole level domain, some matrices only cover a subdomain or carry additional information for the coarse-fine interfaces, as shown in Fig. 9. The coarse and fine boundary matrices encompass one side of the Lagrange interpolation stencil that is completed by the Laplacian matrices. In case of the finest level the composite and no-fine Laplacian matrices coincide.
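The following is a structural sketch of such a V-cycle in residual-correction form; the dense placeholder types stand in for the Tpetra matrices and vectors, the coarse-fine boundary matrices are omitted, and all names are illustrative rather than taken from the implementation.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<std::vector<double>>;   // dense stand-in for sparse Tpetra matrices

Vec apply(const Mat& A, const Vec& x) {          // y = A x
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}

struct Level {
    Mat A, R, I;                                 // level operator, restriction, interpolation
    void smooth(Vec& e, const Vec& r, int sweeps) const {   // damped Jacobi sketch
        for (int s = 0; s < sweeps; ++s) {
            Vec Ae = apply(A, e);
            for (std::size_t i = 0; i < e.size(); ++i)
                e[i] += 0.6 * (r[i] - Ae[i]) / A[i][i];
        }
    }
};

// One V-cycle in residual-correction form: pre-smooth, restrict the residual,
// recurse (the bottom level is solved by extra smoothing in this sketch),
// prolongate the coarse correction and post-smooth.
void vcycle(const std::vector<Level>& lvl, int l, Vec& e, const Vec& r) {
    if (l == 0) { lvl[0].smooth(e, r, 50); return; }         // bottom solve stand-in
    lvl[l].smooth(e, r, 2);                                   // pre-smoothing
    Vec res = apply(lvl[l].A, e);
    for (std::size_t i = 0; i < res.size(); ++i) res[i] = r[i] - res[i];
    Vec rc = apply(lvl[l].R, res);                            // restrict residual
    Vec ec(rc.size(), 0.0);
    vcycle(lvl, l - 1, ec, rc);                               // coarse-grid correction
    Vec corr = apply(lvl[l].I, ec);                           // prolongate
    for (std::size_t i = 0; i < e.size(); ++i) e[i] += corr[i];
    lvl[l].smooth(e, r, 2);                                   // post-smoothing
}
```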

Poisson Solver Validation
The Poisson solver is validated using three different examples. First, the preservation of symmetry is tested. Second, a comparison with the analytical solution of a uniformly charged sphere in free space is shown. Although AMR is not turned on for a single-bunch simulation in the real application, it is nevertheless a good mini-app to check for discontinuities at the coarse-fine interfaces among levels. In a third example the solver is validated against the built-in Poisson multi-level (ML) solver of AMReX, where 11 Gaussian-shaped bunches are placed in a chain using Dirichlet boundary conditions on the computation domain, mimicking a multi-bunch simulation in high intensity cyclotrons as studied in [19]. The last two tests use the charge density to obtain the mesh refinements, with threshold λ = 1 fC/m³ (cf. Eq. (5) in Sec. 4.2).
All line and projection plots are generated with our own extension of the yt package [26]. In the following, a regular PIC model with a uniform single-level mesh, i.e. without refinement, is treated as an AMR simulation with at most level zero.

Symmetry Conservation
In order to check symmetry preservation we initialise a three level problem where each level covers the centred region as shown in Fig. 10a. At each level, the grid cells are assigned the same charge density value, starting at 1 C/m³ on level zero and increasing by 0.5 C/m³ on each subsequent level. Therefore, cutting a line through the centre of the domain yields a perfectly symmetric electrostatic potential and anti-symmetric electric field components mirrored at the centre. According to Fig. 10b, the symmetry is preserved with absolute errors on the order of machine precision and thus negligible.

Uniformly Charged Sphere in Free Space
In this mini-app 10⁶ particles are randomly picked within a sphere of radius R = 5 mm centred at the origin. In order to simplify the comparison to the analytical solution, each particle carries a charge of q = 4πε_0 R² · 10⁻² C. Thus, the peak value of the electric field, attained at the sphere surface r = R, is Nq/(4πε_0 R²) = 10⁴ V/m.
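For reference, the textbook solution for a uniformly charged sphere of total charge Q = Nq and radius R (a standard result, stated here for completeness rather than quoted from the article) reads

E(r) = Q r/(4πε_0 R³) for r ≤ R,   E(r) = Q/(4πε_0 r²) for r > R,
φ(r) = Q (3R² − r²)/(8πε_0 R³) for r ≤ R,   φ(r) = Q/(4πε_0 r) for r > R,

so the electric field indeed peaks at the sphere surface r = R.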

11 Gaussian-Shaped Bunches
In this mini-app the newly implemented solver is compared to the Poisson solver of AMReX.

Neighbouring Bunch Simulation
As initially stated, the new AMR feature in OPAL is mainly developed to study neighbouring bunch simulations (cf. Fig. 3) in high intensity cyclotrons [19], where several bunches are injected and interact due to the small turn separation. The results are compared to the single-level execution, where we use the root mean square (rms) beam size, i.e.

σ_w = √⟨w²⟩,   (13)

and the beam-profile parameter [34], which is a statistical measure to determine the proportion of halo particles in a beam, where ⟨wⁿ⟩ denotes the n-th moment of the particle distribution in coordinate w ∈ {x, y, z}.
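As a small self-contained illustration (names hypothetical, not OPAL code) of evaluating Eq. (13) from the macro particle coordinates, assuming a centred distribution:

```cpp
#include <cmath>
#include <vector>

// Sketch: rms beam size sigma_w = sqrt(<w^2>) for one coordinate w,
// assuming the distribution is centred, i.e. <w> = 0.
double rmsBeamSize(const std::vector<double>& w)
{
    double second = 0.0;                  // second moment <w^2>
    for (double wi : w) second += wi * wi;
    second /= static_cast<double>(w.size());
    return std::sqrt(second);
}
```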
In Fig. 14 and Fig. 16 the rms beam sizes, and in Fig. 15 and Fig. 17 the beam-profile parameters, are compared to the regular PIC model. As observed in Fig. 18 and Tab. 5, however, the average resident set size (RSS), i.e. the amount of occupied physical memory, per MPI-process is on average at least four times smaller with AMR than with the FFT PIC model. All simulations ran with 36 MPI-processes.
Besides the memory benefit, AMR also reduces the time to solution, as visualised in Fig. 19. The detailed timing results of the Poisson solver and the fourth order Runge-Kutta integration for 5 and 11 neighbouring bunches are shown in Tab. 3 and Tab. 4. As expected, the particle integration time grows in proportion to the increase in particles per bunch. The timings indicate that possible particle load imbalances do not harm the performance of the AMR PIC models significantly, since the computation of the potential and electric field consumes at least 87.7 % and 63.2 % of the runtime in case of 10⁵ and 10⁶ particles per bunch, respectively. Overall, the runtime of the shown AMR configurations is at least 62.5 % shorter compared to FFT PIC.
The particle load balancing is quantified as the average number of particles per MPI-process, N_p^s, over all integration steps s, divided by the total number of particles in the simulation, N_t.
In the best case all MPI-processes have N_t/P_t particles during the integration, where P_t is the total number of processes. Fig. 20 and Fig. 21 show the number of cores that deviate from the optimal particle count by within a few percent. The load balancing between 10⁵ and 10⁶ particles per bunch does not differ significantly. A similar observation is made in Fig. 22 and Fig. 23, where the optimal number of grid points among the MPI-processes is evaluated.

Table 5: Average resident set size (RSS) in Gibibyte (GiB) per MPI-process over all 360 integration steps with 5 or 11 neighbouring bunches (nbs) and 10⁵ or 10⁶ macro particles per bunch (ppb).

Performance Benchmark
The performance benchmark is done on the multicore partition of Piz Daint, a supercomputer at the Swiss National Supercomputing Centre (CSCS). The nodes of the multicore partition consist of two Intel Xeon E5-2695 v4 @ 2.10 GHz processors (2 × 18 cores, 64/128 GB RAM) [35]. A benchmark on the GPU partition of Piz Daint confirmed the hardware portability of the new solver. However, the data transfer between CPU (Central Processing Unit) and GPU as well as the launching of single GPU kernels for each matrix-vector or matrix-matrix operation of Tpetra turned out to be a performance bottleneck, which is why the performance study presents a CPU benchmark only.
The test initialises 11 Gaussian-shaped bunches as described in Sec. 6.3 with 10⁶ macro particles of charge 0.1 fC per bunch. The Poisson problem is solved 100 times on a three level hierarchy (two levels of refinement) with 576³ grid points on level zero. The particles are randomly displaced within [−10⁻³, 10⁻³] after every iteration. This represents a realistic setup for beam dynamics simulations, since the particle distribution in the bunch rest frame changes only marginally from one integration time step to the next. Therefore, it is not necessary to re-mesh the AMR hierarchy and thus rebuild the matrices after every time step, which gives rise to computational savings. The optimal update frequency of the grids for neighbouring bunch simulations is currently unknown and is not a subject of this article. Nevertheless, the computational saving is shown with two strong scalings. The first benchmark updates the AMR hierarchy after every computation of the electrostatic potential, while the second performs a regrid step only after every tenth solve. Since a constant workload per MPI-process, as required for a fair weak scaling, cannot be guaranteed during upscaling, the presented benchmark consists of strong scaling only.
The blue line in Fig. 24 shows the total solver time of the 100 executions. As indicated by the corresponding timings, a considerable share of this time is spent in the matrix setup at high core counts. However, the setup time can easily be reduced with a lower regrid frequency, as previously mentioned: the matrix setup cost in the second timing is only 14 % of the setup cost observed in the first timing. Furthermore, the use of an algebraic multigrid solver for the linear system of equations on the bottom level is not an optimal choice. More suitable would be a geometric multigrid that keeps the structure of the problem, which is planned for a future paper.
The parallel efficiency of the strong scaling of Fig. 24 is shown in Fig. 25.

Conclusion and Outlook
In this article we presented the new adaptive mesh refinement capability of the open-source beam dynamics code OPAL, which has been enhanced with AMReX. The new feature is supplemented with a hardware architecture independent implementation of a multigrid Poisson solver based on second generation Trilinos packages. Besides an artificial problem illustrating symmetry preservation and a comparison with an analytically solvable problem, the Poisson solver was validated against the built-in AMReX multi-level solver. Although the structure of the mesh is lost when going to the matrix representation, the solver shows good scalability on CPUs with a parallel efficiency between 50 % and 60 % on 14'400 cores, depending on the AMR regrid frequency.
The timings indicate that the matrix setup and the bottom linear system solver require 77 % of the total solver time. The former can be reduced by updating the mesh less frequently. The latter might be decreased by replacing the smoothed aggregation algebraic multigrid solver of MueLu with a structured aggregation procedure, a real geometric multigrid solver or a FFT solver which is subject to future research. Thanks to the hardware portability the solver runs on any backend that is supported by Kokkos. However, due to single kernel launches for each matrix-vector operation, the solver is not yet competitive on GPUs.
A small example of the PSI Ring cyclotron demonstrated the benefit of AMR over regular PIC models w.r.t. time to solution and memory consumption at a given accuracy. The presented benchmark shows that AMR requires about four times less memory and that the time to solution is at least 62.5 % shorter than for a comparable simulation with the integrated FFT solver of OPAL. Therefore, the technique of adaptive mesh refinement will enable large-scale multi-bunch simulations in high intensity cyclotrons at higher grid resolution, in order to more accurately quantify the effect of radially neighbouring bunches on halo formation and evolution.