Performance of the BGSDC integrator for computing fast ion trajectories in nuclear fusion reactors

Modelling neutral beam injection (NBI) in fusion reactors requires computing the trajectories of large ensembles of particles. Slowing down times of up to one second combined with nanosecond time steps make these simulations computationally very costly. This paper explores the performance of BGSDC, a new numerical time stepping method, for tracking ions generated by NBI in the DIII-D and JET reactors. BGSDC is a high-order generalisation of the Boris method, combining it with spectral deferred corrections and the Generalized Minimal Residual method GMRES. Without collision modelling, where numerical drift can be quantified accurately, we find that BGSDC can deliver higher quality particle distributions than the standard Boris integrator at comparable cost or comparable distributions at lower cost. With collision models, quantifying accuracy is difficult but we show that BGSDC produces stable distributions at larger time steps than Boris.


Introduction
Computer simulations are a critical tool for the design and operation of fusion reactors [1]. Numerical computation of the trajectories of fast ions generated, for example, from neutral beam injection 1 is important to minimize wall loads and energy loss from ions escaping magnetic confinement [2]. At their core, particle trackers integrate the Lorentz equationṡ for a large ensemble of particles and use the resulting trajectories to generate statistical quantities like wall load. Note that fast ions in fusion reactors are typically not fast enough to require consideration of relativistic effects and so the non-relativistic Lorentz equations can be used.
Fast ions interact with the plasma and deposit energy, thus heating it. To compute steady-state distributions, trajectories need to be computed until the fast ions loose their energy through collisions and thermalize. This slowing down time over which an ensemble of trajectories needs to be calculated depends on the energy of the ions when injected and the bulk plasma parameters. For DIII-D, the required simulation time is around 0.1 s. For the larger JET, it is around 1 s. Because resolving gyro effects requires time steps of the order of nanoseconds, simulations involve many millions of time step, leading to substantial computational cost and thus long solution times. Simulating a full ensemble of fast ions in JET until thermalization takes several days, despite making use of modern GPU clusters. Some models, for example NUBEAM [3,4] or OFMC [5], use a guiding centre approximation where gyro effects are neglected or only included once a particle is near the walls. This allows taking larger time step and thus reduces computational cost. However, taking orbit effects into consideration is important to generate realistic wall loads [6]. Other particle tracking codes, for example ASCOT [7] and LOCUST [8], compute the full equations with gyro effects. For numerical methods, the only choices available are the Boris integrator [9], that is, a velocity Verlet or a Leapfrog scheme with a geometrical trick to resolve the implicit dependence on velocity, 2 and the Cash-Karp Runge-Kutta 4(5) method [10]. LOCUST also features a mover based on Strang splitting, which is very similar to Boris but avoids some issues around loss of accuracy in cylindrical coordinates [11]. There seems to be agreement that due to its significant energy drift, RK4(5) is less accurate than Boris and so the latter is typically used. Although Boris is surprisingly efficient [12], long solution times remain an issue: the time for injected fast particles to reduce their energy to that of the thermal plasma via collisions, called the collisional slowing down time, is on the order of seconds for a reactor like ITER. Resolving cyclotron dynamics requires timesteps on the order of nanoseconds so that simulations require computing around 10 9 steps. Even on CCFE's dedicated GPU cluster, such a simulation takes days to complete. While other algorithms have been proposed for solving the Lorentz equations [13][14][15][16][17][18], they have so far not been adopted for fast ion tracking.
Tretiak and Ruprecht [19] introduce BGSDC, a new high-order algorithm for solving the Lorentz equations based on a combination of spectral deferred corrections, a Generalized Minimal Residual (GMRES) iteration and the Boris integrator. They show improvements in computational performance over Boris for individual particle trajectories in a mirror trap as well as trapped and passing particles in a Solev'ev equilibrium. This paper extends their results by demonstrating performance for realistic test cases, studying practically relevant, aggregate quantities to assess quality of solutions instead of individual trajectories. Instead of analytically given idealized magnetic fields, we use toroidally axisymmetric free-boundary equilibria that are more closely representative of the actual conditions on DIII-D and JET (i.e. with a poloidal flux map that is expressed as a bicubic spline so as to generate a divertor X-point). In the non-collisional case, we track particle ensembles corresponding to real neutral beam injection (NBI) scenarios [2,20] and assess statistical distribution of particle drift in the ensemble in contrast to exploring accuracy of individual trajectories. We show that BGSDC delivers better distributions with smaller mean and standard deviation than Boris and, if means and standard deviations of the order of micrometres are desired, can deliver them with less computational work. Then, we investigate the case where models for collisions of fast ions with the plasma are active. While the stochastic nature of these results makes a quantitative assessment with respect to work versus precision difficult, we demonstrate that BGSDC can deliver similar results as Boris with larger time steps. Delivering additional evidence to show that these gains are enough to also deliver computational gains in the collisional case will require a more developed framework to quantify accuracy for the generated statistical distributions as well as substantially more computational results and is left for future work. Our results demonstrate that there can be a computational benefit from using BGSDC or other particle trackers with order of accuracy higher than two, in particular for high fidelity simulations with tight accuracy requirements.
All results were generated with the BGSDC implementation [21] that is now part of the LOCUST code and the ITER Integrated Modelling & Analysis Suite (IMAS) [22].

Methodology
The GMRES-accelerated Boris-SDC algorithm (or BGSDC for short) is described in detail by Tretiak et al. [19] whereas its predecessor, without GMRES-acceleration, was introduced by Winkel et al. [23]. This original Boris-SDC combined the Boris algorithm 2 ASCOT seems to use the Verlet variant whereas LOCUST uses the Leapfrog version.
introduced by Boris in 1970 [9] with spectral deferred corrections (SDC) introduced by Dutt et al. in 2000 [24] to generalize it to higher order. BGSDC incorporates a GMRES-based convergence accelerator for SDC, introduced by Huang et al. in 2006 [25] for first order problems, that leads to faster convergence to the collocation solution and thus improved long-term energy stability. All simulation results reported in this paper were generated using a BGSDC implementation in the GPU-accelerated LOCUST particle tracking code developed at CCFE.

Collocation methods
In essence, BGSDC is an iterative solver for a collocation method. Over one time step [t n , t n+1 ], the Lorentz Eqs. (1) written in integral form become with x 0 , v 0 being approximations of position and velocity at time t n brought forward from the previous step. Note that we consider only the case where the electric and magnetic field vary in space but not in time, but the method can easily be generalized to the non-autonomous case. Typically, the Boris method is based on a Leapfrog discretization of the differential form of the Lorentz Eqs. (1). Here, we use a variant based on the Velocity-Verlet method instead see Tretiak et al. for a discussion [19]. A geometric trick was introduced by Boris in 1970 [9] to avoid the seemingly implicit dependence on v n+1 . Birdsall and Langdon give a detailed description [26,. Collocation methods discretize the integral form (2) using numerical quadrature with nodes t n ≤ τ 1 < · · · < τ M ≤ t n+1 instead of the differential form. Approximate values x new and v new at time t n+1 are computed via where q m are quadrature weights while x m , v m are approximations to position and velocity at the quadrature nodes τ m . These are equivalent to the stages of a collocation method, an implicit Runge-Kutta method with a dense Butcher tableau [27,Theorem 7.7], and can be computed or approximated by solving the stage equations Depending on the choice of quadrature nodes, collocation methods can have a range of desirable properties. When applied to Hamiltonian systems in canonical coordinates, they are symplectic for Gauss-Legendre nodes [27, Theorem 16.5] and symmetric for Gauss-Lobatto nodes [28,Theorem 8.9] as well as A-stable [29,Theorem 12.9]. Symplectic integration in non-canonical coordinates is extremely challenging. By combining the stages x m , v m into one vector U, the stage Eqs. (5) can compactly be written as a nonlinear system with Q containing quadrature weights, and U 0 containing repeated entries of x 0 and v 0 . See Winkel et al. for details [23].

Boris-GMRES-SDC (BGSDC)
Spectral deferred corrections use an iteration based on a low order method to solve Eq. (6). For first order problems, this is typically an implicit or implicit-explicit Euler [24,30]. For second order problems, velocity-Verlet integration or, in the special case of the Lorentz equations, the Boris integrator can be used [23]. If the collocation problem (6) is linear and thus, in a slight abuse of notation, reads one can apply a preconditioned GMRES iteration instead of SDC to solve it [25]. The key point is that GMRES does not require assembly of the system matrix but only a function that applies I − QF to a given vector U. This simply means computing Eq. (5) for m = 1, . . . , M. However, to improve performance, it is advisable to use a preconditioner. In the GMRES interpretation, the low order base method (Euler in the first order case, Boris in the second order case) can be understood as a preconditioner, modifying the original collocation system (6) to where Q ∆t has a block structure with each block being a lower triangular matrix [19]. To apply GMRES to the preconditioned problem, a second function is required that can solve for a given right-hand side b [31]. Because of the special structure of Q ∆ , this can be done in a sweep-like fashion, very similar to the sweeps in the original variant of SDC. When using M = 3 nodes, solving Eq. (10) amounts to a block-wise solve of ⎤ ⎦ via first computing then and finally using Boris' trick to compute the velocities [19]. Note that for a single particle, where F = f, with given initial values x n , v n at the beginning of the time step, setting b 1 : ) and b 4 := v n + ∆τ 1 2 F(x n , v n ) means that (11) becomes identical to (3) with ∆t = ∆τ 1 . For specific choices of b 2 , b 3 , b 5 and b 6 , the steps (11), (12) and (13) correspond to a total of three Boris steps (3) with step sizes ∆τ 1 , ∆τ 2 and ∆τ 3 .
However, to apply the GMRES procedure, the algorithm is modified to accept any input for b. This procedure is straightforward to generalize for any number of nodes M.
For nonlinear collocation problems, it is possible to apply a Newton iteration and use GMRES-SDC to solve the linear inner problems. For the case where the nonlinearity is due to the magnetic field depending on x, this was found not to be competitive, as the gains in convergence speed could not offset the significant added computational cost [19]. Instead, we linearize the collocation problem by freezing the magnetic field after the first sweep provides values x 0 m for m = 1, . . . , M by approximating . . .
that is, the electric and magnetic field are evaluated at the approximate position x 0 m from the first sweep instead of using the positions x m in the argument U. The linearized system is preconditioned using (I − Q ∆ F lin ) and solved with GMRES. The result is then corrected to account for the nonlinearity by a small number of computationally cheap discrete Picard iterations If the fields only change weakly over a single time step, the solution of the linearized collocation problem will be very close to the nonlinear solution and the Picard iteration converges quickly in very few iterations. BGSDC(k, l) refers to this combination of k GMRES-SDC iterations for the linearized collocation problem followed by l Picard iterations for the fully nonlinear problem.
Note that ideally we would want to fix an overall tolerance and have the algorithm set k and l to reach this tolerance. However, the combination of GMRES iterations for the linearized collocation problem with nonlinear Picard iteration makes this difficult. Currently we do not have a good heuristic how to set k for a given nonlinear tolerance so that the Picard iteration converges reasonably fast but the GMRES iteration does not significantly oversolve the problem. Finding a way to do that efficiently will require a more comprehensive numerical exploration and is left for future work.

LOCUST-GPU implementation
LOCUST stands for ''Lorentz Orbit Code for Use in Stellarators and Tokamaks'' [8,32]. It is a software platform for solving efficiently the Lorentz equations of motion in the presence of a collision operator that models small angle Coulomb scattering. Kernels are instantiated upon Nvidia GPU hardware as PGI CUDA Fortran kernels which allows millions of Monte Carlo markers to be tracked in a typical simulation. LOCUST is being used extensively to design plasma facing components, e.g. for MAST-U [33], and for studying the physics of fast ion distribution and loss due to Neoclassical Tearing Modes and the application of ELM Control Coils for ITER. It is part of the EUROfusion HALO programme, where it is used to study the implications of finite gyroradius effects, e.g. for Toroidal Alfven Eigenmode (TAE) activity in Tokamak plasma [34].

Numerical results
We compare BGSDC against the Leapfrog-based staggered Boris method for tracking fast ions generated by NBI in both the DIII-D and Joint European Torus (JET) tokamak. For both reactors, we study the deterministic case without models for the collision of fast ions with the plasma and the stochastic case with collision models. In the collionless case, we launch a particle ensemble corresponding to a NBI shot and use the mean and standard deviation of the numerical drift distribution for individual runs to compare the quality of both integrators. In the presence of collisions with the plasma, stochasticity makes particle drift measurements of trajectories meaningless and a very large number of markers would be required to get statistically converged results. Therefore, to focus on the impact of the numerical error and minimize the spread of ensembles, we instead study many trajectory realizations for the same particle with identical initial conditions and analyse the effect that time step size has on the resulting distribution function profiles.
Orbit drift. LOCUST measures orbit drift by recording the maximum and minimum major radius at midplane crossing of the gyro-centre for the first 10% of simulation time. At this time, all orbits will have passed through the midplane multiple times. The gyro-centre is computed for a uniform field, which leads to the presence of some residual gyro-oscillation. This is the reason that multiple midplane traversals are necessary to find reliable values for minimum and maximum major radius. The procedure is then repeated for the last 10% of simulation time and drifts are computed by taking the differences of the measured maximum and minimum major radius.

Results for the DIII-D tokamak: non-collisional case
We compare Boris and BGSDC for full orbit simulations in a 2D equilibrium for the DIII-D NBI shot sketched in Fig. 1. This setup has been used, for example, in fast ion transport studies for applied 3D magnetic perturbations in DIII-D [35]. The fast particle birth list contains 10 000 unique Deuterium ions derived from an NBI source that is injected counter to the plasma current. If more particles are simulated, LOCUST creates copies of those particles to fill in information. To match the GPU architecture that LOCUST runs on, we model 64 × 64 × 16 = 65 636 particles, since we run 64 threads per block, with 64 blocks per grid on 16 GPUs. Fig. 2 shows the radial (left) and vertical (right) component of the DIII-D magnetic field. LOCUST uses bicubic splines of the poloidal flux to calculate field components from the equilibrium data.
The duration of all runs is 100 ms and, for simplicity, we do not include plasma facing components (PFC) in our simulations and instead stop particles when they hit the wall of the equilibrium computational box. While inclusion of PFC makes the detection of where a particle hits the wall much more difficult, they do not impact performance of the numerical integration scheme.
Numerical drift. Fig. 3 shows the distribution of numerical drift for four integrators available in LOCUST at final time t end = 100 ms for a particle ensemble in the non-collisional case. All integrators use a fixed time step of ∆t = 1 ns and the same initial conditions for particles. The High Field Side (HFS) 3 drift distributions are shown in the upper two graphs and σ indicates the standard deviation. The lower graphs show the Low Field Side (LFS) drift distributions. The Strang splitting mover and classical Boris deliver comparable accuracy at both sides of the plasma volume. However, the Strang splitting mover requires 3 Because the magnetic field in a Tokamak changes radially like 1/R with R being the major radius, the outboard side with large major radius is often referred to as Low Field Side (LFS) while the interior side, with small R, is called High Field Side (HFS).  more computational work than Boris and requires about 5% longer runtime. RK4 shows unsatisfactory result with very large σ , most likely due to its inherent energy drift. Since the Boris integrator performs better than SSM or RK4 we use it as a baseline to compare BGSDC against. BGSDC(1,3) is more accurate than Boris and Strang and for both LFS and HFS delivers a σ two orders of magnitude smaller. Of course, it also requires substantially more computational work per time step. Below we will demonstrate that this additional work per time step can be offset by using larger time step sizes and thus computing fewer steps, leading to computational gains when values of σ of around 1 µm or below are required.
Charged particles in a magnetic field can experience mirror trapping when entering regions of higher magnetic field strength as a consequence of conservation of the magnetic moment. These particles do not complete full orbits but instead experience a bounce point where their parallel velocity is zero and become trapped in the low field side. Because these sharp turns and resulting large gradients can affect performance of the numerical integrator [19], we show results separately for passing and trapped particles. Fig. 4 shows the resulting numerical orbital drift of each particle against the major radius R at the end of simulation at t end = 100 ms. The left figure is for the Boris method with time step 1 ns while the right shows results from BGSDC(2,6) with a time step of 5 ns. The upper graphs show trapped particles, the lower graphs passing particles where particles at the LFS are indicated by black markers and particles at HFS are marked in red. Note that the y-axes in the two lower graphs are scaled differently. For trapped particles, Boris and BGSDC deliver comparable results with particle drifts of up to 5 cm. For passing particles, BGSDC drifts are about an order of magnitude smaller than Boris, despite a five times larger time step.  particle cloud. The smaller σ is, the narrower and thus better is the distribution. At HFS, the standard deviation for BGSDC(2,6) is about nine times smaller than for Boris while at LFS it is a factor of around six more accurate.
Magnetic moment. While the Boris integrator conserves magnetic moment by design, for BGSDC this is only guaranteed for the underlying collocation. For small iteration numbers, nonconservation of magnetic moment cannot be ruled out by theory. However, in numerical experiments we confirmed that both Boris and BGSDC(2,6) conserve magnetic moment up to machine precision.
Accuracy versus time step size. Fig. 6 shows how the mean (upper) and standard deviation σ (lower) for the HFS (left) and LFS (right) changes with time step size. The higher order of accuracy of both BGSDC(1,3) and BGSDC(2,6) translates into a faster drop of σ with ∆t and thus a steeper slope of the curve. This demonstrates that there is an accuracy gain from using integrators of higher orders, not only for individual trajectories but also for the full ensemble. It also demonstrates that BGSDC can deliver a particle distribution with comparable accuracy with a much larger time step than Boris. For a value of σ = 1 µm for example, Boris requires a time step of ∆t = 0.1 ns whereas for BGSDC a twenty times larger step size of ∆t = 2 ns suffices.
Work-precision. BGSDC(2,6) with M = 3 Gauss-Lobatto nodes means we perform k = 2 iterations of GMRES-SDC on the linearized problem and l = 6 Picard iterations [19]. In contrast to Boris, which only needs one, BGSDC(2,6) therefore requires 19 right hand side (RHS) evaluations per time step. However, as shown in Figs. 4 and 5, it can provide comparable accuracy with a much larger ∆t and will thus require fewer time steps.
If the number of steps is sufficiently smaller to compensate for the increased work per step, BGSDC will be computationally more efficient.
To pinpoint where BGSDC starts to deliver computational gains, Fig. 7 (left) shows the number of total RHS evaluations required by Boris to reach some standard deviation σ divided by the number of evaluations required by BGSDC (1,3). A value of S u > 1 means that BGSDC(1,3) requires fewer evaluations and thus less computational work whereas S u < 1 means Boris requires fewer evaluations. We assume perfect second order convergence of Boris method to interpolate between data points, which is line with the behaviour we see in Fig. 6. For the HFS, the break-even point for BGSDC is for accuracies slightly below σ = 1 µm and gains increase sharply from there. A distribution with HFS σ = 0.5 µm will require only about 1/1.5 ≈ 66% as many evaluations as when using Boris. Gains are less pronounced when looking at LFS values, where the break-even point seems to be slightly above σ = 0.1 µm. Fig. 7 (right) shows the total runtime for LOCUST required to reach a given σ . To match a HFS of σ = 10 0 , BGSDC(1,3) requires 5980 s whereas Boris needs 13 169 s. This corresponds to a speedup of 2.2, which is somewhat better than what we predict from counting RHS evaluations.

Results for the DIII-D tokamak: collisional case
LOCUST uses the same Fokker-Planck collision operator to model collisions as ASCOT. A detailed description is provided by Hirvijoki et al. [36]. In the collisional case, we launch one particle 131072 times from the same position. The duration of each run is t end = 100 ms, the same as in the collisionless case, and we use the same 2D plasma equilibrium. We compare distribution functions generated by Boris and BGSDC. Distribution functions have four arguments: velocity, coordinates R and Z and pitch angle L. To visualize them in 2D plots, we fix velocity, R and Z and plot against the pitch angle L. Fig. 8 shows 1D cross-sections of the distribution function F (v, L, R, Z ) against pitch angle L = v ⊥ /v ∥ in a phase box for a trapped particle on the top with fixed velocity v = 1.8 × 10 6 m/s, major radius R = 2 m, axis Z = 0.5 m and for a passing particle at the bottom with v = 1.8 × 10 6 m/s, R = 1.9 m and Z = 0.3 m.
Boris and BGSDC both converge to the same distribution. However, BGSDC produces a stable distribution for larger time steps than Boris. For trapped orbits, the distributions provided by BGSDC with 1 ns and 10 ns time steps are very similar whereas Boris shows visible differences at a time step of 10 ns. For passing orbits, both methods require slightly smaller time steps to produce stable distributions. BGSDC has converged for a step size of 5 ns whereas Boris requires a smaller step size of 2 ns.

Results for the JET tokamak: non-collisional case
For JET, full orbit simulations were carried out in a 2D equilibrium representing typical JET conditions. We use an artificial source of particles uniformly distributed in pitch angle and inside the plasma volume at a fixed injection energy. All runs last 1 s and use 65536 unique particles. As in the DIII-D runs, we do not include PFCs.
Numerical drift.  Work-precision. Fig. 11 (left) shows again the ratio of RHS evaluations for Boris compared to BGSDC required to produce a distribution with a given σ . Results are very similar to those for DIII-D. BGSDC becomes competitive in the range between σ = 0.1 µm and σ = 1 µm for the HFS and LFS with better gains for HFS. Fig. 11 (right) shows runtimes needed to reach a given σ .
For JET, the break-even point comes earlier than for DIII-D. BGSDC is faster than Boris for distributions with σ around 3 µm or less for both HFS and LFS.

Results for the JET tokamak: collisional case
For the collisional case we launch one particle 131072 times from the same position and track it until simulated time 1 s. Fig. 12 shows 2D profiles of the distribution function in a minor cross-section of the JET tokamak. BGSDC with ∆t = 2 and ∆t = 5 ns as well as Boris with ∆t = 2 ns deliver comparable profiles.
Profiles computed with Boris with time steps 5 ns and larger show noticeable differences.   The shown results suggest that, for the collisional case, BGSDC(2,6) can use a time step about 5 to 10 times larger than the Boris integrator and produce results that look very similar. This is not enough to offset the fact that it needs 19 instead of one right hand side evaluations. If the distributions are to be reproduced with better accuracy (probably around 1% or better), we would expect BGSDC to become competitive, as was observed in the collisionless case. However, as we do not currently have a useful quantitative measure for accuracy for the collisional case, this analysis is left for future work.

Conclusions
We compare the performance of the BGSDC time stepping algorithm against the standard Boris integrator when computing trajectories of fast ions generated by neutral beam injection into a fusion reactor. Numerical examples are shown for the DIII-D and JET reactors and include non-collisional and collisional models with plasma. For the non-collisional case, the model is deterministic and we can use the standard deviation of the numerical drift distribution across all particles as a metric for solution quality. The results in this paper show that for both DIII-D and JET, BGSDC produces a substantially narrower and thus better distribution with smaller mean and standard deviation than Boris at the same time step. BGSDC can provide computational gains compared to classical Boris when trajectory simulations need to be of high precision with means and standard deviations of the order of 1 µm or lower. A recent study of toroidal Alfvén eigenmodes (TAE) with LOCUST required time steps of less than 0.5 ns to   deliver converged results [34]. Although no drift distribution was computed in these simulations, in our setup this would correspond to a standard deviation of around σ ≈ 10 µm. For studies involving higher frequencies and sharper spatial features, even smaller time steps will be required. Therefore, while most simulations do not yet require such highly accurate distribution, this will likely change in the near future. Detailed quantitative assessment of particle trajectories in the presence of collisions is left for future work, as there are no mathematical tools established in the fusion community to define a precise notion of accuracy in these cases. However, we show that BGSDC and Boris converge to similar distributions and that BGSDC provides stable distributions at larger time steps than Boris.