Parallel-in-Time Integration of the Landau-Lifshitz-Gilbert Equation with the Parallel Full Approximation Scheme in Space and Time

Speeding up computationally expensive problems, such as numerical simulations of large micromagnetic systems, requires efficient use of parallel computing infrastructures. While parallelism across space is commonly exploited in micromagnetics, this strategy performs poorly once a minimum number of degrees of freedom per core is reached. We use magnum.pi, a finite-element micromagnetic simulation software, to investigate the Parallel Full Approximation Scheme in Space and Time (PFASST) as a space- and time-parallel solver for the Landau-Lifshitz-Gilbert equation (LLG). Numerical experiments show that PFASST enables efficient parallel-in-time integration of the LLG, significantly improving the speedup gained from using a given number of cores as well as allowing the code to scale beyond spatial limits.


I. INTRODUCTION
Computationally expensive simulations of large micromagnetic systems can be significantly accelerated by carrying out computations in parallel.Parallelism in space via distributed meshes or grids is commonly exploited in micromagnetic simulation codes, for instance in CPU software such as OOMMF [1], mumax [2] or GPU codes like magnum.np[3] or BORIS [4].However, the efficiency of this strategy drops sharply when the number of degrees of freedom per core gets too small, as the communication overhead begins to dominate the computational cost.Therefore, alternative avenues for parallelism such as computing multiple timesteps in parallel must be considered.This has already been attempted almost 60 years ago [5] but only recently gained greatly increased interest.As a result, a large variety of so-called parallel-in-time integration algorithms mostly based on Parareal [6] have been developed in the last decade [7], such as Multigrid Reduction In Time (MGRIT) [8] or the Parallel Full Approximation Scheme in Space and Time (PFASST) [9], to name a few.The aforementioned algorithms have been applied successfully to a variety of problems [10][11][12][13][14], enhancing parallel efficiency and allowing codes to scale much further than with spatial parallelism alone.Nonetheless, parallel-in-time integration has not yet been adapted to micromagnetics.In this work, we therefore investigate parallel-in-time integration of the Landau-Lifshitz-Gilbert equation (LLG) with the PFASST algorithm to gain an understanding of the potential benefits and difficulties.After explaining the numerical prerequisites and implementation, numerical experiments giving a first impression on the performance of the scheme are presented and discussed.

II. METHODS
In the following, basic concepts of micromagnetism and the prerequisites for the implemented PFASST scheme will be briefly explained, followed by a description of the algorithm itself.For a more comprehensive overview of micromagnetics refer to [15] and for more information regarding parallel-in-time integration, consider visiting the parallel-in-time.orgwebsite.

A. Micromagnetics and the LLG
The micromagnetic model strikes a balance between accuracy and efficiency by assuming that the quantum mechanical exchange interaction between neighboring spins is strong enough to align magnetic moments in parallel at distances much greater than the underlying lattice constant, a characteristic property of ferromagnets.As a result, the magnetization of a material can be written as a continuous, vector-valued function M (x) = M s m(x) where ∥m(x)∥ = 1 and M s refers to the constant saturation magnetization.Energy contributions influencing the magnetization are then expressed as integrals involving the magnetization function.Notable contributions to the total energy E are for example the Zeeman energy originating from an external field, the demagnetizing energy resulting from the field caused by the material itself, the aforementioned exchange energy, or effects such as anisotropy.Furthermore, E is used to define the so-called effective field H eff , where µ 0 ≈ 1.2566 × 10 −6 N/A 2 refers to the vacuum permeability and δ denotes the functional derivative.While the total energy can be minimized directly to find equilibrium configurations, the importance of the effective field lies in modeling magnetization dynamics when inserted into the LLG, where α refers to a dimensionless damping parameter modeling the relaxation of the magnetization toward the arXiv:2310.11819v1[physics.comp-ph]18 Oct 2023 field lines, whereas the reduced gyromagnetic ratio γ = γ e /µ 0 ≈ 2.2128 × 10 5 m/As determines the frequency of the precession around the field lines.Spatial discretization of the LLG is commonly done with either finite-difference (FDM) [1][2][3][4] or finite-element (FEM) [16][17][18] methods.In this work, the finite-element micromagnetic solver magnum.pi[16] is used.
It is based on firedrake [19], which enables distributing meshes and therefore spatial parallelism for finite-element discretizations of partial differential equations.
By default, time integration in magnum.pi is done with the SUNDIALS CVODE [20] module.
More specifically, a preconditioned, adaptive second-order Backward Differentiation Formula (BDF) timestepper described in [21] is employed.This fully implicit time integration method is suited well for dealing with the high stiffness introduced by the exchange interaction.

B. Spectral Deferred Corrections
The fundamental ingredient of PFASST is a class of timesteppers referred to as Spectral Deferred Correction (SDC) methods [22].Such timesteppers solve ordinary differential equations by iteratively correcting an initial approximation using only low order methods.For the sake of brevity, the method is commonly derived using a preconditioned Richardson iteration viewpoint [23,24] as follows: Consider an initial value problem in integral form, to be integrated until time t b .Constructing an SDC scheme carrying out this integration begins with the choice of outer quadrature nodes (t 1 , . . .,t m ) on the interval [t a ,t b ] according to some quadrature rule (e.g.Gauss-Legendre).Then, integrals of the right-hand side f to these outer quadrature nodes are approximated with the integration matrix Q ∈ R m×m : In other words, row i of matrix Q contains quadrature weights that integrate from t a to t i .Obtaining said weights usually involves an interpolation step [25].Inserting the approximations (4) into (3) gives rise to the so-called collocation problem which, when solved, gives an approximation to the solution at times t i , u = (u 1 , . . ., u m ) where u i = u(t i ).Here, u a = (u a , . . ., u a ) ∈ R m refers to a vector filled with the initial value, I m to the m-dimensional identity operator and F [u] = ( f (t 1 , u 1 ), . . ., f (t m , u m )) ∈ R m evaluates the right-hand side at the outer quadrature nodes.SDC methods now solve the nonlinear equation ( 5) by applying Richardson iteration with a preconditioner Q P : The preconditioner Q P usually is a simpler, lower triangular integration rule for the subintervals, e.g.backward integration.If the preconditioner is strictly lower triangular (e.g.forward integration), solving a (non)linear system for the iteration is not required and the scheme is explicit.Computing one iteration of ( 6) is often called an SDC sweep.Carrying out multiple sweeps sequentially improves the accuracy of the approximation.In fact, one SDC sweep raises the order of accuracy in the timestep by one or more (depending on the accuracy of the preconditioner), up until the order of the outer quadrature [26].Furthermore, using two separate preconditioners, one of which lower triangular (Q P im ) and one strictly lower triangular (Q P ex ) allows for constructing an implicit-explicit splitting (IMEX) version of SDC.With a correspondingly split right-hand side F = F ex + F im , the sweeps then look as follows: Multigrid methods outsource computationally expensive tasks to a coarser discretization where they can be carried out in a cheap manner.This can be used to accelerate convergence for SDC timesteppers and is used in PFASST to facilitate a cheap estimation of the solution at future timesteps.In so-called Multilevel SDC (MLSDC) methods, the Full Approximation Scheme (FAS) multigrid correction technique [27] is used to obtain a correction for the right-hand side integrals at the coarse level, where T c f refers to a restriction operator which can transform onto both a coarser mesh and time discretization (i.e.fewer nodes in the outer quadrature).Furthermore, Q represents the integration matrix on the coarse level.Adding τ to the right-hand side of ( 6) for the coarse level SDC sweeps enhances accuracy, often described as computing the solution at the resolution of the coarse grid but with the accuracy of the fine grid.Overall, a complete MLSDC iteration involves the following: (8) and carry out corrected SDC sweeps on the coarse level to obtain ũk+1/2 .

Prolong the correction to the fine level with the prolongation operator
and apply it to the fine level approximation: u k+1/2 = u k + δ k . 5. Carry out fine level SDC sweeps to obtain u k+1 .
Aside from the two-level example above, this can also be carried out on three or more levels [9,25].The structure of the MLSDC algorithm shown above is suited well for introducing parallelism in time.All that remains to do is strategically passing intermediate results forward, to be used as initial values for processes that calculate future timestep values.Consider Fig. 1, which depicts two-level MLSDC iterations that take place on multiple processes (P 0 , . . ., P 3 ), computing multiple timesteps ([t 0 ,t 1 ], [t 1 ,t 2 ], . . . ) concurrently.Aside from an initialization procedure at the beginning, the only additions to the MLSDC algorithm are passing the coarse level result forward to the next process upon completing coarse sweeps and passing fine level results forward to the next iteration on the next process.Regarding the initialization procedure, the repeated coarse sweeps shown in the Fig.

are a common choice.
There are many variants of PFASST with different communication patterns or other modifications.The pySDC library [29] for example, which we used to prototype the PFASST and SDC solvers implemented in this work, adopts a multigrid viewpoint.While this enables a more general implementation of the algorithm, it also decreases efficiency.
Here however we use the classical two-level variant depicted in Fig. 1.When compared to serial SDC, the parallel efficiency of this variant is at most K s /K p [30], where K s refers to the number of iterations required to solve the problem with SDC, whereas K p refers to the required PFASST iterations to solve the problem with the same accuracy.This is an improvement over the upper efficiency bound of 1/K for Parareal [6], where K is the number of iterations required for convergence.

E. Implementation
The first of many choices in implementing a PFASST scheme is choosing a numerical quadrature for the SDC sweeps.
Usually, quadrature types with the highest possible order given a certain number of nodes are chosen.Gauss-Legendre quadrature for instance integrates polynomials of up to degree 2 n − 1 exactly with n nodes and is optimal in this sense.On the other hand, Gauss-Legendre nodes on a given interval do not include the endpoint of said interval, where the desired solution might lie.Hence, an additional routine for calculating the endpoint is needed to integrate ODEs with such a quadrature type.This issue can be avoided entirely with quadratures that include the endpoint, such as Gauss-Radau quadrature which integrates polynomials of up to degree 2 n − 2 exactly with n nodes.In this work, both Gauss-Legendre and Gauss-Radau nodes are used, where the solution at the endpoint is computed by integrating with the outer quadrature weights if necessary.
Next, the preconditioners must be assembled.Common choices here are forward integration as an explicit preconditioner and the so-called LU-trick [31], as an implicit preconditioner which performs favorably for stiff problems.Both preconditioning methods are used in the code, because the LLG is not only stiff, but also a good candidate for implicit-explicit time integration: On the one hand, the stiffness is mainly introduced by the short-ranged exchange interaction, which is relatively cheap to compute and can thus be integrated implicitly without issue.On the other hand, the LLG permits a very straightforward splitting of the right-hand side into the contributions of individual field terms, as it is linear in the effective field.In the implemented code, users can choose which field terms are integrated explicitly or implicitly.One major benefit of using such an IMEX time integrator for the LLG is that repeated evaluations of the expensive demagnetizing field can be avoided.Therefore, we always use the IMEX variant of SDC in the numerical experiments of section III, where only the exchange field is evaluated implicitly.
The nonlinear system that arises in the SDC sweeps is solved with line search Newton iteration, more specifically the PETSc [32] implementation with the Flexible Generalized Minimum Residual Method (FGMRES) as the linear solver.The necessary right-hand side and Jacobian evaluation routines are already part of magnum.pi.Together with firedrake, this enables spatial parallelism.
Finally, transfer operators between the coarse and fine space/time discretizations must be implemented.In this work, the firedrake.multigridmodule is used for spatial transfers, as it allows for restriction and prolongation between distributed meshes.For the temporal domain, restriction and prolongation are realized with interpolation between the coarse and fine level quadrature nodes.
Regarding error control, serial SDC timesteppers have access to a very simple error estimate: Since the order of accuracy of the solution increases with every sweep, subtracting the solution values of the previous and current iteration gives an estimate for the local truncation error of the previous iterate.The implemented stepsize adaptation is then straightforward: If the error tolerance is not met, the stepsize is decreased by 50%, whereas the stepsize is increased by 25% after 10 accepted steps.Additionally, as long as the error exceeds 80% of the tolerance, the stepsize is not increased further.As an alternative to the error estimate, the maximum residual of the collocation problem (5) can be considered for error control.This is particularly useful for comparing the accuracy of SDC and PFASST results.
Lastly, it is worth mentioning that the implemented PFASST controller is very similar in structure to the classical two-level variant in Fig. 1, with some modifications for increased efficiency.For instance, reevaluating function values when returning from the coarse to the fine grid can be avoided by also correcting the function values in the same way as the approximate solution.Also, CPU idle times are avoided as much as possible by immediately starting computation on a new timestep once a timestep has converged, instead of waiting for a complete set of timesteps to converge and restarting the PFASST algorithm to compute the next set.

III. NUMERICAL EXPERIMENTS
In the following, we demonstrate the efficiency of the implemented timestepping methods by simulating various micromagnetic processes.
Exploring not only novel parallel timestepping methods but also new kinds of High Performance Computing (HPC) infrastructure, simulations are carried out on the Google Cloud.Here, the HPC toolkit [33] is used to first build a magnum.piVirtual Machine (VM) image and then create single VM instances or even entire SLURM [34] clusters on the cloud.In this work, the c2d machine type is used for all experiments.More specifically, unless stated otherwise simulations are carried out on c2d-highcpu-32 nodes with 16 physical processor cores each, where the compact placement rule for better inter-node communication has been enabled.For another example of micromagnetics code running on cloud services, consider [35].

A. Standard Problem 4
PFASST relies entirely on SDC methods for timestepping.As such, investigating the performance of serial SDC timestepping methods gives valuable insight about the potential performance of a PFASST scheme.This preliminary experiment examines SDC as a serial timestepping method for the LLG by comparing it to the BDF method normally employed in magnum.pi.Additionally, a first small-scale time-parallel run is attempted.As parallelism in space is not of interest here, we simulate the well-known µMAG Standard Problem #4 (SP4) [36].
First, we simulate SP4 with magnum.pi'sbuilt-in adaptive BDF solver and the adaptive SDC timestepper at varying tolerances.These runs are compared to a reference solution obtained with BDF at very strict (tol = 10 −10 ) error tolerances (Tab.I, Fig. 2).
On average, the SDC runs are slightly more accurate than comparable BDF runs but also significantly faster, especially  at high accuracy.The speedup is explained as follows: First, a large fraction of the computationally costly stray field evaluations is avoided by employing the IMEX variant of SDC instead of the fully implicit BDF method.Additionally, two Gauss-Legendre nodes were used for the high-accuracy SDC run.This results in a third order timestepping method as opposed to the second order BDF method.Next, we determine the error of individual SDC iterations on the first timestep of SP4.To this end, we use five Gauss-Legendre nodes for SDC and a BDF reference solution with even stricter tolerances (tol = 10 −14 ).Plotting the errors obtained this way against the iteration number confirms the theoretically predicted exponential convergence (Fig. 3).Also, one can clearly see how convergence stops after the accuracy of the quadrature is exhausted and how the error estimate gives a good approximation of the error until then.
Finally, since the PFASST controller does not yet support adaptive stepsizes, we use a fixed stepsize of dt = 10 −12 s and a relative residual tolerance of restol = 10 −6 to compute SP4 parallel-in-time on up to three cores (Tab.II).Here, the fine level mesh is the same as in the previous experiments with 15000 elements in total, whereas the coarse level mesh consists of only 600 elements.The residual tolerance is chosen such that the results roughly match the accuracy of  the SDC/BDF runs in Tab.I.As can be seen in Tab.II, using PFASST on a single core (i.e.MLSDC) is initially slower than using adaptive SDC.This relatively large slowdown is primarily caused by the already very small fine mesh size, which makes the additional overhead introduced in PFASST much more noticeable and also diminishes the speedup gained from transferring to the coarse level.Despite this, a parallel speedup of up to 2.16 is achieved with three cores in time.Note that the average relative error increases with the number of cores in time.However, this is expected and originates from the fact that one single PFASST iteration can significantly reduce the relative residual.Therefore, while all results meet the same residual tolerance, they vary in accuracy depending on how much the relative residual decreases below the tolerance on the final iteration of every timestep.It is also worth mentioning that time parallelism on more than a few cores is likely not very efficient for problems like SP4, unless very high accuracies are desired.This is due to the fact that the magnetization changes quickly in SP4, making future timesteps hard to predict.Consequently, parallel efficiency decreases disproportionately as more and more cores are used and the time interval that is computed in parallel grows.Nonetheless, for the specified residual tolerance the time-parallel PFASST runs already offer a small speedup over serial runs.

B. Airbox Hysteresis
Unlike the switching process in the previous subsection, magnetization generally changes slowly when computing hysteresis loops.This should improve parallel efficiency and allow using more cores for time parallelism.Therefore, we investigate this problem type next: We set a 1000 nm × 1000 nm × 20 nm rectangular domain to a constant initial magnetization of m = (0.01, 0, −1) T before normalization and then reverse the magnetization by simulating the LLG for t end = 2 ns with an external field of the form where e z refers to the unit vector in z-direction.The material parameters are the same as in SP4, except for the damping where we set α = 1.In other words, we compute one branch of a hysteresis loop in the direction of the hard axis.Because the hybrid finite-element-boundary-element (FEM-BEM) demagnetizing field solver in magnum.pidoes not yet support spatial parallelism, this is done with the airbox solver module and a surrounding airbox of the dimensions 4000 nm × 4000 nm × 400 nm.We use a mesh with 33033 elements on the coarse PFASST level and, after refinement by bisection, 264264 elements on the fine level.With seven Gauss-Radau nodes on the fine and coarse level, a fixed stepsize of dt = 2 × 10 −11 s and a relative residual tolerance of restol = 10 −8 , we obtain the runtimes in Fig. 4 on a single node.To ensure correctness, we also carry out a BDF run with tol = 10 −7 and plot the hysteresis curve against selected PFASST results (Fig. 5).Notably, up until eight cores, pure time parallelism is more efficient than pure spatial parallelism.This is most likely a consequence of the still relatively small mesh used in the simulations.On the other hand, the efficiency of the time-parallel approach drops quite quickly, such that using 16  cores in time actually slows the computation down compared to using only eight cores.Finally, Fig. 6 compares the scaling of space-parallel only runs with space-and time-parallel runs on one (solid lines) or several (dashed lines) nodes.Overall, by using 128 cores in total the runtime could be reduced by a factor of 20.56, where the speedup of 5.75 with spatial parallelism alone is enhanced by an additional factor of 3.58 through time parallelism.One should however note that the residual tolerance used in this numerical experiment is relatively strict for simulating such a simple problem, which increases the parallel efficiency of PFASST.On top of that, the coarse level resolution is already quite good, which similarly benefits PFASST.In any case, the results show that, unlike spatial parallelism, parallel-in-time integration can be used efficiently across multiple nodes.This observation is especially interesting in the context of micromagnetic GPU codes, where time parallelism could offer a straightforward way to implement multi-GPU extensions.

C. Magnonic damping
Finally, we consider a problem of oscillatory nature, where high accuracies are often desired.In [37], a spin pumping model for magnum.pibased on spin-diffusion is implemented and used to simulate the propagation of a magnon damped by spin torque.More specifically, the magnon is driven by an oscillating magnetic field on one side of a 1251 nm × 10 nm × 10 nm waveguide with uniaxial anisotropy along the y-axis and damped by a spin sink placed on the waveguide starting from y = 251 nm.Furthermore, the simulation is carried out with zero damping (α = 0) and with the demagnetizing field disabled.We simulate this magnonic damping process for 3 ns with the same waveguide structure and material parameters.As for the solver parameters, the PFASST runs utilize a coarse mesh with 84436 elements and a fine mesh resulting from bisection with 675488 elements alongside three Gauss-Radau nodes, a fixed stepsize of dt = 5 × 10 −12 s as well as a relative residual tolerance of rtol = 10 −5 .Because the spin torque calculations are relatively expensive on a mesh of this size, we now use the c2d-highcpu-112 machine type on the Google Cloud, with 56 physical cores per node.Once again, we first simulate on a single node and then extend the parallelism by computing timesteps in parallel on up to four nodes, for a total maximum of 224 cores.This results in the strong scaling depicted in Fig. 7.
At 56 cores, a speedup of 17.56 is achieved with pure spatial parallelism and an additional speedup of 2.08 when using four such nodes parallel-in-time for a total speedup of 36.51.As comparing the performance of PFASST to BDF is also interesting here, we compute a BDF solution with similar accuracy.For this experiment, the quantity of interest is the magnitude of the transversal magnetization component along the y-axis, m ⊥ , at t = 2.4 ns (Fig. 8).Therefore, we use this quantity to estimate the accuracies of our solutions by comparing to a reference BDF run with tol = 10 −7 (Tab.III).
As explained in subsection III A, although all PFASST runs meet the same residual tolerance, the error generally increases when integrating parallel-in-time.It is also worth noting again here that all field terms except for the exchange field are integrated explicitly when using PFASST.This includes the expensive spin torque evaluations and already roughly halves the runtime compared to BDF on a single core.Lastly, as opposed to the hysteresis simulation in the previous chapter, the error tolerances chosen for this numerical experiment are on the lower end.Higher accuracies are generally desired in this context, where PFASST should perform even better overall.

IV. CONCLUSIONS AND DISCUSSION
In this work, we implemented IMEX SDC/PFASST timesteppers for the LLG as extensions to the micromagnetic FEM solver magnum.pi.First numerical experiments lead to the following observations: IMEX Spectral Deferred Correction method timesteppers are competitive with fully implicit preconditioned BDF(2) methods at moderate accuracies.At higher accuracies, the flexibility of SDC methods enables straightforward construction of high-order IMEX timestepping methods for greatly improved performance.
Parallel-in-time integration of the LLG with decent efficiencies has been achieved across a variety of problem classes (Tab.IV), where the code generally performs better when high accuracies are desired.Combining space-and time-parallelism leads to better efficiencies than using only space-parallelism -in particular, parallel-in-time integration is also suited well for parallelism across multiple nodes, where it was successfully used to scale beyond spatial limits.
These preliminary results are best understood as a proof of concept for parallel-in-time integration in micromagnetic simulations, since many optimizations and improvements are yet to be made.For instance, further optimizing the linear and nonlinear solvers leads to faster convergence and enables larger timesteps, which would especially increase performance at low to moderate accuracies.Furthermore, enhancements to SDC such as linearly implicit formulations or variants like inexact SDC [38] or Krylov-SDC [24] can yield better convergence.Options for additionally parallelizing SDC "across the method" are also available [39] and the adaptiveness of the algorithm could be improved by increasing/decreasing the number of quadrature nodes automatically as needed.Similarly, adaptive timesteps could be implemented for the PFASST controller.There are also many other parallel-in-time integration algorithms besides PFASST that have yet to be tested in the context of integrating the LLG.Finally, extending GPU simulation codes to multiple GPUs with time parallelism seems promising due to the relatively straightforward implementation.This and the above improvements will be investigated in the future.

FIG. 3 :
FIG. 3: Relative L 2 -norm error e rel and error estimate at SDC iteration n on the first timestep of SP4.
FIG.4: PFASST timings for the hysteresis problem, solved with n s cores in space and n t parallel timesteps on a single node.

FIG. 5 :
FIG. 5: Hysteresis curves computed with BDF and PFASST,where n t time intervals are computed in parallel and H ext refers to the z-component of the external field.

FIG. 8 :
FIG.8: Transversal magnetization component magnitude along the y axis m ⊥ at t = 2.4 ns in the magnon damping simulations for selected time integration methods and tolerances.
[28]level PFASST iterations on four processes visualized with pfasst-tikz[28].End results of coarse (blue) and fine (red) sweeps are sent forward to the next process (arrows), where they are used as initial values.
D. Parallel Full Approximation Scheme in Space and Time

TABLE I :
Tolerances tol, runtimes t, average relative L 2 -norm errors e rel and number of external n ext , demagnetizing n dem as well as exchange n exc field evaluations for SP4.

TABLE III :
Number of cores used in space n s , parallel timesteps n t , runtimes t and average relative errors e rel of m ⊥ for the magnon damping problem.

TABLE IV :
Maximum speedups S over serial PFASST in the numerical experiments and corresponding parallel efficiencies η = S/n tot , where n tot refers to the total number of cores used.