The design and verification of Mumax3

We report on the design, verification and performance of mumax3, an open-source GPU-accelerated micromagnetic simulation program. This software solves the time- and space-dependent magnetization evolution in nano- to microscale magnets using a finite-difference discretization. Its high performance and low memory requirements allow large-scale simulations to be performed in limited time and on inexpensive hardware. We verified each part of the software by comparing results to analytical values where available and to micromagnetic standard problems. mumax3 also offers specific extensions like MFM image generation, a moving simulation window, edge charge removal and material grains.


A. Material Regions
MuMax 3 employs a finite difference (FD) discretization of space using a 2D or 3D grid of orthorhombic cells. Volumetric quantities, like the magnetization and the effective field, are treated at the center of each cell. Interfacial quantities, like the exchange coupling, are considered on the faces between the cells (Fig. 1).
In order to conserve memory, space-dependent material parameters are not explicitly stored per cell. Instead, each cell is assigned a region index between 0 and 255, where different region indices represent different materials. The actual material parameters are stored in 256-element look-up tables, indexed by the cell's region index.
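The region-based parameter storage described above can be sketched in a few lines. This is a minimal illustration of the idea, not mumax3's internals; all array names and shapes are our own.

```python
import numpy as np

# One byte per cell: the region index. Material parameters live in
# 256-entry look-up tables indexed by region, never stored per cell.
Nx, Ny, Nz = 4, 4, 1
regions = np.zeros((Nz, Ny, Nx), dtype=np.uint8)
regions[:, :, Nx // 2:] = 1          # right half of the grid: material 1

Msat_table = np.zeros(256)           # saturation magnetization per region (A/m)
Msat_table[0] = 800e3
Msat_table[1] = 400e3

# A per-cell parameter field is recovered on the fly by a gather:
Msat = Msat_table[regions]
```

The gather `Msat_table[regions]` is the only per-cell cost; the full-resolution parameter array never needs to be kept in memory.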
Interfacial material parameters like the exchange coupling are stored in a triangular matrix, indexed by the region numbers of the interacting cells. This allows arbitrary exchange coupling between all pairs of materials (Section III C).
Time-dependent parameters In addition to the region-wise space dependence, material parameters in each region can also be time-dependent, given by one arbitrary function of time per region.
Excitations like the externally applied field or the electrical current density can be set region- and time-wise in the same way as material parameters. Additionally, they can have an arbitrary number of extra terms of the form f(t) × g(x, y, z), where f(t) is any function of time multiplied by a continuously varying spatial profile g(x, y, z). This makes it possible to model smooth time- and space-dependent excitations like, e.g., an antenna's RF field or an AC electrical current.

B. Geometry
MuMax 3 uses constructive solid geometry to define the shape of the magnet and the material regions inside it. Any shape is represented by a function f(x, y, z) that returns true when (x, y, z) lies inside the shape and false otherwise. E.g., a sphere of radius r is represented by the function x² + y² + z² ≤ r². Shapes can be rotated, translated, scaled and combined with boolean operations like AND, OR and XOR. This allows complex, parametrized geometries to be defined programmatically. E.g., Fig. 2 shows the magnetization in the logical OR of an ellipsoid and a cuboid.
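The shape-as-predicate idea above can be sketched directly: shapes are boolean functions of position, and CSG operators combine them. The function names below are illustrative, not mumax3's scripting API.

```python
# Shapes as predicates f(x, y, z) -> bool, combined with boolean operators.
def sphere(r):
    return lambda x, y, z: x**2 + y**2 + z**2 <= r**2

def cuboid(a, b, c):
    return lambda x, y, z: abs(x) <= a/2 and abs(y) <= b/2 and abs(z) <= c/2

def union(f, g):          # logical OR of two shapes
    return lambda x, y, z: f(x, y, z) or g(x, y, z)

def translate(f, dx, dy, dz):
    return lambda x, y, z: f(x - dx, y - dy, z - dz)

# a sphere OR a cuboid shifted 80 nm along x
shape = union(sphere(50e-9),
              translate(cuboid(100e-9, 40e-9, 20e-9), 80e-9, 0, 0))
```

Rasterizing the geometry then amounts to evaluating `shape` at each cell center.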
Web interface MuMax 3 provides a web-based HTML5 user interface, which allows simulations to be inspected and controlled from within a web browser, whether they run locally or remotely. Simulations may also be constructed and run entirely from within the web GUI. In either case, an input file corresponding to the user's clicks is generated, which can later be used to repeat the simulation in an automated fashion.
Data format MuMax 3 uses OOMMF's "OVF" data format for input and output of all space-dependent quantities, which allows existing tools to be leveraged. Additionally, a tool is provided to convert the output to several other data formats like ParaView's VTK [3], gnuplot [1], comma-separated values (CSV) and Python-compatible JSON, and to image formats like PNG, JPG and GIF. Finally, the output is compatible with the 3D rendering software MuView, contributed by Graham Rowlands [31].

III. DYNAMICAL TERMS
MuMax 3 calculates the evolution of the reduced magnetization m(r, t), which has unit length. In what follows, the dependence on time and space is not written down explicitly. We refer to the time derivative of m as the torque τ (units 1/s):

∂m/∂t = τ.

τ has three contributions:
• the Landau-Lifshitz torque τ_LL (Section III A)
• the Zhang-Li spin-transfer torque τ_ZL (Section III G)
• the Slonczewski spin-transfer torque τ_SL (Section III H)

A. Landau-Lifshitz torque
MuMax 3 uses the following explicit form for the Landau-Lifshitz torque [18, 20]:

τ_LL = γ_LL 1/(1+α²) (m × B_eff + α m × (m × B_eff)),

with γ_LL the gyromagnetic ratio (rad/Ts), α the dimensionless damping parameter and B_eff the effective field (T). The default value of γ_LL can be overridden by the user. B_eff has the following contributions:
• the externally applied field B_ext
• the magnetostatic field B_demag (Section III B)
• the Heisenberg exchange field B_exch (Section III C)
• the Dzyaloshinskii-Moriya exchange field B_dm (Section III D)
• the magneto-crystalline anisotropy field B_anis (Section III E)
• the thermal field B_therm (Section III F)
Fig. 3 shows a validation of the Landau-Lifshitz torque for a single spin precessing without damping in a constant external field.

B. Magnetostatic field
Magnetostatic convolution A finite difference discretization allows the magnetostatic field to be evaluated as a (discrete) convolution of the magnetization with a demagnetizing kernel K̂:

B_demag(r_i) = Σ_j K̂(r_i − r_j) M(r_j),

where M = M_sat m is the unnormalized magnetization, with M_sat the saturation magnetization (A/m). This calculation is FFT-accelerated based on the well-known convolution theorem. The corresponding energy density is provided as

ε_demag = −½ M · B_demag.

Magnetostatic kernel We construct the demagnetizing kernel K̂ assuming constant magnetization in each finite difference cell [27], and we average the resulting B_demag over the cell volumes. The integration is done numerically, with the number of integration points chosen automatically based on the distance between source and destination cells and their aspect ratios. The kernel is initialized on the CPU in double precision, and only truncated to single precision before transfer to the GPU.
The kernel's mirror symmetries and zero elements are exploited to reduce storage and initialization time. This results in a 9 × or 12 × decrease in kernel memory usage for 2D and 3D simulations respectively, and is part of the reason for MuMax 3 's relatively low memory requirements (Section VII).
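The convolution theorem that underlies this FFT acceleration can be illustrated with a toy 1D scalar analogue of the full tensor convolution; this is a sketch of the principle only, not mumax3's kernel code.

```python
import numpy as np

# Linear convolution of a magnetization M with a kernel K via FFTs,
# zero-padded to length len(M)+len(K)-1 so circular wrap-around
# cannot contaminate the result.
def conv_fft(M, K):
    n = len(M) + len(K) - 1
    F = np.fft.rfft(M, n) * np.fft.rfft(K, n)   # pointwise product in Fourier space
    return np.fft.irfft(F, n)

M = np.array([0.0, 1.0, 0.0, 0.0])
K = np.array([0.5, -1.0, 0.5])
direct = np.convolve(M, K)                       # O(N^2) reference
assert np.allclose(conv_fft(M, K), direct)
```

In the real 3D case the same padding idea produces the "minimal zero-padding" discussed later, and the pointwise product becomes a small matrix-vector product per frequency because K̂ is a tensor.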
Accuracy The short-range accuracy ofK is tested by calculating the demagnetizing factors of a uniformly magnetized cube, analytically known to be -1/3 in each direction. The cube was discretized in cells with varying aspect ratios along z to stress the numerical integration scheme. The smallest possible number of cells was used to ensure that the short-range part of the field has an important contribution. The results presented in Table I are accurate to 3 or 4 digits. Standard Problem #2 (V B) is another test sensitive to the short-range kernel accuracy [14].
The long-range accuracy of the magnetostatic convolution is assessed by comparing the field of a single magnetized cell to that of the corresponding point dipole. The fields presented in Fig. 4 show excellent long-range accuracy for the kernel, indicating accurate numerical integration in that range. The resulting field, obtained by convolution of a single magnetized cell (B_sat = 1 T) with the kernel, is accurate down to about 0.01 µT, the single-precision noise floor introduced by the FFTs.

Periodic boundary conditions MuMax 3 provides optional periodic boundary conditions (PBCs) in each direction. PBCs imply magnetization wrap-around in the periodic directions, felt by stencil operations like the exchange interaction. A less trivial consequence is that the magnetostatic field of the repeated magnetization images has to be added to B_demag.
In contrast to OOMMF's PBC implementation [22], MuMax 3 employs a so-called macro geometry approach [16, 17] in which a finite (though usually large) number of repetitions is taken into account, and that number can be chosen freely in each direction. MuMax 3's setPBC(Px, Py, Pz) command enables P_x, P_y, P_z additional images on each side of the simulation box, which should be chosen sufficiently large.
To test the magnetostatic field with PBCs, we calculate the demagnetizing tensors of a wide film and a long rod in two different ways: either with a large grid without PBCs, or with a small grid with PBCs equivalent to the larger grid. In our implementation, a grid size (N_x, N_y, N_z) with PBCs (P_x, P_y, P_z) should approximately correspond to a grid size (2P_x N_x, 2P_y N_y, 2P_z N_z) without PBCs. This is verified in Tables II and III, where we extend in-plane for the film and along z for the rod. Additionally, for very large sizes both results converge to the well-known analytical values for infinite geometries.

C. Heisenberg exchange interaction
The effective field due to the Heisenberg exchange interaction [10],

B_exch = 2 A_ex/M_sat ∇²m,

is evaluated using a 6-neighbor small-angle approximation [12, 13]:

B_exch = 2 A_ex/M_sat Σ_i (m_i − m)/Δ_i²,

where i ranges over the six nearest neighbors of the central cell with magnetization m, and Δ_i is the cell size in the direction of neighbor i.
At the boundary of the magnet some neighboring magnetizations m i are missing. In that case we use the cell's own value m instead of m i , which is equivalent to employing Neumann boundary conditions [12,13].
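The 6-neighbour stencil with these Neumann boundary conditions can be sketched as follows; replacing a missing neighbour by the central value makes its contribution vanish, exactly as described. Array layout and names are our own illustration, with the 2A_ex/M_sat prefactor taken from the text.

```python
import numpy as np

def exchange_field(m, Aex, Msat, d):
    """6-neighbour exchange field (T). m: (3, Nz, Ny, Nx); d = (dx, dy, dz)."""
    B = np.zeros_like(m)
    for axis, delta in zip((3, 2, 1), d):     # grid axes: 3 = x, 2 = y, 1 = z
        for shift in (+1, -1):
            mi = np.roll(m, shift, axis=axis)  # neighbour values (wrapped)
            # Neumann BC: where the neighbour is missing, re-use the edge cell
            edge = [slice(None)] * 4
            edge[axis] = 0 if shift == +1 else -1
            mi[tuple(edge)] = m[tuple(edge)]
            B += 2 * Aex / Msat * (mi - m) / delta**2
    return B

# uniform magnetization -> exchange field is identically zero
m = np.zeros((3, 1, 8, 8)); m[0] = 1.0
B = exchange_field(m, 13e-12, 800e3, (4e-9, 4e-9, 4e-9))
```

For a uniform state every difference m_i − m vanishes, which is a quick sanity check of the boundary handling.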
The corresponding energy density is provided as

ε_exch = −½ M · B_exch.

MuMax 3 calculates the energy from the effective field using Eqns. 6 and 8. The implementation is verified by calculating the exchange energy of a 1D magnetization spiral, for which the exact form (Eq. 7) is easily evaluated. Fig. 5 shows that the linearized approximation is suitable as long as the angle between neighboring magnetizations is not too large. This can be achieved by choosing a cell size sufficiently small compared to the exchange length.
Inter-region exchange The exchange interaction between different materials deserves special attention. A_ex and M_sat are defined in the cell volumes, while Eq. 6 requires a value of A_ex/M_sat properly averaged between the neighboring cells. For neighboring cells with different material parameters A_ex1, A_ex2 and M_sat1, M_sat2, MuMax 3 uses a harmonic mean:

A_ex/M_sat = 2S / (M_sat1/A_ex1 + M_sat2/A_ex2),

which can easily be derived, and where we set S = 1 by default. S is an arbitrary scaling factor which may be used to alter the exchange coupling between regions, e.g., to lower the coupling between grains or to antiferromagnetically couple two layers.
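This harmonic-mean averaging is a one-liner; the sketch below is our reading of the text, with the scale factor S included.

```python
# Harmonic mean of Aex/Msat across an interface between two regions,
# scaled by an arbitrary factor S (S = 1: normal coupling, S < 0: AF coupling).
def interface_exchange(Aex1, Msat1, Aex2, Msat2, S=1.0):
    return 2.0 * S / (Msat1 / Aex1 + Msat2 / Aex2)
```

For identical parameters on both sides the expression reduces to Aex/Msat, as it must; for strongly dissimilar materials the harmonic mean is dominated by the weaker coupling, which is the physically sensible limit.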

D. Dzyaloshinskii-Moriya interaction
MuMax 3 provides an induced Dzyaloshinskii-Moriya interaction for thin films with out-of-plane symmetry breaking according to [7], yielding an effective field term:

B_dm = 2D/M_sat (∂m_z/∂x, ∂m_z/∂y, −∂m_x/∂x − ∂m_y/∂y),

with D the DMI strength (J/m²),
where we apply boundary conditions [30]: Numerically, all derivatives are implemented as central differences, i.e., the difference between the neighboring magnetizations divided by their distance in that direction:

∂m/∂x ≈ (m(x+Δx) − m(x−Δx)) / (2Δx).

When a neighbor is missing at the boundary (∂V), its magnetization is replaced by m + (∂m/∂i)|_∂V Δ_i, where m refers to the central cell and the relevant partial derivative is selected from Eqs. 11-15.
In case of nonzero D, these boundary conditions are simultaneously applied to the Heisenberg exchange field.
The effective field in Eq. 10 gives rise to an energy density

ε_dm = −½ M · B_dm.

Similar to the exchange case, MuMax 3 calculates the energy density from Eqns. 17 and 10. Eq. 16 is the exact form, which is well approximated for sufficiently small cell sizes.
In Fig. 6, the DMI implementation is compared to the work of Thiaville et al. [35], where the transformation of a Bloch wall into a Néel wall by varying D ex is studied.

E. Magneto-crystalline anisotropy
Uniaxial MuMax 3 provides uniaxial magneto-crystalline anisotropy in the form of an effective field term:

B_anis = 2K_u1/M_sat (u·m) u + 4K_u2/M_sat (u·m)³ u,

where K_u1 and K_u2 are the first- and second-order uniaxial anisotropy constants and u a unit vector indicating the anisotropy direction. This corresponds to an energy density:

ε_anis = −K_u1 (u·m)² − K_u2 (u·m)⁴.

MuMax 3 calculates the energy density from the effective field using

ε_anis = −½ M·B_anis(K_u1) − ¼ M·B_anis(K_u2),

where B_anis(K_ui) denotes the effective field term in which only K_ui is taken into account. The resulting energy is verified in Fig. 7. Since the energy is derived directly from the effective field, this also serves as a test for the field itself.
Cubic MuMax 3 provides cubic magneto-crystalline anisotropy in the form of an effective field term: where K_cn is the n-th order cubic anisotropy constant and c_1, c_2, c_3 a set of mutually perpendicular unit vectors indicating the anisotropy directions. (The user only specifies c_1 and c_2; c_3 is computed automatically as c_1 × c_2.) This corresponds to an energy density: which, just like in the uniaxial case, MuMax 3 computes using the effective field, and which is verified in Fig. 7.

F. Thermal fluctuations
MuMax 3 provides finite temperature by means of a fluctuating thermal field B_therm according to Brown [9]:

B_therm = η √(2 µ0 α k_B T / (B_sat γ_LL ΔV Δt)),

where α is the damping parameter, k_B the Boltzmann constant, T the temperature, B_sat the saturation magnetization expressed in Tesla, γ_LL the gyromagnetic ratio (rad/Ts), ΔV the cell volume, Δt the time step and η a random vector drawn from a standard normal distribution, whose value is changed after every time step.
Solver constraints B_therm changes randomly between time steps. Therefore, only MuMax 3's Euler and Heun solvers (Section IV) can be used, as they do not require torque continuity between steps. Additionally, with thermal fluctuations enabled we enforce a fixed time step Δt, which avoids feedback issues with adaptive time step algorithms.

Verification We test our implementation by calculating the thermal switching rate of a single (macro-)spin particle with easy uniaxial anisotropy. In the limit of a barrier high compared to the thermal energy, the switching rate f is known analytically to be [8]: Fig. 8 shows Arrhenius plots of the temperature-dependent switching rate of a particle with volume V = (10 nm)³ and K_u1 = 1×10⁴ or 2×10⁴ J/m³. The MuMax 3 simulations correspond well to Eq. 25.
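Assuming the standard Brown form √(2µ0 α k_B T/(B_sat γ_LL ΔV Δt)) for the fluctuation amplitude (our reading of the text), the magnitude of the thermal field per cell can be evaluated numerically. All parameter values below are illustrative.

```python
import math

mu0 = 4e-7 * math.pi        # vacuum permeability (T m/A)
kB = 1.380649e-23           # Boltzmann constant (J/K)
gamma_LL = 1.7595e11        # gyromagnetic ratio (rad/Ts)

def thermal_field_amplitude(alpha, T, Bsat, dV, dt):
    """RMS amplitude (T) of each component of the thermal field."""
    return math.sqrt(2 * mu0 * alpha * kB * T / (Bsat * gamma_LL * dV * dt))

# e.g. a (10 nm)^3 cell with Bsat = 1 T at room temperature, dt = 0.1 ps
B = thermal_field_amplitude(alpha=0.01, T=300, Bsat=1.0, dV=(10e-9)**3, dt=1e-13)
```

Note the 1/√(ΔV Δt) dependence: smaller cells and shorter time steps see proportionally stronger fluctuations, which is why a fixed Δt matters here.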

G. Zhang-Li Spin-transfer torque
MuMax 3 includes a spin-transfer torque term according to Zhang and Li [38], applicable when electrical current flows through more than one layer of cells: where j is the current density, ξ is the degree of non-adiabaticity, µ B the Bohr magneton and B sat the saturation magnetization expressed in Tesla. The validity of our implementation is tested by Standard Problem #5 (Section V E).

H. Slonczewski Spin-transfer torque
MuMax 3 provides a spin momentum torque term according to Slonczewski [33, 37], transformed to the Landau-Lifshitz formalism: where j_z is the current density along the z axis, d the free-layer thickness, m_P the fixed-layer magnetization, P the spin polarization, Λ the Slonczewski parameter characterizing the spacer layer, and ε′ the secondary spin-torque parameter.
MuMax 3 only explicitly models the free layer magnetization. The fixed layer is handled in the same way as material parameters and is always considered to be on top of the free layer. The fixed layer's stray field is not automatically taken into account, but can be pre-computed by the user and added as a space-dependent external field term.
As a verification we consider switching an MRAM bit of 160 nm × 80 nm × 5 nm Permalloy (M_sat = 800×10³ A/m, A_ex = 13×10⁻¹² J/m, α = 0.01, P = 0.5669) by a total current of −6 mA along the z axis, using Λ = 2 and ε′ = 1. These parameters were chosen so that none of the terms in Eq. 28 are zero. The fixed layer is polarized at 20° from the x axis to avoid symmetry problems, and the initial magnetization was chosen uniform along x. The MuMax 3 and OOMMF results shown in Fig. 9 correspond well.

A. Dynamics
MuMax 3 provides a number of explicit Runge-Kutta methods for advancing the Landau-Lifshitz equation (Eq. 2): • RK45, the Dormand-Prince method, offers 5th-order convergence and a 4th-order error estimate used for adaptive time step control. This is the default for dynamical simulations.
• RK23, the Bogacki-Shampine method, offers 3rd-order convergence and a 2nd-order error estimate used for adaptive time step control. This method is used when relaxing the magnetization to its ground state, in which case it performs better than RK45.
• RK12, Heun's method, offers 2nd order convergence and a 1st order error estimate. This method is used for finite temperature simulations as it does not require torque continuity in between time steps.
• RK1, Euler's method, is provided for academic purposes.
These solvers' convergence rates are verified in Fig. 10, which serves as a test for their implementation and performance.
Adaptive time step RK45, RK23 and RK12 provide adaptive time step control, i.e., the time step is chosen automatically to keep the error per step close to a preset value ε₀. As the error per step we use ε = max|τ_high − τ_low| Δt, with τ_high and τ_low the high-order and low-order torque estimates provided by the particular Runge-Kutta method, and Δt the time step. The time step is adjusted with a default headroom of 0.8. In MuMax 3, ε₀ is accessible as the variable MaxErr. Its default value of 10⁻⁵ was adequate for the presented standard problems. The relation between ε₀ and the overall error at the end of a simulation is in general hard to determine. Nevertheless, we illustrate it in Fig. 11 for a single period of spin precession under the same conditions as Fig. 10. The absolute error per precession scales linearly with ε₀, although its absolute value depends on the solver type and the exact simulation conditions.
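The step-control rule described above can be sketched as follows. The error setpoint ε₀ and the 0.8 headroom come from the text; the 1/order rescaling exponent and the growth/shrink limits are common choices we assume for illustration, not quoted from mumax3.

```python
# Adaptive step control: keep err = max|tau_high - tau_low| * dt near eps0.
def next_timestep(dt, err, eps0, order=5, headroom=0.8):
    if err == 0:
        return 2 * dt                            # error vanished: grow the step
    factor = headroom * (eps0 / err) ** (1.0 / order)
    return dt * min(2.0, max(0.5, factor))       # limit growth/shrink per step

def accepted(err, eps0):
    return err <= eps0                           # otherwise reject and redo the step
```

When err equals ε₀ exactly, the headroom shrinks the step slightly (factor 0.8), which keeps the controller safely below the setpoint instead of oscillating around it.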

B. Energy minimization
MuMax 3 provides a relax() function that attempts to find the system's energy minimum. This function disables the precession term in Eq. 2, so that the effective field points towards decreasing energy. Relax first advances in time until the total energy cuts into the numerical noise floor. At that point the state is already close to equilibrium. We then monitor the magnitude of the torque instead of the energy, since close to equilibrium the torque decreases monotonically and is less noisy than the energy. We advance further until the torque cuts into the noise floor as well. Each time that happens, we decrease MaxErr and continue, until MaxErr = 10⁻⁹. At that point it does not make sense to increase the accuracy any further (see Fig. 11), and we stop advancing.
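The descend-until-noise-floor loop at the heart of this strategy can be sketched schematically. Everything here is a toy stand-in: the stub "system" replaces the real solver, and the stopping criterion is a simple relative-decrease test, not mumax3's actual noise-floor detection.

```python
def descend(step, measure, rel_tol=1e-6):
    """Call step() until measure() stops decreasing by more than rel_tol."""
    prev = measure()
    while True:
        step()
        cur = measure()
        if cur >= prev * (1 - rel_tol):   # no measurable decrease: noise floor hit
            return cur
        prev = cur

# toy system: an exponentially decaying "energy"
state = {"E": 1.0}
def step(): state["E"] *= 0.5
descend(step, lambda: state["E"])
```

The real relax() runs this pattern twice: first on the energy, then on the torque magnitude while progressively tightening MaxErr.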
This Relax procedure was used in the presented standard problems, where it proved adequate. Typical residual torques after Relax are of the order of 10 −4 -10 −7 γ LL T, indicating that the system is indeed very close to equilibrium. Nevertheless, as with any energy minimization technique, there is always a possibility that the system ends up in a saddle point or very flat part of the energy landscape.
Relax internally uses the RK23 solver, which we found to perform better than RK45 in most relaxation scenarios. Near equilibrium, both solvers tend to take similarly large time steps, but RK23 needs only half as many torque evaluations per step as RK45.

V. STANDARD PROBLEMS
In this section we provide solutions to micromagnetic standard problems #1-4 provided by the µMag modeling group [2] and standard problem #5 proposed by Najafi et al. [28]. Reference solutions were taken from [2] as noted, or otherwise calculated with OOMMF 1.2 alpha 5 bis [15].

A. Standard Problem #1
The first µMag standard problem involves the hysteresis loops of a 1 µm × 2 µm × 20 nm Permalloy rectangle (A_ex = 1.3×10⁻¹¹ J/m, M_sat = 8×10⁵ A/m, K_u1 = 5×10² J/m³ uniaxial, with the easy axis nominally parallel to the long edges of the rectangle) for the field approximately parallel to the long and the short axis, respectively. Our solution is presented in Fig. 12. Unfortunately, the submitted solutions [2] do not agree with each other, making it impossible to assert correctness in a quantitative way.

B. Standard Problem #2
The second µMag standard problem considers a thin film of width d, length 5d and thickness 0.1d, all expressed in terms of the exchange length l_ex = √(2A_ex/(µ0 M_sat²)). The remanence and the coercive field, expressed in units of M_sat, are to be calculated as a function of d/l_ex.
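For orientation, the exchange length defined above is easily evaluated; for Permalloy-like parameters (values illustrative) it comes out at a few nanometres.

```python
import math

mu0 = 4e-7 * math.pi   # vacuum permeability (T m/A)

def exchange_length(Aex, Msat):
    """l_ex = sqrt(2*Aex / (mu0 * Msat^2)), in metres."""
    return math.sqrt(2 * Aex / (mu0 * Msat**2))

l_ex = exchange_length(13e-12, 800e3)   # Permalloy-like: roughly 5-6 nm
```

This is also the length scale against which the cell size must be small for the linearized exchange of Section III C to hold.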
The coercivity, shown in Fig.14, behaves interestingly in the small-particle limit where an analytical solution exists [14]. In that case the magnetization is uniform and the magnetostatic field dominates the behaviour. Of the solutions submitted to the µMag group [14,24,26,34], the Streibl [34], Donahue [14] (OOMMF 1.1) and MuMax 3 results best approach the small-particle limit. It was shown by Donahue et al. [14] that proper averaging of the magnetostatic field over each cell volume is needed to accurately reproduce the analytical limit. Hence this standard problem serves as a test for the numerical integration of our demagnetizing kernel.

C. Standard Problem #3
(Fig. 14 caption fragment: the solutions are compared to [2]; the slight discrepancy at high d is attributed to OOMMF's solution using larger cells there, and the analytical limit for very small sizes is due to Donahue et al. [14].)

Standard problem #3 considers a cube with edge length L expressed in exchange lengths l_ex = √(2A_ex/(µ0 M_sat²)). The magnet has uniaxial anisotropy with K_u1 = 0.1 K_m, where K_m = ½ µ0 M_sat², and the easy axis parallel to the z-axis. The critical edge size L at which the ground state transitions between a quasi-uniform and a vortex-like state needs to be found; it is expected around L = 8.
This problem was solved using a 16 × 16 × 16 grid. The cube was initialized with about 3,000 different random initial magnetization states, for random edge lengths L between 7.5 and 9, and relaxed to equilibrium. Four stable states were found, shown in Fig. 15: a quasi-uniform flower state (a), a twisted flower state (b), a vortex state (c) and a canted vortex (d). Then cubes of different sizes were initialized to these states and relaxed to equilibrium. The resulting energy of each state, shown in Fig. 15, reveals the absolute ground state in the considered range: the flower state for L < 8.16, the twisted flower for 8.16 < L < 8.47 and the vortex for L > 8.47.
The transition at L=8.47 is in quantitative agreement with the solutions posted to µMag by Rave et al. [29] and by Martins et al. [2]. The existence of the twisted flower state was already noted by Hertel et al. [2], although without determining the flower to twisted flower transition point.

VI. EXTENSIONS
MuMax 3 is designed to be modular and extensible. Some of our extensions, described below, have been merged into the mainline code because they may be of general interest. Nevertheless, extensions are considered specific to certain needs and are less generally usable than the aforementioned main features. E.g., MFM images and Voronoi tessellation are only implemented in 2D and only qualitatively tested.

A. Moving frame
MuMax 3 provides an extension to translate the magnetization with respect to the finite difference grid (along the x-axis), inserting new values from the side. This allows the simulation window to seemingly "follow" a region of interest, like a domain wall moving in a long nanowire, without having to simulate the entire wire. MuMax 3 can automatically translate the magnetization to keep a chosen average magnetization component as close to zero as possible, or the user may translate m arbitrarily from the input script.
When following a domain wall in a long in-plane magnetized wire, we also provide the possibility to remove the magnetic charges on the ends of the wire. This simulates an effectively infinitely long wire without closure domains, as illustrated in Fig.18.
Finally, when shifting the magnetization there is an option to also shift the material regions and geometry along. The geometry and material parameters for the "new" cells that enter the simulation from the side are automatically re-calculated so that new grains and geometrical features may seamlessly enter the simulation. This can be useful for, e.g., simulating a long racetrack with notches like illustrated in Fig.18, or a moving domain wall in a grainy material as published in [23].

B. Voronoi Tessellation
MuMax 3 provides 2D Voronoi tessellation as a way to simulate grains in thin films, similar to OOMMF [21]. MuMax 3 can set up the regions map with grain-shaped islands, randomly colored with up to 256 region numbers (Fig. 19(a)). The material parameters in each of those regions can then be varied to simulate, e.g., grains with randomly distributed anisotropy axes, or even to change the exchange coupling between the grains (Fig. 19(b)).
Our implementation is compatible with the moving simulation window. E.g., when the simulation window follows a moving domain wall, new grains automatically enter the simulation from the sides. The new grains are generated using hashes of the cell coordinates, so that there is no need to store a (potentially very large) map of all the grains beyond the current simulation grid. More details can be found in [23].

FIG. 18: Top frame: magnetization in a 1 µm wide, 20 nm thick Permalloy wire of finite length. The remaining frames apply edge charge removal to simulate an infinitely long wire. The domain wall is driven by a 3×10¹² A/m² current while being followed by the simulation window, so that it appears steady even though it moves at high speed (visible by the wall breakdown). While moving, new notches enter the simulation from the right.
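The coordinate-hashing idea can be sketched as follows: a cell's region number follows deterministically from (a hash of) its grain coordinates, so grains entering a moving window never need a stored map. The hash choice and layout below are our own illustration, not mumax3's implementation.

```python
import hashlib

def grain_region(ix, iy, grain_size, num_regions=256, seed=0):
    """Deterministic region index (0..num_regions-1) for cell (ix, iy)."""
    gx, gy = ix // grain_size, iy // grain_size   # which (square) grain the cell is in
    h = hashlib.sha256(f"{seed}:{gx}:{gy}".encode()).digest()
    return h[0] % num_regions

# the same coordinates always yield the same grain, wherever the window is
assert grain_region(5, 7, grain_size=4) == grain_region(5, 7, grain_size=4)
```

Because the mapping is a pure function of absolute coordinates, shifting the window and re-evaluating it reproduces exactly the grains that would have been there all along.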

C. Magnetic force microscopy
MuMax 3 has a built-in capability to generate magnetic force microscopy (MFM) images in dynamic (AC) mode from a 2D magnetization. We calculate the derivative of the force between tip and sample from the convolution: where B_tip is the tip's stray field evaluated in the sample plane. MuMax 3 provides the field of an idealized dipole or monopole tip at arbitrary elevation. No attempt is made to reproduce the tip field in absolute terms, as our only goal is to produce output proportional to the actual MFM contrast, as shown in Fig. 20.
Eq. 31 is implemented using FFT-acceleration similar to the magnetostatic field, and is evaluated on the GPU. Hence MFM image generation is very fast and generally takes only a few milliseconds. This makes it possible to watch "real-time" MFM images in the web interface while the magnetization is evolving in time.

A. Simulation size
Nowadays GPUs offer massive computational performance of several TFlop/s per device. However, that compute power is only fully utilized with sufficient parallelization, i.e., for sufficiently large simulations. This is clearly illustrated by considering how many cells can be processed per second, N_cells/t_update, with t_update the time needed to calculate the torque for N_cells cells. We refer to this quantity as the throughput. Given the overall complexity of O(N log N), one would expect a nearly constant throughput that slowly degrades at large N. For all presented throughputs, the magnetostatic and exchange interactions were enabled and no output was saved.
The throughput presented in Fig. 21 for a square 2D simulation on a GTX TITAN GPU only exhibits the theoretical, nearly constant behaviour starting from about 256,000 cells. Below that size the GPU is not fully utilized and performance drops. Fortunately, large simulations are exactly where GPU speed-up is needed most and where performance is optimal.
MuMax 3's performance is dominated by FFT calculations using the cuFFT library, which performs best for power-of-two sizes and acceptably for 7-smooth numbers (having only factors 2, 3, 5 and 7). Other sizes, especially primes, should be avoided. This is clearly illustrated in Fig. 21, where sizes other than the recommended ones show a performance penalty of up to about an order of magnitude. Somewhat oversizing the grid to a nearby smooth number may therefore benefit performance.
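Rounding a grid dimension up to the next 7-smooth number is a short helper (illustrative, not part of mumax3):

```python
def is_smooth(n):
    """True if n has no prime factors other than 2, 3, 5 and 7."""
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1

def next_smooth(n):
    """Smallest 7-smooth number >= n."""
    while not is_smooth(n):
        n += 1
    return n
```

E.g., a prime-sized grid of 2011 cells along one axis would be padded up to 2016 = 2⁵·3²·7, trading a little memory for much faster FFTs.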
Note that the data in Fig. 22 is for a 2D simulation. Typically, a 3D simulation with the same total number of cells costs an additional factor of about 1.5× in compute time and memory, due to the additional FFTs along the z-axis.
On the other hand, simulations with periodic boundary conditions run considerably faster than their non-periodic counterparts. This is due to the absence of zero-padding, which reduces the FFT sizes by a factor of 2 in each periodic direction. Memory consumption is considerably lower in this case as well.

B. Hardware
Apart from the simulation size, MuMax 3's performance is strongly affected by the particular GPU hardware. We highlight the differences between several GPU models by comparing their throughput in Fig. 22. This was done for a 4M-cell simulation in which all tested GPUs were fully utilized, so the numbers are indicative for all sufficiently large simulation sizes.
We also included OOMMF's throughput on a quad-core 2.1 GHz Core i7 CPU to give a rough impression of the GPU speed-up. The measured OOMMF performance (not clearly distinguishable in Fig. 22) was around 4×10⁶ cells/s. So with a proper GPU and sufficiently large grid sizes, a speed-up of 20-45× with respect to a quad-core CPU can be reached or, equivalently, a speed-up of 80-180× compared to a single core. This is in line with earlier MuMax1 and MicroMagnum benchmarks [5, 36]. It must be noted, however, that OOMMF operates in double precision, in contrast to MuMax 3's single-precision arithmetic, and also does not suffer reduced throughput for small simulations.
Finally, MicroMagnum's throughput (not presented) was found to be nearly indistinguishable from MuMax 3's. This is unsurprising, since both MuMax 3's and MicroMagnum's performance are dominated by CUDA's FFT routines. In our benchmarks on a GTX 650M, the differences between the two packages were comparable to the noise on the timings.

C. Memory use
In contrast to their massive computational power, GPUs are typically limited to rather small amounts of memory (currently 1-6 GB). Therefore, MuMax 3 was heavily optimized to use as little memory as possible. E.g., we exploit the magnetostatic kernel's symmetries and zero elements, and make heavy use of memory pooling and recycling. MuMax 3 also employs minimal zero-padding in the common situation of 3D simulations with only a small number of layers: for up to 10 layers there is no need to round the number of layers up to a power of two, which reduces memory usage as well.
In this way, MuMax 3 on a GPU with only 2 GB of memory is able to simulate about 9 million cells in 2D and 6 million in 3D, about 2× more than MicroMagnum v0.2 [5] (see Fig. 23). When using a lower-order solver, this number can be further increased to 12×10⁶ cells with RK23 (2D) or 16×10⁶ cells with RK12 (2D), all in 2 GB. Cards like the GTX TITAN and K20XM, with 6 GB of RAM, can store proportionally more, e.g., 31M cells in 2D with the RK45 solver.

VIII. CONCLUSION
We have presented in detail the micromagnetic model employed by MuMax 3, as well as a verification of each of its components. GPU acceleration provides a speed-up of 1-2 orders of magnitude compared to CPU-based micromagnetic simulations. In addition, MuMax 3's low memory requirements open up the possibility of very large-scale micromagnetic simulations, a regime where the GPU's potential is fully utilized and where the speed-up is also needed most. E.g., depending on the solver type, MuMax 3 can fit 10-16 million cells in 2 GB of GPU RAM, about 2× more than MuMax2 or MicroMagnum.