Grace: a Cross-platform Micromagnetic Simulator On Graphics Processing Units

A micromagnetic simulator running on graphics processing units (GPUs) is presented. It achieves a significant performance boost compared to previous central processing unit (CPU) simulators, up to two orders of magnitude for large input problems. Unlike the GPU implementations of other research groups, this simulator is developed with C++ Accelerated Massive Parallelism (C++ AMP) and is compatible across hardware platforms. It runs on GPUs from vendors including NVidia, AMD and Intel, which paves the way for fast micromagnetic simulation on both high-end workstations with dedicated graphics cards and low-end personal computers with integrated graphics. A copy of the simulator software is publicly available.


Introduction
Micromagnetic simulators are critical tools for studying magnetic dynamics and developing new magnetic devices. Central processing unit (CPU) based simulators such as OOMMF [1] and magpar [2] are widely used in academic research and industrial applications.  [4]. Still, simulation can be slow for large problem sizes, as the processing power of a CPU is limited.
Recently, several research groups have implemented micromagnetic simulators on graphics processing units (GPUs) [5][6][7][8][9]. The purpose is to use the high computing power of the GPU to speed up the simulation. In addition, the cost of a GPU (usually less than $1000 for a high-end product) is much lower than that of a CPU-based cluster. Furthermore, the FFT algorithm used to evaluate the demagnetization field can be easily adapted to the GPU because it is usually implemented in numerical libraries developed by hardware vendors.
The previously mentioned GPU simulators are all based on NVidia's Compute Unified Device Architecture (CUDA), which limits their application to NVidia GPUs. Simulators written in CUDA cannot run on GPUs manufactured by other vendors, such as AMD or Intel. Given that these GPUs are popular in professional workstations and personal computing devices, it is desirable to develop a micromagnetic simulator that is not only GPU-accelerated but also portable across hardware platforms.
In this paper, Grace, a cross-platform micromagnetic simulator, is demonstrated with a speed-up factor of over 100 relative to CPU calculation for large problem sizes. Section 2 discusses the formulation of micromagnetic simulation. Section 3 describes the implementation of the formulation on the GPU. In section 4 the performance of the simulator is evaluated and the μMAG standard problem #4 [10] is used to validate the calculation result. Software download and usage information is given in section 5. Finally, section 6 summarizes the paper and discusses potential future work.

Formulation
The energy density consists of the exchange, anisotropy, demagnetization and Zeeman energy densities, where $A$ is the material exchange constant, $K_u$ is the uniaxial anisotropy constant, $\mu_0$ is the vacuum permeability, $\mathbf{H}_{demag}$ is the demagnetization field and $\mathbf{H}_{extern}$ is the external field. The anisotropy is assumed to be uniaxial, with its axis along the x direction.
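For orientation, one standard way of writing these four contributions is sketched below. It is a textbook form assumed here rather than quoted from the manuscript: the symbol $\varepsilon$ for the energy density, the unit magnetization $\mathbf{m} = \mathbf{M}/M_s$ and the saturation magnetization $M_s$ are introduced only for this sketch, and the easy axis is taken along $\hat{\mathbf{x}}$ as stated above.

$$
\varepsilon = A\left(\nabla \mathbf{m}\right)^2 + K_u\left[1 - \left(\mathbf{m}\cdot\hat{\mathbf{x}}\right)^2\right] - \frac{1}{2}\mu_0 M_s\,\mathbf{m}\cdot\mathbf{H}_{demag} - \mu_0 M_s\,\mathbf{m}\cdot\mathbf{H}_{extern}
$$

The factor of 1/2 in the demagnetization term reflects that this energy is a self-interaction of the magnetization and would otherwise be counted twice.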
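The magnetization dynamics is driven by the effective magnetic field $\mathbf{H}_{eff}$, which is calculated from the magnetic energy density and is the sum of four contributions:

$$\mathbf{H}_{eff} = \mathbf{H}_{exch} + \mathbf{H}_{anis} + \mathbf{H}_{demag} + \mathbf{H}_{extern}$$

where $\mathbf{H}_{exch}$ is the exchange field and $\mathbf{H}_{anis}$ is the anisotropy field. A detailed account of how to calculate each term of the effective field can be found in [11].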
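As an illustration of one effective-field term, the sketch below evaluates the exchange field with a six-neighbor finite-difference stencil, the scheme named in the Implementation section. It assumes the textbook expression $\mathbf{H}_{exch} = \frac{2A}{\mu_0 M_s}\nabla^2\mathbf{m}$ with $\mathbf{m}$ the unit magnetization. The function name `exchange_field`, the nested-list data layout, the treatment of missing neighbors at the surface (they are skipped) and the permalloy-like parameter values are all assumptions made for this example, not taken from Grace.

```python
import math

MU0 = 4.0e-7 * math.pi  # vacuum permeability in T*m/A

# Offsets of the six nearest neighbors on a regular cubic grid.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def exchange_field(m, A, Ms, cell):
    """Exchange field (A/m) for unit vectors m[i][j][k] = (mx, my, mz).

    A is the exchange constant (J/m), Ms the saturation magnetization (A/m)
    and cell the cubic cell edge (m). Neighbors outside the sample are
    skipped, which is an assumed free-surface boundary treatment.
    """
    nx, ny, nz = len(m), len(m[0]), len(m[0][0])
    pref = 2.0 * A / (MU0 * Ms * cell * cell)
    h = [[[None] * nz for _ in range(ny)] for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                acc = [0.0, 0.0, 0.0]
                for di, dj, dk in NEIGHBORS:
                    a, b, c = i + di, j + dj, k + dk
                    if 0 <= a < nx and 0 <= b < ny and 0 <= c < nz:
                        for q in range(3):
                            acc[q] += m[a][b][c][q] - m[i][j][k][q]
                h[i][j][k] = tuple(pref * v for v in acc)
    return h


# Illustrative values only (permalloy-like), with 1 nm cells.
A_EX, MS, CELL = 1.3e-11, 8.0e5, 1.0e-9

uniform = [[[(1.0, 0.0, 0.0) for _ in range(2)] for _ in range(2)] for _ in range(2)]
h_uniform = exchange_field(uniform, A_EX, MS, CELL)

two_cells = [[[(1.0, 0.0, 0.0)]], [[(0.0, 1.0, 0.0)]]]
h_two = exchange_field(two_cells, A_EX, MS, CELL)
```

A uniform magnetization gives zero exchange field, while in the two-cell case each cell feels a field that pulls it toward its neighbor. Every cell is updated independently of the others, which is the kind of data parallelism a GPU kernel can exploit.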
The Landau-Lifshitz-Gilbert (LLG) equation, in the form given in [12], governs the magnetic dynamics in the low damping limit; in it, $\alpha$ is the damping constant and $\gamma$ is the gyromagnetic ratio.
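A commonly used explicit form of the LLG equation, written for the unit magnetization $\mathbf{m} = \mathbf{M}/M_s$, is

$$\frac{d\mathbf{m}}{dt} = -\frac{\gamma}{1+\alpha^2}\left[\mathbf{m}\times\mathbf{H}_{eff} + \alpha\,\mathbf{m}\times\left(\mathbf{m}\times\mathbf{H}_{eff}\right)\right]$$

This is the textbook expression, assumed here rather than quoted from [12]; for small $\alpha$ the factor $1+\alpha^2$ is close to 1. The sketch below evaluates this right-hand side for a single cell and advances it with the Euler method mentioned in the Implementation section. The function names, the renormalization after each step, and the numerical values (a field-scaled $\gamma$ of about 2.211e5 m/(A·s), a field of 8e4 A/m, $\alpha$ = 0.5, a 0.1 ps step) are illustrative assumptions, not Grace's settings.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def llg_rhs(m, h_eff, gamma, alpha):
    """dm/dt from the explicit LLG form for unit magnetization m."""
    mxh = cross(m, h_eff)        # precession term
    mxmxh = cross(m, mxh)        # damping term
    pref = -gamma / (1.0 + alpha * alpha)
    return tuple(pref * (p + alpha * d) for p, d in zip(mxh, mxmxh))


def euler_step(m, h_eff, gamma, alpha, dt):
    """One forward-Euler step, followed by renormalization to |m| = 1."""
    dm = llg_rhs(m, h_eff, gamma, alpha)
    new = [mi + dt * di for mi, di in zip(m, dm)]
    norm = math.sqrt(sum(c * c for c in new))
    return tuple(c / norm for c in new)


GAMMA = 2.211e5            # m/(A*s), gyromagnetic ratio scaled by mu_0 (assumed)
m = (1.0, 0.0, 0.0)        # start along x
h = (0.0, 0.0, 8.0e4)      # constant field along z, in A/m
for _ in range(20000):     # 20000 steps of 0.1 ps = 2 ns
    m = euler_step(m, h, GAMMA, 0.5, 1.0e-13)
print(m)
```

With a large damping constant the magnetization spirals toward the field direction, which is also how a relaxed state is typically reached before a switching field is applied.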

Implementation
The most critical step in micromagnetic simulation, as mentioned before, is the calculation of the demagnetization field. The direct calculation for N sources at N observers requires a computing time of $O(N^2)$. But since the demagnetization field is actually the convolution of the magnetization with the demagnetization tensor on a regular discretization of the material, the computation time can be reduced to $O(N \log N)$ by applying the discrete convolution theorem and the FFT. Non-periodic boundary conditions can be handled by adopting the zero-padding method [3]. The exchange field, on the other hand, is calculated with a six-neighbor scheme [13]. The time integration of the LLG equation is implemented with the Euler method.
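The convolution argument can be illustrated in one dimension. The sketch below compares a direct $O(N^2)$ sum with an FFT-based evaluation that zero-pads both arrays to at least twice the sample length, so that the circular convolution computed by the FFT equals the non-periodic one. It is a toy under stated assumptions: `toy_kernel` is an arbitrary decaying stand-in for the demagnetization tensor, the magnetization is a scalar per cell, and the small recursive radix-2 FFT replaces the vendor FFT library a real simulator would use.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def toy_kernel(d):
    """Stand-in interaction between cells separated by d cells (not the real tensor)."""
    return 1.0 / (1.0 + abs(d)) ** 3


def direct_convolution(kernel, m):
    """O(N^2): every observer i sums over every source j."""
    n = len(m)
    return [sum(kernel(i - j) * m[j] for j in range(n)) for i in range(n)]


def fft_convolution(kernel, m):
    """O(N log N): zero-pad, multiply in Fourier space, transform back."""
    n = len(m)
    size = 1
    while size < 2 * n:                    # padded length, a power of two >= 2N
        size *= 2
    m_pad = list(m) + [0.0] * (size - n)
    k_pad = [0.0] * size
    for d in range(-(n - 1), n):           # negative offsets wrap to the end
        k_pad[d % size] = kernel(d)
    product = [a * b for a, b in zip(fft(m_pad), fft(k_pad))]
    h = fft(product, inverse=True)
    return [h[i].real / size for i in range(n)]   # keep only the physical cells


m = [1.0] * 5 + [-1.0] * 3                 # toy magnetization with one domain wall
h_direct = direct_convolution(toy_kernel, m)
h_fft = fft_convolution(toy_kernel, m)
print(max(abs(a - b) for a, b in zip(h_direct, h_fft)))
```

The two results agree to rounding error. In a production code the transform of the kernel depends only on the geometry, so it can be computed once and reused at every time step.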
The GPU is the backbone hardware that accelerates the simulation. In contrast to a CPU, which has a limited number of arithmetic logic units (ALUs) but complicated control logic, a GPU has a much greater number of ALUs but less control logic for each ALU. As a result, the GPU is suitable for compute-intensive, highly parallel but simple algorithms. This is why large-scale micromagnetic simulation, with the aid of the FFT algorithm, is an ideal case for GPU acceleration.
The software platform is C++ Accelerated Massive Parallelism (C++ AMP) [14], an open-specification library developed by Microsoft for implementing data parallelism directly in C++. Compared to other popular parallel computing languages (such as CUDA), it is compatible with different hardware platforms, so a program written in C++ AMP can migrate to a different GPU without any modification. It also features a simplified application programming interface (API) that makes GPU programming easier for developers.
The computing power of a GPU is considerable (more than 1 trillion floating point operations per second, or 1 TFLOPS, for a high-end product), but transferring data between CPU and GPU is much slower (about 10 GB/sec) and is the bottleneck of high performance GPU computing. To maximize the simulation speed of Grace, all calculation is done on the GPU, except for reading input from the user and writing data to the output file.
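A back-of-the-envelope estimate shows why keeping the data on the GPU matters. The sketch below uses only the two figures quoted above (about 10 GB/sec transfer rate and 1 TFLOPS peak throughput). The single-precision storage (4 bytes per value), the three vector components per cell and the function names are assumptions made for this example.

```python
def one_way_copy_seconds(n, components=3, bytes_per_value=4, bandwidth=10e9):
    """Time to copy an n*n*n vector field once between CPU and GPU."""
    return n ** 3 * components * bytes_per_value / bandwidth


def flops_forgone_per_cell(n, peak_flops=1e12):
    """Floating point operations the GPU could have done per cell during one copy."""
    return one_way_copy_seconds(n) * peak_flops / n ** 3


for n in (32, 64, 128):
    print(n, one_way_copy_seconds(n), flops_forgone_per_cell(n))
```

Under these assumptions a single copy of a 128×128×128 magnetization array takes about 2.5 ms, and for any N each copy costs the time of roughly 1200 floating point operations per cell at peak rate. Copying the field back and forth at every time step would therefore dominate the run time, which is consistent with restricting transfers to input and output.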
The FFT algorithm used to calculate the demagnetization field is based on the C++ AMP FFT library [15]. At large input sizes it can be two orders of magnitude faster than CPU-based FFT libraries such as FFTW, which ensures the performance of the simulator.

Performance and Validation
To benchmark the performance of the simulator, a cubic magnetic sample with exchange constant 100  was studied. The sample was divided into a grid of N×N×N cells and brought to its relaxed state by applying the LLG equation to each cell. The testing hardware was an AMD Radeon HD 7970 GHz Edition GPU with an Intel Xeon E5410 CPU. The GPU in use was among the fastest available on the consumer market but still cost less than $500. For comparison, benchmark data for OOMMF from another research group [6], obtained on an Intel i7-930 CPU, are also presented.
is large at small N, which means that a large share of the time is spent on data I/O; this hinders the overall performance. Second, the GPU has a kernel launch overhead that is constant and independent of problem size. This overhead becomes significant at smaller problem sizes.
It can also be noticed that the computing time of the GPU did not increase much with problem size at smaller problem sizes (8 < N < 32), but increased steadily once N exceeded 32. This can be explained by the hardware architecture of the GPU. The GPU in use (AMD Radeon HD 7970) has as many as 2048 stream processors that can process data concurrently. For smaller problems it is inevitable that some of these processors are left idle, so the processing time does not change much if the problem size is slightly increased. Only at large problem sizes are these processors fully utilized, and a steady increase in processing time with problem size is observed, as shown in figure 1 at N > 32.
The mag standard problem #4 [10] was used to validate the calculation result. In this problem a thin film sample is divided in to and no anisotropy. Before applying external fields to reverse the magnetization, the system is relaxed to S-state by setting a large damping constant. Then two tests are carried out separately, one with field 1 of (-24.6 mT, 4.3 mT, 0 mT) and the other with field 2 of (-35.5 mT, -6.3 mT, 0 mT). The damping constant α is set to 0.02 for both tests.    According to figure 2-5 the average magnetization results and the magnetization distribution from Grace is in good agreement with that from OOMMF. The result is thus reliable.

Software download and usage
A preview version of Grace can be downloaded at https://sites.google.com/site/gracegpu/. A GPU that supports DirectX 11 or newer and Windows 7 or later are required to run the software. Most computers manufactured after 2009 meet these requirements. Grace uses a simple and straightforward input file, an output file and a gnuplot script file to visualize the simulation results. Both input and output files use ASCII plain text format to store data for the best compatibility.
In the sample input file, outputInterval is the interval at which simulation data are written to the output file, i.e. one write operation is executed every 10 timesteps. dt is the timestep in nanoseconds. The simulation is set to cover a time span of 10000 × 1e-4 = 1 nanosecond. In the line that follows, nx, ny and nz are the sample dimensions in nanometers. The default discretization is 1 nm × 1 nm × 1 nm. Other parameters are self-explanatory. Detailed instructions on how to use Grace can be found on the website mentioned earlier.

Summary
To the best knowledge of the author, Grace is the first implementation of a micromagnetic simulator in C++ AMP and is fully hardware independent. It can run on high-end professional graphics workstations and also on low-end personal laptops with integrated GPUs. A speed-up factor of over 100 is achieved for large simulations. More features will be added to Grace in the future, including support for irregular geometries and adaptive time steps.