Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model☆
Introduction
Various scientific disciplines have profited from GPU computing in recent years and report impressive speedup factors compared to single-core Central Processing Unit (CPU) implementations. GPU stands for Graphics Processing Unit, a high-performance many-core processor that can be used to accelerate a wide range of applications. Significant savings of computing time have meanwhile been reported in a huge variety of fields: GPU acceleration can be used in astronomy [1] and radio astronomy [2]. Soft tissue simulation [3], algorithms for image registration [4], dose calculation [5], volume reconstruction from X-ray images [6], and the optimization of intensity-modulated radiation therapy plans [7] are examples of the numerous applications in medicine. Furthermore, DNA sequence alignment [8], molecular dynamics simulations [9], [10], [11], quantum chemistry [12], multipole calculations [13], density functional calculations [14], [15], air pollution modeling [16], time series analysis focused on financial markets [17], [18], and Monte Carlo simulations [19], [20], [21], [22] have benefited from GPU computing. For many applications, the accuracy can be comparable to that of a double-precision CPU implementation, such as in [23]; the latest generation of GPUs supports not only single-precision but also double-precision floating point operations. The adaptation of many computational methods is still in progress, e.g. the analysis of switching processes in financial markets [24], [25]. Unfortunately, not all algorithms can be ported efficiently onto a GPU architecture. In particular, serial algorithms are not suited for GPU computing (for an example see e.g. [26]).
Another crucial limitation is the lack of scalability, as current programs typically utilize only a single GPU. Because graphics processing hardware is targeted at a broad consumer market, namely the games industry, graphics cards can be produced at low cost. On the other hand, to keep production costs low, the global memory is not upgradable and is limited in size, for consumer cards as well as for Tesla GPUs. Using a recent consumer graphics card, we accelerated Monte Carlo simulations of the Ising model [22]. In [22], a 2D square spin lattice of up to 1024 × 1024 spins could be processed on a consumer GPU. The Ising model, a standard model of statistical physics, provides a simple microscopic description of ferromagnetism [27]. It was introduced to explain the ferromagnetic phase transition from the paramagnetic phase at high temperatures to the ferromagnetic phase below the Curie temperature Tc. A large variety of techniques and methods in statistical physics were originally formulated for the Ising model and later generalized and adapted to related models and problems [28]. Due to its simplicity, which is embodied by the possibility to use trivial parallelization approaches [29], the two-dimensional Ising model is well suited as a benchmark model: its properties are well studied [30], [31], [32] and many physical systems belong to the same universality class. The Ising model on a two-dimensional square lattice with no magnetic field was solved analytically by Lars Onsager in 1944 [33]. The critical temperature at which a second-order phase transition between an ordered and a disordered phase occurs can be determined analytically for the two-dimensional model (kB Tc/J = 2/ln(1 + √2) ≈ 2.269 [33]).
Here we show how lattice sizes can be extended substantially on one GPU device with 4 GB of global memory by using a memory-optimized encoding that stores each spin in a single bit. The resulting system size turns out to be a hard limit on a single device, since for larger systems spin data would have to be transferred between device and host memory. Such a memory transfer would effectively rule out all performance benefits of a GPU implementation. Using a multi-spin coding scheme [34], [35], [36], [37], the computation to memory access ratio can be improved, resulting in dramatically faster GPU performance.
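The one-bit-per-spin encoding can be sketched on the host side as follows. This is our own minimal illustration, not code from the paper; the helper names are hypothetical:

```c
#include <stdint.h>

/* Illustrative sketch (helper names are ours, not from the paper):
 * each spin s = ±1 occupies a single bit, with bit value 1 encoding
 * spin +1 and bit value 0 encoding spin -1. A 32-bit word thus holds
 * 32 spins, cutting memory use by a factor of 32 compared to storing
 * one integer per spin. */
static void set_spin(uint32_t *words, int i, int s) {
    if (s > 0) words[i / 32] |=  (1u << (i % 32));
    else       words[i / 32] &= ~(1u << (i % 32));
}

static int get_spin(const uint32_t *words, int i) {
    return ((words[i / 32] >> (i % 32)) & 1u) ? +1 : -1;
}
```

With 8 bits per byte, one bit per spin is what makes lattices of this scale fit into a single device's global memory at all.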
We show that an extension of this approach can be used successfully to handle Monte Carlo simulations of the Ising model in a multi-GPU environment, i.e. on GPU clusters. The scalability of this implementation is ensured by splitting the lattice into quadratic sublattices and placing them in the memory of different GPUs. Each GPU can thus perform the calculation for one sublattice in its own memory and pass the information about its borders on to the neighboring GPUs. Similar approaches have been used, e.g., for the calculation of density functionals [14].
This paper is organized as follows. A brief overview in Section 2 provides key facts of the GPU architecture in order to clarify implementation constraints for the following sections. Section 3 surveys the model definition and the finite-size scaling techniques used as proof of concept. In Sections 4 and 5, we describe details of the optimized reference CPU implementation and of our single-GPU approach based on multi-spin coding. The multi-GPU accelerated Monte Carlo simulation of the 2D Ising model is covered in Section 6. Overcoming the memory limitations of a single GPU with such a multi-GPU approach is of crucial importance, as GPU clusters are currently being set up in supercomputing facilities. Our conclusions are summarized in Section 7.
Section snippets
GPU device architecture
Simulating the Ising model at large system sizes requires substantial processing power. Physical and engineering obstacles in microprocessor design have resulted in flat performance growth for traditional single-core microprocessors. Graphics hardware, on the other hand, has become highly programmable; the fixed-function pipelines have been replaced by programmable shader units that can perform the same operation on many sets of data in parallel. For a comprehensive overview of recent
The two-dimensional Ising model
The Ising model is formulated on a two-dimensional square lattice, where each lattice site carries a spin with a value of either −1 or +1. The interaction of the spins is given by the Hamiltonian E = −J Σ⟨i,j⟩ si sj − H Σi si, where J is the coupling constant, the first sum runs over pairs of nearest-neighbor sites, and H denotes an external magnetic field, which we will set to zero here. The lattice is updated according to the Metropolis criterion [43]. For each step, the energy difference ΔE between two subsequent states a and b is calculated. The probability for the step to be accepted is given by min(1, exp(−ΔE/kBT)).
Optimized reference CPU implementation
For our optimized CPU reference implementation, we focus on a single spin-flip approach which performs well for large lattice sizes. Multi-spin coding refers to all techniques that store and process multiple spins in one unit of computer memory. For CPU implementations, update schemes have been developed that allow more than one spin to be processed with a single operation [34], [35], [36], [37]. We use a scheme which encodes 32 spins into one 32-bit integer in a linear fashion. The 32-bit type is
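The gain from processing many spins per operation can be illustrated with a bitwise sketch. This is our own simplified version, not the paper's actual update scheme: it flips only spins whose flip lowers the energy (three or four antiparallel neighbors), whereas the full Metropolis scheme additionally accepts energy-raising flips with probability exp(−ΔE/kBT):

```c
#include <stdint.h>

/* Illustrative multi-spin update (ours, not the paper's scheme): 32
 * spin lanes are processed with a handful of bitwise operations. Each
 * word holds one spin per bit; up/dn/lf/rt hold the corresponding
 * neighbor bits. XOR marks antiparallel neighbor pairs, and a bitwise
 * adder counts them per lane. Only energy-lowering flips (>= 3
 * antiparallel neighbors, i.e. dE < 0) are applied here; a full
 * Metropolis update would also accept dE > 0 flips stochastically. */
static uint32_t greedy_flip(uint32_t s, uint32_t up, uint32_t dn,
                            uint32_t lf, uint32_t rt) {
    uint32_t i0 = s ^ up, i1 = s ^ dn, i2 = s ^ lf, i3 = s ^ rt;
    uint32_t a0 = i0 ^ i1, a1 = i0 & i1;       /* i0 + i1 as 2-bit sum */
    uint32_t b0 = i2 ^ i3, b1 = i2 & i3;       /* i2 + i3 as 2-bit sum */
    uint32_t s0 = a0 ^ b0, c = a0 & b0;        /* low bit of total + carry */
    uint32_t s1 = a1 ^ b1 ^ c;                 /* second bit of the count */
    uint32_t s2 = (a1 & b1) | (c & (a1 ^ b1)); /* set when count == 4 */
    uint32_t ge3 = (s1 & s0) | s2;             /* lanes with count >= 3 */
    return s ^ ge3;                            /* flip those lanes */
}
```

A handful of logic operations update 32 lattice sites at once, which is the computation-to-memory-access advantage that multi-spin coding exploits.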
Problems arising from a straightforward porting
On graphics cards, memory access is very costly compared to operations on registers. The great advantage of multi-spin coding is that only one memory access is needed to obtain several spins at once. The CPU implementation could be ported to the GPU with a kernel that uses fewer than 16 registers. This allows an optimal occupancy of the GPU up to a maximum block size of 512 threads. Even though the update scheme presented in Section 4 performs vastly faster on the CPU compared to an implementation
Implementation
The general idea is to extend the quadratic lattice by putting multiple quadratic “meta-spin” lattices next to each other in a super-lattice (see Fig. 4a for a super-lattice) and let each lattice be handled by one of the installed GPUs. On the border of each lattice, at least one of the neighboring sites is located in the memory of another GPU (see Fig. 4b). For this reason, the spins at the borders of each lattice have to be transferred from one GPU to the GPU handling the adjacent
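The border exchange between adjacent sublattices can be sketched in plain C. Layout and names here are ours and purely illustrative; on the actual GPU cluster, these copies become device/host transfers and messages between nodes:

```c
#include <string.h>

#define SUB_N 4  /* interior rows per sublattice; illustrative value */

/* Sketch of the border exchange between two vertically adjacent
 * sublattices (layout and names are ours). Each sublattice stores
 * SUB_N interior rows plus two halo rows: row 0 mirrors the lower
 * neighbor's top border row, and row SUB_N + 1 mirrors the upper
 * neighbor's bottom border row. After the exchange, every site at a
 * sublattice border can read its off-device neighbors locally. */
static void exchange_borders(int lower[SUB_N + 2][SUB_N],
                             int upper[SUB_N + 2][SUB_N]) {
    /* lower's top border -> upper's bottom halo */
    memcpy(upper[0], lower[SUB_N], sizeof lower[SUB_N]);
    /* upper's bottom border -> lower's top halo */
    memcpy(lower[SUB_N + 1], upper[1], sizeof upper[1]);
}
```

Only the border rows travel between devices per sweep, so the communication volume grows with the sublattice edge length while the computation grows with its area, which is what makes the decomposition scale.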
Conclusion
We presented two major improvements over our previous work. By using multi-spin coding techniques, we improved the computation to memory access ratio of our calculations dramatically, resulting in better overall performance. On a single GPU, up to 7.9 spin flips per nanosecond are possible, 15 to 35 times faster than our highly optimized CPU version, depending on the implementation and the quality of the random numbers. The other improvement targets the utilization of GPU clusters, where the 2D
Acknowledgements
We thank K. Binder, D. Stauffer, and M. Tacke for fruitful discussions. GPU time was provided on the NEC Nehalem Cluster by the High Performance Computing Center Stuttgart (HLRS). We are grateful to T. Bönisch, B. Krischok, and H. Pöhlmann for their support at the HLRS. This work was financially supported by the Deutsche Forschungsgemeinschaft (DFG) and benefited from the Gutenberg-Akademie and the Schwerpunkt für rechnergestützte Forschungsmethoden in den Naturwissenschaften and the
References (49)
Parallel algorithm for solving Kepler's equation on graphics processing units: Application to analysis of Doppler exoplanet searches, New Astronomy (2009)
On modelling of anisotropic viscoelasticity for soft tissue simulation: Numerical solution and GPU execution, Medical Image Analysis (2009)
Optimizing data intensive GPGPU computations for DNA sequence alignment, Parallel Computing (2009)
General purpose molecular dynamics simulations fully implemented on graphics processing units, Journal of Computational Physics (2008)
Fast multipole methods on graphics processors, Journal of Computational Physics (2008)
Air pollution modelling using a graphics processing unit with CUDA, Computer Physics Communications (2010)
Accuracy and performance of graphics processors: A quantum Monte Carlo application case study, Parallel Computing (2009)
GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model, Journal of Computational Physics (2009)
Implementation and performance analysis of bridging Monte Carlo moves for off-lattice single chain polymers in globular states, Computer Physics Communications (2010)
Monte Carlo tests of renormalization-group predictions for critical phenomena in Ising models, Physics Reports (2001)
Kinetics of clusters in Ising models, Physica A
Parallelization of the Ising simulation, Computer Physics Communications
GPU accelerated radio astronomy signal convolution, Experimental Astronomy
Implementation and evaluation of various demons deformable image registration algorithms on a GPU, Physics in Medicine and Biology
GPU-based ultra-fast dose calculation using a finite size pencil beam model, Physics in Medicine and Biology
GPU-based volume reconstruction from very few arbitrarily aligned X-ray images, SIAM Journal on Scientific Computing
GPU-based ultrafast IMRT plan optimization, Physics in Medicine and Biology
Accelerating molecular dynamic simulation on graphics processing units, Journal of Computational Chemistry
Harvesting graphics power for MD simulations, Molecular Simulation
Quantum chemistry on graphical processing units. 1. Strategies for two-electron integral evaluation, Journal of Chemical Theory and Computation
Density functional theory calculation on many-cores hybrid central processing unit–graphic processing unit architectures, Journal of Chemical Physics
Accelerating density functional calculations with graphics processing unit, Journal of Chemical Theory and Computation
Accelerated fluctuation analysis by graphic cards and complex pattern formation in financial markets, New Journal of Physics
Fluctuation patterns in high-frequency financial asset returns, Europhysics Letters
☆ Source code of our implementations for GPU clusters will be published on http://www.tobiaspreis.de after acceptance. In addition, the code can be downloaded from the Google Code project multigpu-ising.