Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model☆
Introduction
Various scientific disciplines have profited from GPU computing in recent years and report impressive speedup factors compared to single-core Central Processing Unit (CPU) implementations. GPU stands for Graphics Processing Unit, a high-performance many-core processor that can be used to accelerate a wide range of applications. Significant savings of computing time have meanwhile been reported in a huge variety of fields: GPU acceleration can be used in astronomy [1] and radio astronomy [2]. Soft tissue simulation [3], algorithms for image registration [4], dose calculation [5], volume reconstruction from X-ray images [6], and the optimization of intensity-modulated radiation therapy plans [7] are examples of the numerous applications in medicine. Furthermore, DNA sequence alignment [8], molecular dynamics simulations [9], [10], [11], quantum chemistry [12], multipole calculations [13], density functional calculations [14], [15], air pollution modeling [16], time series analysis focused on financial markets [17], [18], and Monte Carlo simulations [19], [20], [21], [22] have benefited from GPU computing. For many applications, the accuracy can be comparable to that of a double-precision CPU implementation, such as in [23]; the latest generation of GPUs supports not only single-precision but also double-precision floating point operations. The adaptation of many computational methods is still in progress, e.g. the analysis of switching processes in financial markets [24], [25]. Unfortunately, not all algorithms can be ported efficiently onto a GPU architecture. In particular, serial algorithms are not suited for GPU computing (for an example see e.g. [26]).
Another crucial limitation is the lack of scalability, as current programs typically utilize only a single GPU. Because graphics processing hardware is targeted at a broad consumer market, namely the games industry, graphics cards can be produced at low cost. On the other hand, to keep production costs low, the global memory is not upgradable and is limited in size, for consumer cards as well as for Tesla GPUs. Using a recent consumer graphics card, we accelerated Monte Carlo simulations of the Ising model [22]. In [22], a 2D square spin lattice of up to 1024 × 1024 spins could be processed on a consumer GPU. The Ising model, a standard model of statistical physics, provides a simple microscopic description of ferromagnetism [27]. It was introduced to explain the ferromagnetic phase transition from the paramagnetic phase at high temperatures to the ferromagnetic phase below the Curie temperature Tc. A large variety of techniques and methods in statistical physics were originally formulated for the Ising model and later generalized and adapted to related models and problems [28]. Due to its simplicity, which is embodied by the possibility to use trivial parallelization approaches [29], the two-dimensional Ising model is well suited as a benchmark model: its properties are well studied [30], [31], [32] and many physical systems belong to the same universality class. The Ising model on a two-dimensional square lattice with no magnetic field was solved analytically by Lars Onsager in 1944 [33]. The critical temperature at which a second-order phase transition between an ordered and a disordered phase occurs can be determined analytically for the two-dimensional model (kB Tc/J = 2/ln(1 + √2) ≈ 2.269 [33]).
Here we show how lattice sizes can be extended substantially on one GPU device with 4 GB of global memory by using a memory-optimized encoding that stores each spin in a single bit. The resulting system size turns out to be a hard limit on a single device, since for larger systems spin data would have to be transferred between device and host memory. Such a memory transfer would effectively rule out all performance benefits of a GPU implementation. Using a multi-spin coding scheme [34], [35], [36], [37], the computation to memory access ratio can be improved, resulting in dramatically faster GPU performance.
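The one-bit-per-spin encoding can be sketched on the host side as follows. This is our own minimal illustration, not code from the paper; the helper names are hypothetical:

```c
#include <stdint.h>

/* Illustrative sketch (helper names are ours, not from the paper):
 * each spin s = ±1 occupies a single bit, with bit value 1 encoding
 * spin +1 and bit value 0 encoding spin -1. A 32-bit word thus holds
 * 32 spins, cutting memory use by a factor of 32 compared to storing
 * one integer per spin. */
static void set_spin(uint32_t *words, int i, int s) {
    if (s > 0) words[i / 32] |=  (1u << (i % 32));
    else       words[i / 32] &= ~(1u << (i % 32));
}

static int get_spin(const uint32_t *words, int i) {
    return ((words[i / 32] >> (i % 32)) & 1u) ? +1 : -1;
}
```

With 8 bits per byte, one bit per spin is what makes lattices of this scale fit into a single device's global memory at all.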
We show that an extension of this approach can be used successfully to handle Monte Carlo simulations of the Ising model in a multi-GPU environment, i.e. on GPU clusters. The scalability of this implementation is ensured by splitting the lattice into quadratic sublattices and placing them in the memory of different GPUs. Each GPU can thus perform the calculation for one sublattice in its own memory and pass the information about its borders on to the neighboring GPUs. Similar approaches have been used, e.g., for the calculation of density functionals [14].
This paper is organized as follows. A brief overview in Section 2 provides key facts of the GPU architecture in order to clarify implementation constraints for the following sections. Section 3 surveys the model definition and the finite-size scaling techniques used as proof of concept. In Sections 4 and 5, we describe details of the optimized reference CPU implementation and of our single-GPU approach based on multi-spin coding. The multi-GPU accelerated Monte Carlo simulation of the 2D Ising model is covered in Section 6. Overcoming the memory limitations of a single GPU with such a multi-GPU approach is of crucial importance, as GPU clusters are currently being set up in supercomputing facilities. Our conclusions are summarized in Section 7.
Section snippets
GPU device architecture
Simulating the Ising model at large system sizes requires substantial processing power. Physical and engineering obstacles in microprocessor design have resulted in flat performance growth for traditional single-core microprocessors. Graphics hardware, on the other hand, has become highly programmable; the fixed-function pipelines have been replaced by programmable shader units that can perform the same operation on many sets of data in parallel. For a comprehensive overview of recent
The two-dimensional Ising model
The Ising model is formulated on a two-dimensional square lattice, where each lattice site carries a spin with a value of either −1 or +1. The interaction of the spins is given by the Hamiltonian E = −J Σ⟨i,j⟩ si sj − H Σi si, where J is the coupling constant, the first sum runs over pairs of nearest-neighbor sites, and H denotes an external magnetic field, which we will set to zero here. The lattice is updated according to the Metropolis criterion [43]. For each step, the energy difference ΔE between two subsequent states a and b is calculated. The probability for the step to be accepted is given by min(1, exp(−ΔE/kBT)).
Optimized reference CPU implementation
For our optimized CPU reference implementation, we focus on a single spin-flip approach which performs well for large lattice sizes. Multi-spin coding refers to all techniques that store and process multiple spins in one unit of computer memory. For CPU implementations, update schemes have been developed that allow more than one spin to be processed with a single operation [34], [35], [36], [37]. We use a scheme which encodes 32 spins into one 32-bit integer in a linear fashion. The 32-bit type is
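The gain from processing many spins per operation can be illustrated with a bitwise sketch. This is our own simplified version, not the paper's actual update scheme: it flips only spins whose flip lowers the energy (three or four antiparallel neighbors), whereas the full Metropolis scheme additionally accepts energy-raising flips with probability exp(−ΔE/kBT):

```c
#include <stdint.h>

/* Illustrative multi-spin update (ours, not the paper's scheme): 32
 * spin lanes are processed with a handful of bitwise operations. Each
 * word holds one spin per bit; up/dn/lf/rt hold the corresponding
 * neighbor bits. XOR marks antiparallel neighbor pairs, and a bitwise
 * adder counts them per lane. Only energy-lowering flips (>= 3
 * antiparallel neighbors, i.e. dE < 0) are applied here; a full
 * Metropolis update would also accept dE > 0 flips stochastically. */
static uint32_t greedy_flip(uint32_t s, uint32_t up, uint32_t dn,
                            uint32_t lf, uint32_t rt) {
    uint32_t i0 = s ^ up, i1 = s ^ dn, i2 = s ^ lf, i3 = s ^ rt;
    uint32_t a0 = i0 ^ i1, a1 = i0 & i1;       /* i0 + i1 as 2-bit sum */
    uint32_t b0 = i2 ^ i3, b1 = i2 & i3;       /* i2 + i3 as 2-bit sum */
    uint32_t s0 = a0 ^ b0, c = a0 & b0;        /* low bit of total + carry */
    uint32_t s1 = a1 ^ b1 ^ c;                 /* second bit of the count */
    uint32_t s2 = (a1 & b1) | (c & (a1 ^ b1)); /* set when count == 4 */
    uint32_t ge3 = (s1 & s0) | s2;             /* lanes with count >= 3 */
    return s ^ ge3;                            /* flip those lanes */
}
```

A handful of logic operations update 32 lattice sites at once, which is the computation-to-memory-access advantage that multi-spin coding exploits.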
Problems arising from a straightforward porting
On graphics cards, memory access is very costly compared to operations on registers. The great advantage of multi-spin coding is that only one memory access is needed to obtain several spins at once. The CPU implementation could be ported to the GPU with a kernel that uses fewer than 16 registers. This allows an optimal occupancy of the GPU up to a maximum block size of 512 threads. Even though the update scheme presented in Section 4 performs vastly faster on the CPU compared to an implementation
Implementation
The general idea is to extend the quadratic lattice by putting multiple quadratic “meta-spin” lattices next to each other in a super-lattice (see Fig. 4a for a super-lattice) and let each lattice be handled by one of the installed GPUs. On the border of each lattice, at least one of the neighboring sites is located in the memory of another GPU (see Fig. 4b). For this reason, the spins at the borders of each lattice have to be transferred from one GPU to the GPU handling the adjacent
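The border exchange between adjacent sublattices can be sketched in plain C. Layout and names here are ours and purely illustrative; on the actual GPU cluster, these copies become device/host transfers and messages between nodes:

```c
#include <string.h>

#define SUB_N 4  /* interior rows per sublattice; illustrative value */

/* Sketch of the border exchange between two vertically adjacent
 * sublattices (layout and names are ours). Each sublattice stores
 * SUB_N interior rows plus two halo rows: row 0 mirrors the lower
 * neighbor's top border row, and row SUB_N + 1 mirrors the upper
 * neighbor's bottom border row. After the exchange, every site at a
 * sublattice border can read its off-device neighbors locally. */
static void exchange_borders(int lower[SUB_N + 2][SUB_N],
                             int upper[SUB_N + 2][SUB_N]) {
    /* lower's top border -> upper's bottom halo */
    memcpy(upper[0], lower[SUB_N], sizeof lower[SUB_N]);
    /* upper's bottom border -> lower's top halo */
    memcpy(lower[SUB_N + 1], upper[1], sizeof upper[1]);
}
```

Only the border rows travel between devices per sweep, so the communication volume grows with the sublattice edge length while the computation grows with its area, which is what makes the decomposition scale.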
Conclusion
We presented two major improvements over our previous work. By using multi-spin coding techniques, we improved the computation to memory access ratio of our calculations dramatically, resulting in better overall performance. On a single GPU, up to 7.9 spin flips per nanosecond are possible, 15 to 35 times faster than our highly optimized CPU version, depending on the implementation and the quality of the random numbers. The other improvement targets the utilization of GPU clusters, where the 2D
Acknowledgements
We thank K. Binder, D. Stauffer, and M. Tacke for fruitful discussions. GPU time was provided on the NEC Nehalem Cluster by the High Performance Computing Center Stuttgart (HLRS). We are grateful to T. Bönisch, B. Krischok, and H. Pöhlmann for their support at the HLRS. This work was financially supported by the Deutsche Forschungsgemeinschaft (DFG) and benefited from the Gutenberg-Akademie and the Schwerpunkt für rechnergestützte Forschungsmethoden in den Naturwissenschaften and the
References (49)
Parallel algorithm for solving Kepler's equation on graphics processing units: Application to analysis of Doppler exoplanet searches, New Astronomy (2009)
On modelling of anisotropic viscoelasticity for soft tissue simulation: Numerical solution and GPU execution, Medical Image Analysis (2009)
Optimizing data intensive GPGPU computations for DNA sequence alignment, Parallel Computing (2009)
General purpose molecular dynamics simulations fully implemented on graphics processing units, Journal of Computational Physics (2008)
Fast multipole methods on graphics processors, Journal of Computational Physics (2008)
Air pollution modelling using a graphics processing unit with CUDA, Computer Physics Communications (2010)
Accuracy and performance of graphics processors: A quantum Monte Carlo application case study, Parallel Computing (2009)
GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model, Journal of Computational Physics (2009)
Implementation and performance analysis of bridging Monte Carlo moves for off-lattice single chain polymers in globular states, Computer Physics Communications (2010)
Monte Carlo tests of renormalization-group predictions for critical phenomena in Ising models, Physics Reports (2001)
Kinetics of clusters in Ising models, Physica A
Parallelization of the Ising simulation, Computer Physics Communications
GPU accelerated radio astronomy signal convolution, Experimental Astronomy
Implementation and evaluation of various demons deformable image registration algorithms on a GPU, Physics in Medicine and Biology
GPU-based ultra-fast dose calculation using a finite size pencil beam model, Physics in Medicine and Biology
GPU-based volume reconstruction from very few arbitrarily aligned X-ray images, SIAM Journal on Scientific Computing
GPU-based ultrafast IMRT plan optimization, Physics in Medicine and Biology
Accelerating molecular dynamic simulation on graphics processing units, Journal of Computational Chemistry
Harvesting graphics power for MD simulations, Molecular Simulation
Quantum chemistry on graphical processing units. 1. Strategies for two-electron integral evaluation, Journal of Chemical Theory and Computation
Density functional theory calculation on many-cores hybrid central processing unit–graphic processing unit architectures, Journal of Chemical Physics
Accelerating density functional calculations with graphics processing unit, Journal of Chemical Theory and Computation
Accelerated fluctuation analysis by graphic cards and complex pattern formation in financial markets, New Journal of Physics
Fluctuation patterns in high-frequency financial asset returns, Europhysics Letters
☆ Source code of our implementations for GPU clusters will be published on http://www.tobiaspreis.de after acceptance. In addition, the code can be downloaded from the Google Code project multigpu-ising.