Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model

https://doi.org/10.1016/j.cpc.2010.05.005

Abstract

A modern graphics processing unit (GPU) is able to perform massively parallel scientific computations at low cost. We extend our implementation of the checkerboard algorithm for the two-dimensional Ising model [T. Preis et al., Journal of Computational Physics 228 (2009) 4468–4477] in order to overcome the memory limitations of a single GPU, which enables us to simulate significantly larger systems. Using multi-spin coding techniques, we are able to accelerate simulations on a single GPU by factors of up to 35 compared to an optimized implementation on a single central processing unit (CPU) core that also employs multi-spin coding. By combining the Compute Unified Device Architecture (CUDA) with the Message Passing Interface (MPI) on the CPU level, a single Ising lattice can be updated by a cluster of GPUs in parallel. For large systems, the computation time scales nearly linearly with the number of GPUs used. As a proof of concept we reproduce the critical temperature of the 2D Ising model using finite size scaling techniques.

Introduction

Various scientific disciplines have profited from GPU computing in recent years, reporting impressive speedup factors in comparison to single central processing unit (CPU) core implementations. Graphics processing units (GPUs) are high-performance many-core processors that can be used to accelerate a wide range of applications. Significant savings of computing time have meanwhile been reported in a huge variety of fields: GPU acceleration can be used in astronomy [1] and radio astronomy [2]. Soft tissue simulation [3], algorithms for image registration [4], dose calculation [5], volume reconstruction from X-ray images [6], and the optimization of intensity-modulated radiation therapy plans [7] are examples of the numerous applications in medicine. Furthermore, DNA sequence alignment [8], molecular dynamics simulations [9], [10], [11], quantum chemistry [12], multipole calculations [13], density functional calculations [14], [15], air pollution modeling [16], time series analysis focused on financial markets [17], [18], and Monte Carlo simulations [19], [20], [21], [22] have benefited from GPU computing. For many applications, the accuracy can be comparable to that of a double-precision CPU implementation, such as in [23]; the latest generation of GPUs supports not only single-precision but also double-precision floating point operations. The adaptation of many computational methods is still in progress, e.g. the analysis of switching processes in financial markets [24], [25]. Unfortunately, not all algorithms can be ported efficiently onto a GPU architecture. In particular, inherently serial algorithms are not suited for GPU computing (for an example see e.g. [26]).

Another crucial limitation is the lack of scalability, as current programs typically utilize only a single GPU. Because graphics processing hardware is targeted at a broad consumer market (the games industry), graphics cards can be produced at low cost. On the other hand, to keep production costs low, the global memory is not upgradable and is typically limited to 1 GB for consumer cards and 4 GB for Tesla GPUs. Using a recent consumer graphics card, we accelerated Monte Carlo simulations of the Ising model [22]. In [22], a 2D square spin lattice of up to 1024×1024 spins could be processed on a consumer GPU. The Ising model, a standard model of statistical physics, provides a simple microscopic description of ferromagnetism [27]. It was introduced to explain the ferromagnetic phase transition from the paramagnetic phase at high temperatures to the ferromagnetic phase below the Curie temperature TC. A large variety of techniques and methods in statistical physics were originally formulated for the Ising model and later generalized and adapted to related models and problems [28]. Due to its simplicity, which is embodied, e.g., by the possibility of trivial parallelization [29], the two-dimensional Ising model is well suited as a benchmark model: its properties are well studied [30], [31], [32] and many physical systems belong to the same universality class. The Ising model on a two-dimensional square lattice with no magnetic field was solved analytically by Lars Onsager in 1944 [33]. The critical temperature, at which a second-order phase transition between an ordered and a disordered phase occurs, is known exactly for the two-dimensional model (TC ≈ 2.269185 in units of J/kB [33]).
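The numerical value of the critical temperature quoted here follows from Onsager's exact solution; as a short check:

```latex
% Onsager's exact critical point of the square-lattice Ising model
% follows from the self-duality condition  sinh^2(2J / k_B T_c) = 1:
\begin{align}
  \sinh\!\left(\frac{2J}{k_B T_c}\right) = 1
  \quad\Longrightarrow\quad
  k_B T_c = \frac{2J}{\ln\!\left(1 + \sqrt{2}\right)} \approx 2.269185\,J .
\end{align}
```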

Here we show how lattice sizes can be extended up to 100,000×100,000 spins on one GPU device with 4 GB of global memory using a memory-optimized encoding of the spins: one bit per spin. This number of spins turns out to be a hard limit on a single device, since for larger system sizes spin data would have to be transferred between device and host memory. Such memory transfers would effectively cancel out all performance benefits of a GPU implementation. Using a multi-spin coding scheme [34], [35], [36], [37], the computation-to-memory-access ratio can be improved, resulting in dramatically faster GPU performance.
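The one-bit-per-spin idea can be made concrete with a small sketch. This is an illustrative encoding, not necessarily the paper's exact memory layout: bit k of a 32-bit word stores one spin, with bit value 0 representing spin −1 and bit value 1 representing spin +1, so a flip is a single XOR.

```c
#include <stdint.h>

/* Hypothetical one-bit-per-spin encoding: spin i lives in bit (i % 32)
 * of word (i / 32); bit 0 -> spin -1, bit 1 -> spin +1. */

static int get_spin(const uint32_t *words, long i) {
    uint32_t bit = (words[i / 32] >> (i % 32)) & 1u;
    return bit ? +1 : -1;
}

static void flip_spin(uint32_t *words, long i) {
    words[i / 32] ^= 1u << (i % 32);   /* XOR toggles exactly one spin */
}
```

With this encoding a 100,000×100,000 lattice needs 10^10 bits, i.e. about 1.25 GB, which is how such system sizes fit into 4 GB of global memory.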

We show that an extension of this approach can be used successfully to handle Monte Carlo simulations of the Ising model in a multi-GPU environment, i.e., on GPU clusters. The scalability of this implementation is ensured by splitting the lattice into square sublattices and placing them in the memory of different GPUs. Each GPU can then perform the calculation for one sublattice in its memory and pass the information about its borders on to the neighboring GPUs. Similar approaches have been used, e.g., for the calculation of density functionals [14].
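The decomposition just described amounts to a simple ownership map from global lattice coordinates to a device and its local coordinates. A minimal sketch, assuming an N×N lattice split into a G×G grid of equal square sublattices with row-major device numbering (the names `Owner` and `owner_of` are illustrative, not from the paper):

```c
/* Map a global site (x, y) of an n_global x n_global lattice, decomposed
 * into a g x g super-lattice of square sublattices (one per GPU), to the
 * owning device id and the site's local coordinates. */
typedef struct { int gpu; int lx; int ly; } Owner;

static Owner owner_of(int x, int y, int n_global, int g) {
    int n_local = n_global / g;                 /* sublattice edge length */
    Owner o;
    o.gpu = (y / n_local) * g + (x / n_local);  /* row-major device id   */
    o.lx  = x % n_local;
    o.ly  = y % n_local;
    return o;
}
```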

This paper is organized as follows. Section 2 provides a brief overview of key facts of the GPU architecture in order to clarify implementation constraints for the following sections. Section 3 surveys the model definition and the finite size scaling techniques used as proof of concept. In Sections 4 and 5, we describe details of the reference CPU implementation and of our single-GPU approach based on multi-spin coding. The multi-GPU accelerated Monte Carlo simulation of the 2D Ising model is covered in Section 6. Overcoming the memory limitations of a single GPU with such a multi-GPU approach is of crucial importance, as GPU clusters are currently being set up in supercomputing facilities. Our conclusions are summarized in Section 7.


GPU device architecture

Simulating the Ising model at large system sizes requires substantial processing power. Physical and engineering obstacles in microprocessor design have resulted in flat performance growth for traditional single-core microprocessors. Graphics hardware, on the other hand, has become highly programmable: the fixed-function pipelines have been replaced by programmable shader units that can perform the same operation on many sets of data in parallel. For a comprehensive overview of recent

The two-dimensional Ising model

The Ising model is formulated on a two-dimensional square lattice, where each lattice site i carries a spin Si with a value of either −1 or +1. The interaction of the spins is given by the Hamiltonian

H = −J Σ⟨i,j⟩ Si Sj − H Σi Si,

where the first sum runs over all pairs of nearest-neighbor sites and H denotes an external magnetic field, which we will set to zero here. The lattice is updated according to the Metropolis criterion [43]. For each step, the energy difference ΔH = Ha − Hb between two subsequent states a and b is calculated. The probability for the step to

Optimized reference CPU implementation

For our optimized CPU reference implementation, we focus on a single-spin-flip approach, which performs well for large lattice sizes. Multi-spin coding refers to all techniques that store and process multiple spins in one unit of computer memory. For CPU implementations, update schemes have been developed that allow more than one spin to be processed with a single operation [34], [35], [36], [37]. We use a scheme which encodes 32 spins into one 32-bit integer in a linear fashion. The 32-bit type is
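The appeal of processing 32 encoded spins with single operations can be seen in a small sketch. This is an illustration of the bitwise idea, not the paper's exact update scheme; it assumes the linear one-bit-per-spin encoding and uses the GCC/Clang builtin `__builtin_popcount`:

```c
#include <stdint.h>

/* Multi-spin coding sketch: 32 spins packed into one uint32_t, one bit
 * per spin.  s ^ right marks, per bit, spins that are anti-parallel to
 * their right neighbor (periodic in the word), so a single XOR plus a
 * population count yields the number of unsatisfied horizontal bonds
 * for 32 sites at once. */
static int broken_horizontal_bonds(uint32_t s) {
    uint32_t right = (s >> 1) | (s << 31);  /* neighbor spins, shifted in */
    uint32_t anti  = s ^ right;             /* 1 where neighbors differ   */
    return __builtin_popcount(anti);
}
```

One memory access fetches 32 spins, after which the bond energies follow from register-only bit operations; this is the computation-to-memory-access improvement the text refers to.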

Problems arising from a straightforward porting

On graphics cards, memory access is very costly compared to operations on registers. The great advantage of multi-spin coding is that only one memory access is needed to obtain several spins at once. The CPU implementation could be ported to the GPU with a kernel that uses fewer than 16 registers. This allows an optimal occupancy of the GPU up to a maximum block size of 512 threads. Even though the update scheme presented in Section 4 performs vastly faster on the CPU compared to an implementation

Implementation

The general idea is to extend the square lattice by putting multiple square “meta-spin” lattices next to each other in a super-lattice (see Fig. 4a for a 2×2 super-lattice) and to let each lattice be handled by one of the installed GPUs. On the border of each lattice, at least one of the neighboring sites is located in the memory of another GPU (see Fig. 4b). For this reason, the spins at the borders of each lattice have to be transferred from one GPU to the GPU handling the adjacent
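The border transfer described above is a halo (ghost-row) exchange. A minimal host-side sketch for two vertically adjacent sublattices, where plain `memcpy` stands in for the cudaMemcpy/MPI transfers used between real GPUs, and the ghost-row layout (`NL` interior rows plus one ghost row above and below) is an assumption for illustration:

```c
#include <stdint.h>
#include <string.h>

#define NL 4   /* interior rows per sublattice; illustrative size */

/* Halo-exchange sketch: each sublattice buffer holds NL interior rows of
 * spin words plus one ghost row above (row 0) and below (row NL + 1).
 * After each sweep, the top interior row of the lower sublattice is
 * copied into the bottom ghost row of the upper one, and vice versa. */
static void exchange_rows(uint32_t upper[NL + 2][NL],
                          uint32_t lower[NL + 2][NL]) {
    memcpy(upper[NL + 1], lower[1],  sizeof(upper[0]));  /* lower -> upper */
    memcpy(lower[0],      upper[NL], sizeof(lower[0]));  /* upper -> lower */
}
```

Because only the O(N) border rows travel between devices while the O(N²) interior is updated locally, the communication cost becomes negligible for large sublattices, which is consistent with the near-linear scaling reported in the abstract.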

Conclusion

We presented two major improvements over our previous work. By using multi-spin coding techniques, we improved the computation-to-memory-access ratio of our calculations dramatically, resulting in better overall performance. On a single GPU, up to 7.9 spin flips per nanosecond are possible, 15 to 35 times faster than our highly optimized CPU version, depending on the implementation and the quality of the random numbers. The other improvement targets the utilization of GPU clusters, where the 2D

Acknowledgements

We thank K. Binder, D. Stauffer, and M. Tacke for fruitful discussions. GPU time was provided on the NEC Nehalem Cluster by the High Performance Computing Center Stuttgart (HLRS). We are grateful to T. Bönisch, B. Krischok, and H. Pöhlmann for their support at the HLRS. This work was financially supported by the Deutsche Forschungsgemeinschaft (DFG) and benefited from the Gutenberg-Akademie and the Schwerpunkt für rechnergestützte Forschungsmethoden in den Naturwissenschaften and the

References (49)

  • D. Stauffer, Kinetics of clusters in Ising-models, Physica A (1992)
  • R. Zorn et al., Parallelization of the Ising simulation, Computer Physics Communications (1981)
  • C. Harris et al., GPU accelerated radio astronomy signal convolution, Experimental Astronomy (2008)
  • X. Gu et al., Implementation and evaluation of various demons deformable image registration algorithms on a GPU, Physics in Medicine and Biology (2009)
  • X. Gu et al., GPU-based ultra-fast dose calculation using a finite size pencil beam model, Physics in Medicine and Biology (2009)
  • D. Gross et al., GPU-based volume reconstruction from very few arbitrarily aligned X-ray images, SIAM Journal on Scientific Computing (2009)
  • C.H. Men et al., GPU-based ultrafast IMRT plan optimization, Physics in Medicine and Biology (2009)
  • M.S. Friedrichs et al., Accelerating molecular dynamic simulation on graphics processing units, Journal of Computational Chemistry (2009)
  • J.A. van Meel et al., Harvesting graphics power for MD simulations, Molecular Simulation (2008)
  • I.S. Ufimtsev et al., Quantum chemistry on graphical processing units. 1. Strategies for two-electron integral evaluation, Journal of Chemical Theory and Computation (2009)
  • L. Genovese et al., Density functional theory calculation on many-cores hybrid central processing unit–graphic processing unit architectures, Journal of Chemical Physics (2009)
  • K. Yasuda, Accelerating density functional calculations with graphics processing unit, Journal of Chemical Theory and Computation (2008)
  • T. Preis et al., Accelerated fluctuation analysis by graphic cards and complex pattern formation in financial markets, New Journal of Physics (2009)
  • T. Preis et al., Fluctuation patterns in high-frequency financial asset returns, Europhysics Letters (2008)

    Source code of our implementations for GPU clusters will be published on http://www.tobiaspreis.de after acceptance. In addition, the code can be downloaded from the Google Code project multigpu-ising.
