GPU-accelerated Gibbs ensemble Monte Carlo simulations of Lennard-Jonesium

https://doi.org/10.1016/j.cpc.2013.06.020

Abstract

This work describes an implementation of canonical and Gibbs ensemble Monte Carlo simulations on graphics processing units (GPUs). The pair-wise energy calculations, which consume the majority of the computational effort, are parallelized using the energetic decomposition algorithm. While energetic decomposition is relatively inefficient for traditional CPU-bound codes, the algorithm is ideally suited to the architecture of the GPU. The performance of the CPU and GPU codes is assessed for a variety of CPU and GPU combinations for systems containing between 512 and 131,072 particles. For a system of 131,072 particles, the GPU-enabled canonical and Gibbs ensemble codes were 10.3 and 29.1 times faster (GTX 480 GPU vs. i5-2500K CPU), respectively, than an optimized serial CPU-bound code. Due to the overhead of memory transfers from system RAM to the GPU, the CPU code was slightly faster than the GPU code for simulations containing fewer than 600 particles. The critical temperature T_c* = 1.312(2) and critical density ρ_c* = 0.316(3) were determined for the tail-corrected Lennard-Jones potential from simulations of 10,000-particle systems, and found to be in exact agreement with prior mixed-field finite-size scaling calculations [J.J. Potoff, A.Z. Panagiotopoulos, J. Chem. Phys. 109 (1998) 10914].

Introduction

Over the last 50 years, Monte Carlo simulations have been used extensively to provide fundamental insight regarding the relationship between atomic-level interactions and macro-scale phenomena. In the late 1980s through the 1990s, a proliferation of new algorithms, including Gibbs ensemble [1], [2], configurational-bias [3], [4], [5], [6], [7], Gibbs–Duhem integration [8], and histogram-reweighting techniques [9], [10], [11], [12], greatly simplified the determination of vapor–liquid coexistence curves. The Gibbs ensemble method is particularly versatile; in addition to vapor–liquid coexistence, it has been used to determine adsorption equilibria in single- [13], [14], [15], [16] and multi-component systems [17], [18], membrane equilibria, and solid–fluid [19] and solid–vapor [20], [21] equilibria. Unlike molecular dynamics, for which highly optimized parallel codes are widely available to the public [22], [23], [24], [25], [26], most Gibbs ensemble Monte Carlo (GEMC) calculations have been performed with codes that execute on a single CPU core. This limits calculations to relatively small system sizes (<5000 atoms) and precludes the use of GEMC for the simulation of equilibria in most biological systems.

There have been numerous attempts to parallelize Monte Carlo simulations based on three core algorithms: embarrassingly parallel, farm, and domain decomposition. The embarrassingly parallel approach is the simplest to implement; a replica of the system is placed on each processor core, and each instance runs independently without any communication between processors. For systems with short equilibration periods, parallel efficiencies near 100% are possible; for systems with long equilibration periods, however, this method quickly becomes inefficient. In the farm algorithm [27], also known as energetic decomposition, the energy calculation is distributed over the available processing cores. Some efforts have shown improved performance with this method [28], but the farm algorithm has generally been avoided due to the communications overhead noted in early comparative studies between energetic decomposition and embarrassingly parallel algorithms [29]. Domain decomposition is an example of a modified Markov chain algorithm [30], and leverages the fact that in large systems, particles outside the range of interaction of other particles may be moved simultaneously. The simulation box is split into cells, and the interactions within each cell are calculated on a single CPU core, which reduces the amount of inter-processor communication relative to the farm method. In early work, domain decomposition was dismissed as inefficient, since its performance was worse than that of the embarrassingly parallel method [31]. More recently, however, it has been shown that a sequential updating algorithm can improve the parallel efficiency of domain decomposition significantly [32], [33], [34]. Additional modified Markov chain algorithms have been proposed in which displacement of all particles, or a subset of particles, is attempted simultaneously [35]. These methods are referred to as hybrid Monte Carlo methods because new trial locations are generated from particle velocities, as in Newtonian-physics-based molecular dynamics simulations. The efficiency of hybrid Monte Carlo methods tends to be dictated by the phase [36] and the type of atoms or molecules studied [37].
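To make the farm (energetic decomposition) approach concrete, the sketch below shows how the energy change of a single-particle displacement could be evaluated on a GPU: each thread computes one pair interaction involving the moved particle, and a shared-memory reduction accumulates the partial sums within each block. This is an illustrative CUDA kernel written for this discussion, not the authors' code; the kernel name, argument list and launch conventions are hypothetical.

    // Hypothetical sketch of energetic (farm) decomposition for one displacement
    // move: thread j computes the pair energy between the trial position and
    // particle j; a shared-memory reduction accumulates per-block partial sums.
    // Assumes blockDim.x is a power of two and dynamic shared memory of
    // blockDim.x * sizeof(double) bytes is supplied at launch.
    __global__ void pairEnergyWithParticle(const double3 *pos, int n, int moved,
                                           double3 trial, double sigma2,
                                           double epsilon, double rcut2,
                                           double3 box, double *blockSums)
    {
        extern __shared__ double cache[];
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        double u = 0.0;

        if (j < n && j != moved) {
            // minimum-image separation between the trial position and particle j
            double dx = trial.x - pos[j].x;  dx -= box.x * rint(dx / box.x);
            double dy = trial.y - pos[j].y;  dy -= box.y * rint(dy / box.y);
            double dz = trial.z - pos[j].z;  dz -= box.z * rint(dz / box.z);
            double r2 = dx * dx + dy * dy + dz * dz;
            if (r2 < rcut2) {
                double s6 = sigma2 / r2;
                s6 = s6 * s6 * s6;                       // (sigma/r)^6
                u = 4.0 * epsilon * (s6 * s6 - s6);      // Lennard-Jones pair energy
            }
        }

        // block-level reduction of the partial energies
        cache[threadIdx.x] = u;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                cache[threadIdx.x] += cache[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = cache[0];
    }

On the host side, the per-block sums would be reduced to obtain the interaction energy of the trial position, the same kernel (or a cached value) supplies the energy at the old position, and the Metropolis criterion decides acceptance; volume-exchange and swap moves can reuse the same pair-energy machinery.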

In addition to general methods of parallelizing atomistic Monte Carlo simulations, some specialized ensemble-specific techniques have been demonstrated. Chen and Hirtzel proposed a macro-state Markov chain model (MSMCM) for parallelization of simulations in the grand canonical ensemble [38], [39], [40]. Spatial updating algorithms, combined with sequential updating and domain decomposition, have been used to improve the efficiency of parallel Monte Carlo simulations in the grand canonical ensemble [34]. Parallel grand canonical Monte Carlo simulations have also been combined with molecular dynamics for the simulation of diffusion in porous materials [41], [42]. Additional parallelization efforts have focused on the configurational-bias algorithm, which is crucial to achieving reasonable acceptance rates for molecule transfers in grand canonical and Gibbs ensemble Monte Carlo simulations. To improve the success rate for growing chain molecules in a dense system, Esselink et al. proposed evaluating multiple trial locations for the first bead in parallel [43]. This methodology for particle exchange was combined with hybrid Monte Carlo moves for particle displacement in Gibbs ensemble Monte Carlo, where a parallel efficiency of approximately 80% was observed [36], [43].

In recent years the field of parallel computing has shifted its focus from the CPU to a new kind of hardware, the graphics processing unit (GPU), which was originally created to satisfy the demand for increasing realism in virtual video game environments [44]. GPUs have outpaced CPUs in terms of computational power per unit of electrical power consumed and per unit of hardware cost. GPUs typically devote more transistors to fast parallel math processing than CPUs, at the expense of flexibility. Compared to the relatively small collection of complex processing cores found in a modern CPU, GPUs are composed of a collection of simpler processors (streaming multiprocessors, or "SMs", each composed of dozens of shader core subunits) capable of massively parallel math operations. GPUs have several types of volatile storage available for simulation variables: DRAM, constant cache memory, global cache memory, shared memory and registers. DRAM has the highest latency, while registers offer the lowest latency. However, the DRAM bank has significantly larger capacity than the registers or caches; hence, part of the challenge in developing efficient GPU algorithms is optimizing how variables are distributed among these memory spaces and how memory throughput is used.
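As a small illustration of this memory hierarchy, a sketch (with assumed, hypothetical names, not the authors' implementation) might place small read-only quantities such as the Lennard-Jones parameters and cutoff in constant memory, where they are cached and broadcast cheaply to every thread, while the large coordinate arrays reside in global DRAM:

    // Hypothetical sketch of variable placement on the GPU: small, read-only
    // simulation parameters live in constant memory; large coordinate arrays
    // live in global DRAM; per-block scratch space would use shared memory.
    #include <cuda_runtime.h>

    struct SimParams {
        double sigma2;    // sigma squared
        double epsilon;   // well depth
        double rcut2;     // cutoff radius squared
    };

    __constant__ SimParams d_params;   // constant memory, read by every thread

    int main()
    {
        SimParams h_params = {1.0, 1.0, 2.5 * 2.5};
        cudaMemcpyToSymbol(d_params, &h_params, sizeof(SimParams));

        const int n = 131072;
        double3 *d_pos;                                // global DRAM: coordinates
        cudaMalloc(&d_pos, n * sizeof(double3));
        // ... copy coordinates to the device, launch energy kernels, etc.
        cudaFree(d_pos);
        return 0;
    }

Per-block intermediate results, such as the partial energy sums in the reduction sketched earlier, would be staged in shared memory.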

Given the inherent hardware differences between multi-core CPUs and GPUs, the creation of optimized GPU algorithms requires a fundamental reexamination of parallelization methods. Some methods that appear less viable for CPU parallelization may perform well on the GPU, and the converse may also be true. Additionally, there is an opportunity to develop new algorithms and techniques that were not previously considered or postulated, owing to the architectural differences between single-instruction multiple-data (SIMD) devices and the traditional CPU-based multiple-instruction multiple-data (MIMD) clusters and supercomputers. Despite relatively mature GPU-enabled molecular dynamics simulation engines [25], [45], [46], [47], [48], [49], there has been significantly less exploration of Monte Carlo simulations on the GPU. As with molecular dynamics, GPGPU computing is expected to bring sizeable benefits to Monte Carlo simulations. Early work has focused on simple systems, due to the relative complexity of writing optimal GPU code. 2D and 3D simulations of Ising models were recently accelerated using single [50] and multiple GPUs [51]. GPU-driven canonical ensemble simulations of hard spheres have also been published [52]. Canonical ensemble simulations of methane in a zeolite framework have been performed on the GPU for small systems (N_methane ≤ 128 particles) [53]. This code was later expanded to perform GPU-accelerated grand canonical Monte Carlo simulations of methane and CO2 in a zeolite [54], [55], [56].

In this work, an implementation of Gibbs ensemble Monte Carlo on the GPU is presented. A straightforward implementation of the farm algorithm is used to parallelize the pair-wise energy calculations for the three distinct move types: particle displacement, particle swap and volume exchange. The choice of the farm algorithm is motivated by the fact that complex molecules, such as those found in biomolecular simulations, require long equilibration periods for adequate sampling of phase space, which significantly reduces the efficiency of embarrassingly parallel methods. Simulations of biomolecular systems present an additional problem of system size. While Monte Carlo simulations for phase equilibria and physical property prediction are typically performed for systems of 5000 atoms or fewer, typical biomolecular simulations contain tens to hundreds of thousands of atoms. Therefore, it is important to have an implementation that scales well with system size. As a demonstration, calculations are performed for systems of Lennard-Jones beads ranging in size from 512 to 131,072 particles. While the properties of the Lennard-Jones fluid may easily be studied with a serial code, the wealth of data in the literature makes this an ideal system for the evaluation of new methodologies. Calculated energies, pressures, and saturated liquid and vapor densities are presented for a range of temperatures and compared to literature data for validation. Detailed assessments of the performance of the GPU-accelerated GEMC code relative to an equivalently designed and optimized serial code are presented for a range of GPUs and CPUs. Additional data are presented for simulations in the canonical ensemble.
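A host-side driver for such a code might alternate between the three move types with fixed probabilities, as sketched below; the move mix and the attempt* functions are hypothetical placeholders for routines that would launch the GPU energy kernels, not a transcription of the authors' driver.

    // Hypothetical host-side sketch of one Gibbs ensemble MC cycle selecting
    // among the three move types; the attempt* functions are placeholders for
    // routines that launch the GPU pair-energy kernels and apply the
    // appropriate acceptance criterion.
    #include <random>

    static void attemptDisplacement(std::mt19937 &)   { /* displace one particle */ }
    static void attemptVolumeExchange(std::mt19937 &) { /* exchange volume       */ }
    static void attemptSwap(std::mt19937 &)           { /* transfer one particle */ }

    void runCycle(std::mt19937 &rng, int movesPerCycle)
    {
        // Illustrative move mix: mostly displacements, occasional volume and swap moves.
        const double pVolume = 0.01, pSwap = 0.10;
        std::uniform_real_distribution<double> pick(0.0, 1.0);

        for (int m = 0; m < movesPerCycle; ++m) {
            double x = pick(rng);
            if (x < pVolume)
                attemptVolumeExchange(rng);
            else if (x < pVolume + pSwap)
                attemptSwap(rng);
            else
                attemptDisplacement(rng);
        }
    }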

Section snippets

GPU implementation

The Gibbs ensemble methodology utilizes two simulation boxes, each representing a region of fluid deep within its respective phase (vapor, liquid) [1]. The three criteria for equilibrium, equality of temperature, pressure and chemical potential between the phases, are satisfied through three types of moves: particle displacement within a phase, volume exchange, and particle swaps between phases. Acceptance probabilities are defined for each move; for a particle displacement from old configuration o to new configuration n,

acc(o → n) = e^(−β[U_new − U_old])
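The snippet above is truncated; for reference, the standard Gibbs ensemble acceptance criteria (in the form introduced by Panagiotopoulos [1]) are reproduced below as the textbook expressions, not a transcription of the authors' equations. Here ΔU_I and ΔU_II are the energy changes in boxes I and II, N_I and N_II the particle numbers, and V_I and V_II the box volumes.

    % particle displacement within one box
    \mathrm{acc}(o \to n) = \min\left[1,\; e^{-\beta \Delta U}\right]

    % volume exchange: \Delta V transferred from box II to box I at constant total volume
    \mathrm{acc}(o \to n) = \min\left[1,\;
        e^{-\beta(\Delta U_{\mathrm{I}} + \Delta U_{\mathrm{II}})}
        \left(\frac{V_{\mathrm{I}} + \Delta V}{V_{\mathrm{I}}}\right)^{N_{\mathrm{I}}}
        \left(\frac{V_{\mathrm{II}} - \Delta V}{V_{\mathrm{II}}}\right)^{N_{\mathrm{II}}}\right]

    % particle swap: transfer of one particle from box II to box I
    \mathrm{acc}(o \to n) = \min\left[1,\;
        \frac{N_{\mathrm{II}}\, V_{\mathrm{I}}}{(N_{\mathrm{I}} + 1)\, V_{\mathrm{II}}}\,
        e^{-\beta(\Delta U_{\mathrm{I}} + \Delta U_{\mathrm{II}})}\right]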

Simulation details

In this work, interactions between particles are governed by the ubiquitous Lennard-Jones potential

U(r_ij) = 4ε_ij [ (σ_ij/r_ij)^12 − (σ_ij/r_ij)^6 ]

where U is the configurational energy of the pair interaction, ε_ij is the well depth, σ_ij is the particle diameter and r_ij is the distance between particles i and j. Reduced quantities were defined as T* = k_B T/ε, ρ* = Nσ^3/V, U* = U/ε, and p* = pσ^3/ε.
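In reduced units (σ = ε = 1) the truncated pair energy takes a particularly simple form; the device function below is a minimal sketch written for illustration (the function name and cutoff handling are assumptions, not the authors' code):

    // Minimal sketch of the Lennard-Jones pair energy in reduced units
    // (sigma = epsilon = 1) for a potential truncated at r_c; r2 and rcut2
    // are the squared separation and squared cutoff in reduced units.
    __host__ __device__ inline double ljEnergyReduced(double r2, double rcut2)
    {
        if (r2 >= rcut2) return 0.0;
        double inv6 = 1.0 / (r2 * r2 * r2);        // (1/r*)^6
        return 4.0 * (inv6 * inv6 - inv6);         // U* = 4[(1/r*)^12 - (1/r*)^6]
    }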

Simulations were performed for both the truncated and tail-corrected Lennard-Jones potential. Analytical tail corrections were
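The snippet ends mid-sentence here; for reference, the standard analytical tail corrections for a Lennard-Jones fluid truncated at r_c (the usual textbook expressions, which may differ in presentation from the equations omitted from this excerpt) are, in reduced units,

    \frac{U^{*}_{\mathrm{tail}}}{N} = \frac{8}{3}\pi\rho^{*}
        \left[\frac{1}{3}\left(\frac{1}{r_c^{*}}\right)^{9} - \left(\frac{1}{r_c^{*}}\right)^{3}\right],
    \qquad
    p^{*}_{\mathrm{tail}} = \frac{16}{3}\pi\rho^{*2}
        \left[\frac{2}{3}\left(\frac{1}{r_c^{*}}\right)^{9} - \left(\frac{1}{r_c^{*}}\right)^{3}\right]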

Results and discussion

The GPU-accelerated Gibbs ensemble Monte Carlo engine was used to determine vapor–liquid coexistence curves for the truncated and long-range-corrected Lennard-Jones potential for systems containing 10,000 atoms. These data are shown in Fig. 3 and are in excellent agreement with the original Gibbs ensemble results of Panagiotopoulos for systems containing 500 atoms [1], as well as with more modern approaches using grand canonical Monte Carlo simulations coupled with histogram-reweighting methods [57]

Conclusions

In this work, an implementation of Gibbs ensemble Monte Carlo on the GPU has been demonstrated. Parallelization of the routines that calculate pair-wise energies was performed using the method of energetic decomposition, in which each pair interaction is calculated by a particular thread. This method is naturally well suited to the architecture of the GPU, but has limitations. For small system sizes, N < 600 particles, it was found that overhead from memory transfers between system RAM and the GPU led to

Acknowledgments

Financial support from the National Science Foundation (NSF CBET-0730786, OCI-1148168) and the Wayne State University Research Enhancement Program is gratefully acknowledged.

References (63)

  • X. Peng et al., J. Colloid Interface Sci. (2007)
  • T. Okayama et al., Fluid Phase Equilib. (1995)
  • S. Plimpton, J. Comput. Phys. (1995)
  • J.A. Anderson et al., J. Comput. Phys. (2008)
  • G.S. Heffelfinger, Comput. Phys. Comm. (2000)
  • F.A. Brotz et al., Chem. Eng. Sci. (1994)
  • A. Chen et al., Sep. Technol. (1994)
  • T.D. Nguyen et al., Comput. Phys. Comm. (2011)
  • C.L. Phillips et al., J. Comput. Phys. (2011)
  • W.M. Brown et al., Comput. Phys. Comm. (2011)
  • J.C. Phillips et al., Adapting a message-driven parallel application to GPU-accelerated clusters
  • T. Preis et al., J. Comput. Phys. (2009)
  • B. Block et al., Comput. Phys. Comm. (2010)
  • W. Shi et al., Fluid Phase Equilib. (2001)
  • A.Z. Panagiotopoulos, Mol. Phys. (1987)
  • B. Smit et al., Mol. Phys. (1989)
  • J.J. de Pablo et al., Mol. Phys. (1993)
  • M.G. Martin et al., J. Phys. Chem. B (1999)
  • J.I. Siepmann et al., Mol. Phys. (1992)
  • T.J.H. Vlugt et al., Mol. Phys. (1998)
  • C.D. Wick et al., Macromolecules (2000)
  • D.A. Kofke, Mol. Phys. (1993)
  • A.M. Ferrenberg et al., Phys. Rev. Lett. (1988)
  • J.J. Potoff et al., Mol. Phys. (1999)
  • A.M. Ferrenberg et al., J. Appl. Phys. (1991)
  • A.M. Ferrenberg et al., Phys. Rev. Lett. (1989)
  • A.Z. Panagiotopoulos, Mol. Phys. (1987)
  • S.C. McGrother et al., Mol. Phys. (1999)
  • C. Lastoskie et al., Langmuir (1993)
  • J.W. Jiang et al., Langmuir (2003)
  • M.B. Sweatman et al., Mol. Simul. (2004)