Accelerating Monte Carlo simulations with an NVIDIA® graphics processor

https://doi.org/10.1016/j.cpc.2009.05.013

Abstract

Modern graphics cards, commonly used in desktop computers, have evolved beyond a simple interface between processor and display to incorporate sophisticated calculation engines that can be applied to general purpose computing. The Monte Carlo algorithm for modelling photon transport in turbid media has been implemented on an NVIDIA® 8800GT graphics card using the CUDA toolkit. The Monte Carlo method relies on following the trajectories of millions of photons through the sample, often taking hours or days to complete. The graphics-processor implementation, processing roughly 110 million scattering events per second, was found to run more than 70 times faster than a similar, single-threaded implementation on a 2.67 GHz desktop computer.

Program summary

Program title: Phoogle-C/Phoogle-G

Catalogue identifier: AEEB_v1_0

Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEB_v1_0.html

Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland

Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html

No. of lines in distributed program, including test data, etc.: 51 264

No. of bytes in distributed program, including test data, etc.: 2 238 805

Distribution format: tar.gz

Programming language: C++

Computer: Designed for Intel PCs. Phoogle-G requires an NVIDIA graphics card with support for CUDA 1.1

Operating system: Windows XP

Has the code been vectorised or parallelized?: Phoogle-G is written for SIMD architectures

RAM: 1 GB

Classification: 21.1

External routines: Charles Karney's random number library. Microsoft Foundation Class library. NVIDIA CUDA library [1].

Nature of problem: The Monte Carlo technique is an effective algorithm for exploring the propagation of light in turbid media. However, accurate results require tracing the path of many photons within the media. The independence of photons naturally lends the Monte Carlo technique to implementation on parallel architectures. Generally, parallel computing can be expensive, but recent advances in consumer grade graphics cards have opened the possibility of high-performance desktop parallel-computing.

Solution method: In this pair of programmes we have implemented the Monte Carlo algorithm described by Prahl et al. [2] for photon transport in infinite scattering media to compare the performance of two readily accessible architectures: a standard desktop PC and a consumer grade graphics card from NVIDIA.

Restrictions: The graphics-card implementation uses single-precision floating-point numbers for all calculations. Only photon transport from an isotropic point-source is supported. The graphics-card version has no user interface; the simulation parameters must be set in the source code. The desktop version has a simple user interface; however, some properties can only be accessed through an ActiveX client (such as Matlab).

Additional comments: The random number library used has a LGPL (http://www.gnu.org/copyleft/lesser.html) licence.

Running time: Runtime can range from minutes to months depending on the number of photons simulated and the optical properties of the medium.

References:

Introduction

Monte Carlo simulation is commonly used for modelling photon transport in turbid media [1], [2], [3], [4]. The gold standard for photon modelling, the Monte Carlo technique is often used to assess the performance of models and analytic solutions to the radiation transport equation [5], [6], [7] and the accuracy of experimental results [8], to estimate power density in laser treatment, and to solve the inverse problem of recovering optical properties from experimental measurements [9], [10], [11]. Its strength lies in its simplicity. Individual photons are tracked as they propagate, being scattered, absorbed, reflected and refracted by the medium according to simple physical laws that permit ready modelling of sophisticated geometries. Its Achilles heel lies here too: typically, millions of photons must be traced at each wavelength to obtain precise results, requiring hours or days of computation time.

The Monte Carlo algorithm is well suited to parallel calculation, tracing many photons simultaneously, and many implementations have been studied [12], [13], [14], [15], [16], [17], [18]. Two approaches are commonly employed: parallel computing and distributed computing. In a parallel computing environment, the processing hardware contains tens to thousands of processing units that often share resources such as memory. In a distributed computing environment, the program runs simultaneously on multiple computers communicating over a shared network; the latter often employs desktop machines that sit idle overnight. Until recently, parallel computing hardware has been characteristic of supercomputers and has been less readily accessible than networks of desktop machines, primarily due to capital cost. However, increasing demand for high-quality graphics from the entertainment industry has imbued desktop graphics co-processors with raw computational performance rivalling low-end supercomputers. For example, NVIDIA's® (California, USA) latest graphics processor (June 2008), the GTX280, claims a performance of nearly 10^12 floating-point operations per second. While graphics processors are still several orders of magnitude below the Top500 supercomputers [19], their price (typically less than $US1000) offers very attractive performance per dollar. Graphics processors have been applied to speed up many algorithms, from N-body simulations and microscope image registration to visualisation of white matter connectivity and solution of the time-independent Schrödinger equation, with performance increases of up to 100-fold [20], [21], [22], [23], [24].

To benchmark the performance that might be realised for Monte Carlo simulation on graphics processor engines, we have implemented the Monte Carlo algorithm for photon transport from an isotropic point-source in an infinite, homogeneous, turbid medium using i) a desktop processor and ii) an NVIDIA 8800gt graphics processor. Relative performance of the two implementations is compared. Though the application presented involves tracing rays in a scattering environment, we expect this approach could also be applied to ray tracing for geometric optics.

Section snippets

Method

Our implementation of the Monte Carlo algorithm is based on the work of Prahl et al. [25], [26]. Briefly, the algorithm keeps track of a photon's position, heading and probability of surviving sequential scattering and absorption events, updating these as the photon propagates through the medium (Fig. 1). The photon is launched from the origin with a survival probability of 1, which decreases at each scatter/absorption event until reaching a predetermined threshold (0.001 here), whereupon …

Results and discussion

We have not heavily optimised either implementation, favouring instead a straightforward implementation of the Monte Carlo algorithm for comparison of the relative performance of the core- and graphics-processor platforms. Consequently, higher performance is likely to be possible from both implementations with careful optimisation, such as code tuning and the techniques of Zolek et al. [35].

The tracing speed was estimated for different grid sizes to determine the optimal number of blocks …

Conclusion

The Monte Carlo algorithm for simulating photon transport in turbid media has been implemented on a standard desktop computer and a modern graphics processor to assess relative performance. The parallel graphics-processor implementation was found to trace photons more than 70 times faster than the single-threaded desktop computer implementation, processing roughly 110 million scattering events per second. We believe this result suggests graphics cards offer a significant performance and cost …

Acknowledgements

This work was supported with funding from the New Zealand Foundation for Research, Science and Technology (C06x0402).

References (35)

  • D. Fraser et al., Postharvest Biol. Tec. (2003)
  • D. Bates et al., J. Quant. Spectrosc. Radiat. Transf. (2008)
  • J. Wood et al., Ann. Nucl. Energy (1990)
  • A. Colasanti et al., Comput. Phys. Comm. (2000)
  • Y. Dewaraja et al., Comput. Meth. Prog. Bio. (2002)
  • J. Giersch et al., Nucl. Instrum. Meth. A (2003)
  • M. Thomason et al., Comput. Meth. Prog. Bio. (2004)
  • T. Binzoni et al., Prog. Bio. (2008)
  • A. Anderson et al., Comput. Phys. Comm. (2007)
  • N. Zolek et al., Prog. Bio. (2006)
  • D. Rogers, Phys. Med. Biol. (2006)
  • S. Flock et al., IEEE T. Bio-Med. Eng. (1989)
  • G. Palmer et al., Appl. Opt. (2006)
  • H. Xu et al., Opt. (2006)
  • F. Martelli et al., Phys. Med. Biol. (2000)
  • M. Nichols et al., Appl. Opt. (1997)
  • G. Palmer et al., Appl. Opt. (2006)

    This paper and its associated computer program are available via the Computer Physics Communications homepage on ScienceDirect (http://www.sciencedirect.com/science/journal/00104655).
