Accelerating Monte Carlo simulations with an NVIDIA® graphics processor

https://doi.org/10.1016/j.cpc.2009.05.013

Abstract

Modern graphics cards, commonly used in desktop computers, have evolved beyond a simple interface between processor and display to incorporate sophisticated calculation engines that can be applied to general purpose computing. The Monte Carlo algorithm for modelling photon transport in turbid media has been implemented on an NVIDIA® 8800GT graphics card using the CUDA toolkit. The Monte Carlo method relies on following the trajectories of millions of photons through the sample, often taking hours or days to complete. The graphics-processor implementation, processing roughly 110 million scattering events per second, was found to run more than 70 times faster than a similar, single-threaded implementation on a 2.67 GHz desktop computer.

Program summary

Program title: Phoogle-C/Phoogle-G

Catalogue identifier: AEEB_v1_0

Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEB_v1_0.html

Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland

Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html

No. of lines in distributed program, including test data, etc.: 51 264

No. of bytes in distributed program, including test data, etc.: 2 238 805

Distribution format: tar.gz

Programming language: C++

Computer: Designed for Intel PCs. Phoogle-G requires an NVIDIA graphics card with support for CUDA 1.1

Operating system: Windows XP

Has the code been vectorised or parallelized?: Phoogle-G is written for SIMD architectures

RAM: 1 GB

Classification: 21.1

External routines: Charles Karney's random number library. Microsoft Foundation Class library. NVIDIA CUDA library [1].

Nature of problem: The Monte Carlo technique is an effective algorithm for exploring the propagation of light in turbid media. However, accurate results require tracing the path of many photons within the media. The independence of photons naturally lends the Monte Carlo technique to implementation on parallel architectures. Generally, parallel computing can be expensive, but recent advances in consumer grade graphics cards have opened the possibility of high-performance desktop parallel-computing.

Solution method: In this pair of programmes we have implemented the Monte Carlo algorithm described by Prahl et al. [2] for photon transport in infinite scattering media to compare the performance of two readily accessible architectures: a standard desktop PC and a consumer grade graphics card from NVIDIA.

Restrictions: The graphics-card implementation uses single-precision floating-point numbers for all calculations. Only photon transport from an isotropic point-source is supported. The graphics-card version has no user interface; the simulation parameters must be set in the source code. The desktop version has a simple user interface; however, some properties can only be accessed through an ActiveX client (such as Matlab).

Additional comments: The random number library used has a LGPL (http://www.gnu.org/copyleft/lesser.html) licence.

Running time: Runtime can range from minutes to months depending on the number of photons simulated and the optical properties of the medium.

References:

Introduction

Monte Carlo simulation is commonly used for modelling photon transport in turbid media [1], [2], [3], [4]. The gold standard for photon modelling, the Monte Carlo technique is often used to assess the performance of models and analytic solutions to the radiation transport equation [5], [6], [7] and the accuracy of experimental results [8], to estimate power density in laser treatment, and to solve the inverse problem of recovering optical properties from experimental measurements [9], [10], [11]. Its strength lies in its simplicity. Individual photons are tracked as they propagate, being scattered, absorbed, reflected and refracted by the medium according to simple physical laws that permit ready modelling of sophisticated geometries. Its Achilles heel lies here too: typically, millions of photons must be traced at each wavelength to obtain precise results, requiring hours or days of computation time.

The Monte Carlo algorithm is well suited to parallel calculation, tracing many photons simultaneously, and many implementations have been studied [12], [13], [14], [15], [16], [17], [18]. Two approaches are commonly employed: parallel computing and distributed computing. In a parallel computing environment, the processing hardware contains tens to thousands of processing units that often share resources such as memory. In a distributed computing environment, the program runs simultaneously on multiple computers communicating over a shared network; the latter often employs desktop machines that sit idle overnight. Until recently, parallel computing hardware has been characteristic of supercomputers and has been less readily accessible than networks of desktop machines, primarily due to capital cost. However, increasing demand for high-quality graphics from the entertainment industry has imbued desktop graphics co-processors with raw computational performance rivalling low-end supercomputers. For example, NVIDIA's® (California, USA) latest graphics processor (June 2008), the GTX280, claims a performance of nearly 10^12 floating-point operations per second. While graphics processors are still several orders of magnitude below the Top500 supercomputers [19], their price (typically less than $US1000) offers very attractive performance per dollar. Graphics processors have been applied to speed up many algorithms, from N-body simulations and microscope image registration to visualisation of white matter connectivity and solution of the time-independent Schrödinger equation, with performance increases of up to 100-fold [20], [21], [22], [23], [24].

To benchmark the performance that might be realised for Monte Carlo simulation on graphics processor engines, we have implemented the Monte Carlo algorithm for photon transport from an isotropic point-source in an infinite, homogeneous, turbid medium using i) a desktop processor and ii) an NVIDIA 8800gt graphics processor. Relative performance of the two implementations is compared. Though the application presented involves tracing rays in a scattering environment, we expect this approach could also be applied to ray tracing for geometric optics.

Section snippets

Method

Our implementation of the Monte Carlo algorithm is based on the work of Prahl et al. [25], [26]. Briefly, the algorithm keeps track of a photon's position, heading and probability of surviving sequential scattering and absorption events, updating these as the photon propagates through the medium (Fig. 1). The photon is launched from the origin with a survival probability of 1, which decreases at each scatter/absorption event until reaching a predetermined threshold (0.001 here), whereupon …

Results and discussion

We have not heavily optimised either implementation, favouring instead a straightforward implementation of the Monte Carlo algorithm for comparison of the relative performance of the core- and graphics-processor platforms. Consequently, higher performance is likely to be possible from both implementations with careful optimisation, such as code tuning and the techniques of Zolek et al. [35].

The tracing speed was estimated for different grid sizes to determine the optimal number of blocks …

Conclusion

The Monte Carlo algorithm for simulating photon transport in turbid media has been implemented on a standard desktop computer and a modern graphics processor to assess relative performance. The parallel graphics-processor implementation was found to trace photons more than 70 times faster than the single-threaded desktop computer implementation, processing roughly 110 million scattering events per second. We believe this result suggests graphics cards offer a significant performance and cost …

Acknowledgements

This work was supported with funding from the New Zealand Foundation for Research, Science and Technology (C06x0402).

References (35)

  • D. Fraser et al., Postharvest Biol. Tec. (2003)
  • D. Bates et al., J. Quant. Spectrosc. Radiat. Transf. (2008)
  • J. Wood et al., Ann. Nucl. Energy (1990)
  • A. Colasanti et al., Comput. Phys. Comm. (2000)
  • Y. Dewaraja et al., Comput. Meth. Prog. Bio. (2002)
  • J. Giersch et al., Nucl. Instrum. Meth. A (2003)
  • M. Thomason et al., Comput. Meth. Prog. Bio. (2004)
  • T. Binzoni et al., Prog. Bio. (2008)
  • A. Anderson et al., Comput. Phys. Comm. (2007)
  • N. Zolek et al., Prog. Bio. (2006)
  • D. Rogers, Phys. Med. Biol. (2006)
  • S. Flock et al., IEEE T. Bio-Med. Eng. (1989)
  • G. Palmer et al., Appl. Opt. (2006)
  • H. Xu et al., Opt. (2006)
  • F. Martelli et al., Phys. Med. Biol. (2000)
  • M. Nichols et al., Appl. Opt. (1997)
  • G. Palmer et al., Appl. Opt. (2006)

    This paper and its associated computer program are available via the Computer Physics Communications homepage on ScienceDirect (http://www.sciencedirect.com/science/journal/00104655).
