A fast GPU-based Monte Carlo simulation of proton transport with detailed modeling of non-elastic interactions

Purpose: Very fast Monte Carlo (MC) simulations of proton transport have been implemented recently on GPUs. However, these usually use simplified models for non-elastic (NE) proton-nucleus interactions. Our primary goal is to build a GPU-based proton transport MC with detailed modeling of elastic and NE collisions. Methods: Using CUDA, we implemented GPU kernels for these tasks: (1) Simulation of spots from our scanning nozzle configurations, (2) Proton propagation through CT geometry, considering nuclear elastic scattering, multiple scattering, and energy loss straggling, (3) Modeling of the intranuclear cascade stage of NE interactions, (4) Nuclear evaporation simulation, and (5) Statistical error estimates on the dose. To validate our MC, we performed: (1) Secondary particle yield calculations in NE collisions, (2) Dose calculations in homogeneous phantoms, (3) Re-calculations of head and neck plans from a commercial treatment planning system (TPS), and compared with Geant4.9.6p2/TOPAS. Results: Yields, energy and angular distributions of secondaries from NE collisions on various nuclei agree well with the Geant4 Bertini and Binary cascade models. The 3D-gamma pass rate at 2\%-2 mm for treatment plan simulations is typically 98\%. The net calculation time on a NVIDIA GTX680 card, including all data transfers, is $\sim$20 s for $1\times10^7$ proton histories. Conclusions: Our GPU-based MC is the first of its kind to include a detailed nuclear model to handle NE interactions of protons with any nucleus. Dosimetric calculations are in very good agreement with Geant4/TOPAS. Our MC is being used to perform fast routine clinical QA of pencil-beam based treatment plans, and has also been adopted as the dose engine in a clinically-applicable MC-based IMPT TPS. The detailed nuclear modeling will allow us to perform very fast linear energy transfer and neutron dose estimates on the GPU.


I. INTRODUCTION
The ability to perform fast and accurate dose calculations is of great importance to modern radiation therapy, especially in proton therapy, where range uncertainties can be critical. In terms of accuracy, it is widely accepted that Monte Carlo (MC) techniques are the gold standard for dose calculations. MC simulations of proton treatment plans have previously been performed using well-established packages such as Geant4, FLUKA and MCNPX [1][2][3]. The improved accuracy of MC simulations compared to pencil beam calculations has been explicitly demonstrated [1]. Because of the enhanced accuracy, MC dose calculations can be used to verify results from pencil beam-based treatment planning systems (TPS), hence mitigating the effects of proton range uncertainties [4]. Unfortunately, computational times associated with MC-based treatment plan simulations using Geant4, FLUKA and MCNPX can be prohibitive. For example, a TOPAS [5] simulation of the dose within 2% statistical error (∼1×10 7 particle histories) in a 150 cm 3 tumor volume typically requires around 1 hour on a 100-node CPU cluster. This makes routine clinical QA of treatment plans very difficult. At best, MC evaluations can be reserved for a handful of cases, and even so, only if significant computing resources are available.
Recently, several attempts to implement condensed history MC simulations of proton transport on graphics processing units (GPUs) have been made, with very encouraging results [6][7][8][9]. Using Berger's classification, these attempts can be grouped as class I [6,8,9] and II [7]. The typical processing time for 1 × 10 7 particle histories is reported to be around 2-30 s. The impressive speed gain compared to CPU-based calculations is due to both algorithmic simplification and hardware acceleration. By far the most important approximation in refs. [6][7][8][9] concerns proton-nucleus collisions. Su et al. [8] and Osiecki et al. [9] have computed proton depth-dose curves in water on the GPU, without any nuclear interactions. Kohno et al. [6] indirectly consider nuclear processes by using measured proton depth-dose curves in water to determine energy losses during particle stepping. Jia et al. [7] use a simplified model of non-elastic nuclear interactions (based on the work of Fippel and Soukup [10]), which can predict the energy spectrum of secondary protons from p- 16 O collisions. The composition of all materials is assumed to be identical to water, and hence, only non-elastic collisions on 16 O are considered. Their dose calculations in bone and lowdensity tissue demonstrate good agreement with Geant4, despite the lack of an in-depth model of nuclear interactions.
However, in certain situations a more accurate model of nuclear processes is critical, e.g.
particle propagation in patients with sizeable metallic implants, detailed studies of secondary particle production, neutron dose estimates and dose-averaged linear energy transfer (LET d ) calculations. A non-elastic proton-nucleus interaction can be simulated with the MC method as a two-step process: a fast intranuclear (INC) cascade that leaves the nucleus in an excited state, followed by an evaporation stage [11]. Traditional CPU-based simulations are rather time-consuming: it takes ∼200s to compute 1 × 10 6 200 MeV p-16 O non-elastic interactions 1 with the Bertini INC model in Geant4.9.6p2 and the Liège INC model [12] (INCL++ v5.1), respectively. In an effort to accelerate these calculations, we recently deployed a Bertini-style simulation of non-elastic interactions on the GPU [13], achieving a speed-up factor of ∼50 compared to the Geant4 Bertini model.
In this paper, we present a GPU-based class II condensed history proton transport MC, which uses our previously reported INC and evaporation kernels 2 [13] to simulate non-elastic 3 collisions with any nucleus, on an event-by-event basis. Elastic collisions of protons with nuclei other than hydrogen and oxygen are also modeled. Very good agreement is obtained with Geant4.9.6p2/TOPAS. On a NVIDIA GTX680 card, we report a net calculation time (including all CPU operations and CPU-GPU data transfers) of ∼20 s for a typical plan with 1 × 10 7 histories. Previous authors [6][7][8][9] report computational times that are of the same order. Compared with previous work, our simulation therefore contains a much more comprehensive treatment of nuclear processes, while achieving comparable net computational times.
This paper is organized as follows. In §II A, we describe our physics model and all the major components of our simulation. This includes realistic beam spot modeling, energy loss, multiple scattering, energy straggling, INC, nuclear evaporation, elastic nuclear scattering and transport mechanics. In §II B, the GPU implementation is described: our parallelization strategy ( §II B 1), the software structure ( §II B 2) and GPU memory management ( §II B 3).
We give details of our validation methods and of our treatment plan re-calculation workflow 1 This is roughly the expected number of non-elastic events for 1 × 10 7 proton histories. 2 In this paper, 'kernel' refers to a piece of code that runs on the GPU but called from the host CPU. 3 The ICRU definitions for the terms 'elastic', 'non-elastic' and 'inelastic' are adopted in this paper [14].
in §II C and II D respectively. In §III, we report on the validation results, comparing our MC simulation results with Geant4.9.6p2/TOPAS: (1) predictions from our model of non-elastic nuclear interactions ( §III A), (2) pencil beam dose distributions in homogeneous phantoms ( §III B), and (3) MC re-calculations of complex head and neck treatment plans ( §III C).
We then discuss a number of current and future applications of our work in §IV, before summarizing and concluding in §V.

Geometry
We focus only on transport through voxelized geometries. Patient geometry is implemented from CT scans, using the formalism of Schneider et al. [15] to convert the Hounsfield Unit (HU) of each voxel to material density and composition. The HU to density conversion factors are the same as in ref. [15]. Following ref. [1], in the HU range between -1000 and 3060 we use 26 materials comprised of a combination of up to 12 elements, with everything above 3060 assumed to be metallic implant (titanium by default).

Stopping powers, multiple scattering, energy straggling
Total stopping power (dE/dx) tables for protons in water, air and titanium are obtained from Geant4.9.6p2. Proton range-energy tables are then generated via numerical integration in the continuous slowing down approximation (CSDA). Second derivatives are also precalculated so that the tables can be cubic-spline interpolated. The stopping power of a proton of energy E in material m with −1000 < HU < 3060 is related to S w (E), the stopping power in water, by: where ρ m is the density of material, ρ w is the density of water, and f m is a scaling factor.
f m is tabulated as a function of HU using Geant4.9.6p2. For air (HU = -1000) and titanium (HU ≥ 3060), this scaling factor is not applied, and pre-generated tables are used directly instead.
Multiple scattering in each condensed history step is simulated using the Highland formula [16]. A multiplicative factor of 1.07 is added to obtain comparable angular spread with Geant4.9.6p2, which employs the Urban multiple scattering model [17]. Radiation lengths for all defined materials are pre-calculated with Geant4.9.6p2. We also implemented a full Molière multiple scattering simulation, which takes into account the single scatter tail at the cost of slowing down the simulation by a factor of ∼5. All results presented in this work were produced using the scaled Highland formula.
Energy straggling is simulated with a Vavilov-Gaussian model. Fast sampling from Vavilov distributions is achieved using Chibani's method [18]. The mean energy loss in each step is found by cubic-spline interpolating the range-energy table. For a material m with −1000 < HU < 3060, the CSDA proton range R m (E) is derived from the range in water, In this work the kinetic energies of δ-ray electrons are assumed to be deposited locally.
Given the typical size of a voxel (∼1 mm), this approximation holds very well for soft tissue and bone.

Transport mechanics
The step size ∆x is initially picked as the minimum of three quantities: (1) the distance to the next elastic or non-elastic nuclear interaction, λ, which is selected according to the total nuclear cross-section (2) the distance to the next voxel boundary, d, and (3) the range of the proton, R(E). We use a 'random hinge' algorithm [19] to take into account the small transverse displacement after each multiple scattering step.
If ∆x = λ, a nuclear collision occurs, and the proton is propagated to the collision point.
A random number is thrown to determine if the collision is elastic or non-elastic. The proton history is terminated if one of three conditions are met: (1) ∆x = R(E), which means that the proton is stopping in the current voxel, (2) E ≤ E cut , where E cut = 0.1 MeV, and (3) the proton is outside the boundaries of the voxelized geometry.

Non-elastic nuclear interactions
Non-elastic proton-nucleus cross-sections for all relevant nuclei in the energy range 0-250 MeV were tabulated using Geant4.9.6p2. Once a non-elastic collision is deemed to happen, a random number is thrown and the target is picked according to the relative sizes of the non-elastic cross-sections of each nuclear component. After picking a random point of entry into the target, the proton is propagated through nuclear matter. This is repeated until an INC involving at least one Pauli-allowed nucleon-nucleon collision occurs within the nucleus.
We very briefly summarize the INC and evaporation simulations below (for further details see ref. [13]).
The nucleus is assumed to be a two-component Fermi gas of neutrons and protons. The nucleon density follows a Woods-Saxon distribution that is approximated by 15 constant density shells. The incident proton and all collision products are tracked in the nuclear matter, until they exit the nucleus or become trapped. Pauli blocking is enforced by requiring the kinetic energies of all collision products to exceed their Fermi energies. Reflection and refraction at the potential shell boundaries are taken into account. In the therapeutic energy range (E < 250 MeV), only elastic collisions between nucleons need to be considered.
Total and differential elastic nucleon-nucleon cross-sections are calculated using the parameterization of Cugnon et al. [20]. The result of the INC simulation is a number of outgoing nucleons and a residual nucleus that subsequently de-excites by particle evaporation.
To model the evaporation phase, we use the generalized evaporation model (GEM) [21].
The angular distribution of evaporated particles is assumed to be isotropic in the rest frame of the nucleus. Other post-cascade processes such as photon evaporation, pre-equilibrium emission of particles, fission and Fermi break-up are currently not considered. Only protons produced in the INC and evaporation stages are stored for subsequent propagation. Neutrons are presently assumed to be dosimetrically irrelevant, and the kinetic energies of all charged particles heavier than protons are locally deposited.

Coherent nuclear elastic interactions
We simulate coherent nuclear elastic collisions with all constituent nuclei in the patient geometry. Collision kinematics are fully relativistic. Total elastic cross-sections σ el are obtained from Geant4.9.6p2. Scattering angles θ in the center-of-mass frame are sampled according to the following parameterization of the elastic differential cross section by Ranft [22]: where A is the mass number of the target and the invariant momentum transfer t in (GeV/c) 2 is given by t = −2p 2 (1 − cos θ), p being the momentum of the scattered proton in GeV/c. we verified that probability distributions of kinematic variables for all relevant off-axis beamspots were essentially identical to those for central-axis spots. This significantly reduces the size of input phase-space data, since central-axis particle lists can be used to model off-axis beam spots. However, if the nozzle includes a range shifter, an energy reduction has to be applied to E to account for the additional energy loss of protons in off-axis spots 4 . This energy correction can be computed knowing the particle direction and the range shifter thickness.

Simulation of proton beam spots from scanning nozzle
The phase-space files are pre-loaded into our MC at initialization time, and protons are generated at the nozzle exit. In practice, the number of primary proton histories in a given calculation has to be scaled down to account for losses due to nuclear interactions in the nozzle; otherwise the dose deposited in the phantom or patient would be slightly overestimated. This scaling factor was calculated for each nozzle configuration and beam energy. The protons are then are propagated through air to the edge of the volume defined by the CT image in a single step, taking into account multiple scattering, energy straggling and nuclear interactions.

B. GPU implementation
We now describe our implementation of the above model on a GPU using the CUDA C framework [24]. The reader is assumed to be acquainted with GPU terminology; if not, he or she is referred to the abundant resources available on the NVIDIA developer website [25].

Parallelization approach
Many different strategies have been used in the past to parallelize photon, neutron and charged particle transport MCs on the GPU. Although no two simulations are identical, one can identify two broad paradigms: the 'one step per thread', and 'one particle history per thread' methods.
In the first method (e.g. refs. [26,27]), the simulation of particle histories is split into basic components (such as track stepping, boundary crossing and discrete physical interactions), which are accumulated and processed in parallel by kernels that contain little or no branching instructions. As shown by Du et al. [27] this strategy can result in a divergence-free simulation because all threads in a warp are made to execute the same set of instructions.
The second method is the most commonly adopted approach. Here, a GPU thread processes a particle history until end conditions are satisfied. Since particle histories differ, thread divergence is inevitable, especially if more than one particle type are handled by a warp, or if a particle and its secondaries are tracked by the same thread. However, this method potentially results in a much reduced number of global memory transactions. For example, at the end of each step the particle position, direction and energy needs to be updated. These variables can be read from global to register memory, updated during track stepping and written back to global memory when end conditions are met. In contrast, in the 'one step per thread' method, processing each track step would require at least one global memory read and write per kinematic variable, since one kernel launch occurs per step.
Comparing the two methods for a neutron transport MC, Du et al. [27] find the 'one step per thread' method to be ∼10 times slower. In their implementation, the benefits of a divergence-free simulation is overwhelmed by severe global memory access latency. In this work we therefore follow the 'one history per thread' approach. We limit the effects of thread divergence to some extent by grouping non-elastic events and processing them in parallel.
Our code structure is discussed next.

Software structure
We implement the following five kernels: (1) a phase space kernel to set the initial kinematic variables of the protons in order to simulate realistic beam spots ( §II A 6), (2) a transport kernel to evolve the proton history step by step through the voxelized geometry,  (3) and (4) are described in further detail in ref. [13]. We also use CURAND kernels [28] for on-the-fly random number

GPU memory allocation and management
Read-only data such as phase space kinematic variables, interaction cross-sections and stopping power and range tables are stored in as 1-D textures. Constant memory is used for storing physics and mathematical constants. Shared memory is barely used in this work. CT geometry information and the 3-D dose tallies of each primary proton group are kept in GPU global memory. All dose tallies need to be updated 'atomically' to prevent race conditions when multiple threads are updating the dose deposited to the same voxel. For any proton group, a single dose tally was found to be sufficient on Kepler cards 5 . Global memory is also used to store stacks of particles and excited nuclei, using structures of arrays (SoAs) to group together information relevant to each entity (e.g. a SoA of particles contains pointers to position, direction and kinetic energy arrays on device memory). Using SoAs guarantees coalesced global memory transactions. To minimize global memory traffic, SoAs are read and updated once per kernel call. For example, in a transport kernel call, information contained in the particle SoA is read into register memory, modified many times during particle tracking and written back to the SoA at the end of the call.
As seen in fig. 2  The workflow that we adopted for recalculating plans from a commercially available TPS is shown in fig. 3. DICOM files (plan, structure and image) outputted by the TPS are processed using a set of Python scripts, to produce inputs for our MC. This step involves trimming the CT image to reduce computational time: using information contained in the structures file, a smaller rectangular volume containing the patient geometry is defined, so that air-filled voxels in the beam path are removed as much as possible. Beam spot weights, isocenter positions, gantry and couch rotation angles, and details of the nozzle configuration are also read and outputted in text format. A 'bit mask' file specifying whether or not each voxel in the reduced volume is within a user-specified list of structures is created.
This information allows the user to override the HU values of all voxels inside or outside a given structure, and to restrict dose file output to certain structures only. This considerably reduces GPU-CPU data transfer and file output times at simulation end.

D. Validation
To validate our nuclear model, we simulated non-elastic proton collisions on a large number of therapeutically relevant nuclei, with incident energies between 70 and 230 MeV.
We compared our predictions with Geant4.9.6p2 (using the Bertini and Binary cascade models) and published measurements when available.
Using our MC, we simulated infinitesimally narrow pencil beams of protons with energies between 70 and 230 MeV, in various homogeneous voxelized phantoms including water, soft tissue, dense bone and pure calcium and titanium. The voxel size was 1 mm along all directions. The simulations were also carried out with Geant4.9.6p2, using the QGSP BIC EMY [33] physics list. All particles were terminated once they exit the phantom, so as to eliminate any in-scattering contributions from outside the region of interest. These pencil beam simulations allowed us to verify and refine our implementation of the physics models for proton transport. We also simulated square fields of varying sizes in water at several beam energies, for different treatment nozzle configurations. These square field simulations were repeated with TOPAS, using the nozzle model described in §II A 6.
To verify our treatment plan calculations, we simulated three complex head and neck cases involving different patients using both TOPAS and our GPU-based MC. The TOPAS simulation again included the nozzle model described in §II A 6. Both simulations adopted the native CT voxel size (1.25 × 1.25 × 2.5 mm 3 ) for particle transport and dose scoring.
To quantify differences between the two MC we use the 3-D gamma index [34], which is also computed on the GPU. However, the 3-D gamma index might not be sensitive to systematic biases that are smaller than the pass criterion. To investigate these, we examined distributions of the percentage dose difference, defined for a given voxel as: where the dose calculated by TOPAS and our MC in that voxel are D G4 and D GP U , respectively. We also define the quantities ∆ G4−G4 and ∆ GP U −GP U . The latter is defined as: where D GP U is the dose calculated by our MC for the same number of protons, but using a different starting random number seed, and σ GP U stat is the average percentage statistical uncertainty on the dose in the voxel. σ GP U stat is estimated by dividing the total number of primary protons in 10 groups, as explained in §II B 2. ∆ G4−G4 , which measures the statistical error in the TOPAS simulation, is defined in a similar way.
Ideally, 1-D distributions of ∆ G4−GP U , ∆ G4−G4 and ∆ GP U −GP U in the targets should be Gaussians peaked at zero, with identical widths. Biases and other model discrepancies result in peak shifts and RMS differences.  [20] and have been scaled down for display clarity (except for the topmost curves). In general, reasonable agreement is obtained with both Geant4.9.6p2 models and published data. The based MC, and (5) Our GPU-based MC, compiled with the -use fast math flag 6 . We used a i7-3820 3.6 GHz processor for the CPU tests, and the GPU calculations were performed on a single NVIDIA GTX680 card. It is seen that our MC can be ∼50 times faster than the Geant4 Bertini model. It was verified that identical results were obtained in cases (3), (4) 6 Compiling with the -use fast math flag enables the use of faster, but less accurate functions for standard mathematical operations on NVIDIA GPUs. using the Geant4 dose maps as the reference, the 3-D gamma pass rate at 2%-2 mm was 100% for all beam energies and phantoms investigated. In particular, our MC correctly models the transverse dose tails, down to the regime where nuclear elastic scattering dominates. The square field simulations also demonstrated very good agreement between Geant4 and our MC. These calculations were performed to test our ability to compute large field sizes, and to correctly position and model beam spots. For brevity, we won't show these results, since the treatment plan recalculation results below imply that these requirements were met. Distributions of ∆ G4−GP U , ∆ G4−G4 and ∆ GP U −GP U in the high-dose target are Gaussian, as shown in the right plot in fig. 8. We observe that the distribution of ∆ G4−GP U is slightly off-center, i.e. D GP U is on average 0.2% lower than D G4 . This small dose deficit is attributed primarily to neutrons and de-excitation gammas originating from nuclear interactions in the nozzle and in the patient, and which are not currently propagated in our MC simulation.

A. Non-elastic interactions
Gaussian fits to the three curves yield RMS values of 2.2, 2.0 and 1.9, respectively. The value of 1.9 is consistent with the estimated value of σ GP U stat = 1.3% (see Eq. 5). The percentage statistical error in the TOPAS simulation is slightly higher (σ G4 stat ∼ √ 2%). This is possibly caused by the difference in energy straggling implementations between TOPAS and our MC.
Note that in our simulation the initial number of primary proton histories has been scaled down relative to TOPAS to account for nuclear interactions in the nozzle (see §II A 6). σ 2 total , the variance of the distribution of ∆ G4−GP U , can be written as: where σ 2 model is the additional variance due to model differences. From the above numbers, σ model ∼ 1%. We hypothesize that the main contribution to σ model is due to the handling of δ-rays: TOPAS propagates these particles, while in our simulation the kinetic energy of electrons is locally deposited.
For the two other head and neck plan re-calculations, the 3-D gamma pass rates at 2%-2mm were 97.8% and 98.9%. Very good agreement was again obtained between the DVHs generated from our MC and TOPAS, and dose difference distributions followed the same trend. Hence, for brevity we will not discuss these two plans in further detail.
The computational times for the three head and neck plans on two types of NVIDIA cards are shown in table II. We verified that simulations performed with and without the -use fast math flag were equivalent. It is seen that our MC on a single K20Xm card can be up to ∼200 times faster than TOPAS on a 100-node cluster. Evidently, TOPAS performs a more detailed calculation, but as shown above, dosimetrically the results from the two simulations are very close.   fig. 3, allows the MC-calculated dose to be imported into the TPS for convenient evaluation and display.
2. In addition, we have developed a GPU-based intensity modulated proton therapy (IMPT) optimization code that adopts our MC as the dose computation engine [35].
For a complex three-field head and neck plan, the full calculation, including the initial dose map for all relevant beam spots, is estimated to take half an hour on a cluster with 24 GPUs. Therefore, using GPU acceleration, MC-based IMPT planning can be clinically applicable on a routine basis. We are currently upgrading this code to a robust MC-based IMPT optimization system.

B. Future applications
We are planning the following two improvements to our physics model: simulation of the production and transport of δ-rays, and propagation of secondary neutrons from non-elastic nuclear interactions of protons in the patient. The ability to evolve neutron histories will allow us to make very fast and accurate neutron dose calculations. As discussed above, one of the main strengths of this work is the ability to carry out detailed nuclear interaction simulations very rapidly on the GPU. It was shown in ref. [36] that secondary protons have a significant impact on the LET d distribution in clinical proton beams. In fact, our MC is capable of computing physical dose and LET d simultaneously; this does not considerably extend the net computational time. The LET d calculations in water are in good agreement with TOPAS. Fig. 9 shows our prediction of the LET d distribution in areas with at least 10% of the maximum dose, in the same coronal plane as in fig. 7.
As expected, LET d values are higher in regions where the protons are ranging out. At present, our LET d values neither include the small contributions of proton recoils from scattering of secondary neutrons, nor heavy ion recoils. The neutron scattering portion will be taken into account after the addition of a neutron transport kernel to our MC. The ability to perform MC-based LET d calculations rapidly and accurately on the GPU opens very exciting prospects for biological treatment planning.

V. CONCLUSIONS
To summarize, we have successfully developed a GPU-based proton transport MC that incorporates Bertini cascade and evaporation kernels for modeling non-elastic interactions of protons with any nucleus in the therapeutic energy range. We have verified our MC extensively using Geant4 and TOPAS. The net computation speed is typically ∼20 s per 1 × 10 7 proton histories, which is comparable to processing times reported by previous authors using simulations with simpler nuclear interaction models. The very fast calculation speed allowed this MC to be used in our clinic as the central component of a treatment plan verification system, and also as the dose calculation engine for MC-based IMPT optimization.
Furthermore, the detailed nuclear modeling not only gives us greater confidence in our physical dose calculations, but will allow us to perform accurate GPU-based MC calculations of LET d , as well as fast neutron dose estimates.

VI. ACKNOWLEDGEMENTS
This work was partially funded by a grant from Varian Medical Systems, Inc. We are grateful to our colleagues at Mayo Clinic for carefully reading the manuscript.