GPU-accelerated ray-tracing for real-time treatment planning

Dose calculation methods in radiotherapy treatment planning require the radiological depth of the voxels representing the patient volume to correct for tissue inhomogeneities. This information is acquired by time-consuming ray-tracing-based calculations. For treatment planning scenarios with changing geometries and real-time constraints this is a severe bottleneck. We implemented an algorithm for the graphics processing unit (GPU) that uses a ray-matrix approach to reduce the number of rays to trace. Furthermore, we investigated the impact of different memory access strategies in the kernel implementations as well as strategies for rapid data transfers between main memory and the memory of the graphics device. Our study included overlapping computations and memory transfers using Hyper-Q to reduce the overall runtime. We tested our approach on a prostate case (9 coplanar beams). The measured execution times for a complete ray-tracing range from 28 msec for the computations on the GPU to 99 msec when data transfers to and from the graphics device are included. Our GPU-based algorithm performed the ray-tracing in real time. The presented strategies efficiently reduce the time consumed by memory accesses and the data transfer overhead. The achieved runtimes demonstrate the viability of this approach and allow improved real-time performance for dose calculation methods in clinical routine.


Introduction
The computation of the radiological depth data is a vital preprocessing step for radiotherapy treatment planning dose calculations. Since the depth data represents the patient geometry, dose algorithms can use it to correct for tissue inhomogeneities along incident rays. The computation of the radiological depth data relies on time-consuming ray-tracing operations, typically carried out for every voxel of the patient volume. In adaptive radiotherapy (ART) [1][2][3] treatment planning scenarios with changing patient geometries, this is a severe bottleneck. Our group investigates a new IMRT treatment planning paradigm (interactive dose shaping) which requires rapid access to radiological depth information for changing patient geometries [4][5]. GPUs are powerful high-core-count devices that are no longer used solely for graphics applications, but also for general-purpose computing tasks. Compared to CPUs, their individual computing cores are weaker, but their sheer number results in a significant performance increase. GPUs also offer a memory bandwidth that is about one order of magnitude higher than that of single-socket CPUs. However, GPUs only excel in performance if (1) their vast number of computing cores can be kept busy, and (2) they can perform the computations in-core, i.e. if the input and output data associated with the computation is stored in the on-device memory. Otherwise, performance can degrade due to data movements over the bandwidth-limited PCIe interface. In this work we implemented a GPU-based ray-tracing algorithm using CUDA [6]. We investigated the impact of different strategies to increase the memory access efficiency with respect to the runtime. This includes the potentially inefficient memory access patterns of the ray-tracing kernel functions to the on-device graphics memory as well as the overhead resulting from data transfers to and from the host's main memory.
In addition, we investigated the benefits of the GPU's capabilities for concurrent kernel execution and data movement. Results are reported for a prostate patient data set.

The ray-tracing algorithm
Our ray-tracing implementation is based on the algorithm proposed by Siddon [7], which we adapted for the GPU by utilizing the stepping approach in [8] and the ray-matrix approach in [9]. A ray-matrix is defined at the iso-center of the treatment field, perpendicular to each incident beam. The size of the matrix is defined by the extension of the treatment field plus a scattering margin on each side of a beam. Each matrix element represents a point in 3D space. Rays are defined from the beam source through each matrix element, and tracing is performed until a ray exits the patient. The number of points is chosen so that the distance between two neighboring rays at the patient's exit plane is equal to the voxel size. The ray-matrix approach can greatly reduce the number of ray-tracing processes in comparison to tracing every voxel of the patient volume for each beam. In addition, every ray can be traced independently from every other ray, which fits well with the CUDA programming model and allowed us to decompose the problem so that one thread processes the ray-tracing along one ray. For each beam we execute one kernel which spawns a thread for each point in the corresponding ray-matrix.
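As a minimal host-side sketch of the stepping-based traversal described above, the radiological depth along one ray can be accumulated as follows. The grid layout, step length, and density values are hypothetical, and boundary handling is simplified; in the GPU version, one CUDA thread would execute this loop for one ray-matrix element.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical voxel grid with an x-fastest memory layout and isotropic voxels.
struct Grid {
    int nx, ny, nz;
    double voxel;                 // voxel edge length in mm
    std::vector<float> density;   // relative density per voxel

    float at(int ix, int iy, int iz) const {
        return density[(static_cast<std::size_t>(iz) * ny + iy) * nx + ix];
    }
    bool inside(int ix, int iy, int iz) const {
        return ix >= 0 && ix < nx && iy >= 0 && iy < ny && iz >= 0 && iz < nz;
    }
};

// Accumulate radiological depth along a ray from `origin` in direction `dir`
// (unit vector), sampling the density every `step` mm until the ray leaves
// the grid (simplified: the origin is assumed to lie inside the grid).
double radiologicalDepth(const Grid& g, const double origin[3],
                         const double dir[3], double step) {
    double depth = 0.0;
    double p[3] = {origin[0], origin[1], origin[2]};
    for (;;) {
        int ix = static_cast<int>(std::floor(p[0] / g.voxel));
        int iy = static_cast<int>(std::floor(p[1] / g.voxel));
        int iz = static_cast<int>(std::floor(p[2] / g.voxel));
        if (!g.inside(ix, iy, iz)) break;   // ray exited the volume
        depth += g.at(ix, iy, iz) * step;   // density-weighted path length
        p[0] += dir[0] * step;
        p[1] += dir[1] * step;
        p[2] += dir[2] * step;
    }
    return depth;
}
```

For a homogeneous unit-density grid, the accumulated depth equals the geometric path length through the volume, which is a convenient sanity check for any implementation.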

Time-to-solution
We use time-to-solution (TTS) as a metric to evaluate the performance of our implementation. We define TTS as the accumulated runtimes of the ray-tracing kernel functions on the GPU, the required input data transfer to the graphics device, and the transfer of the computed radiological depth data cubes back to the host's main memory, as illustrated in Figure 1.
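For the non-overlapped case, this definition amounts to the sum of the single input transfer, the per-beam kernel runtimes, and the per-beam output transfers. A minimal sketch (all timing values in the example are hypothetical placeholders, not measurements):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Time-to-solution for the serial case: one H2D transfer of the patient
// volume, then for each beam one kernel execution and one D2H transfer of
// the radiological depth cube.  All times in milliseconds.
double timeToSolution(double h2dInput,
                      const std::vector<double>& kernel,
                      const std::vector<double>& d2hOutput) {
    return h2dInput
         + std::accumulate(kernel.begin(), kernel.end(), 0.0)
         + std::accumulate(d2hOutput.begin(), d2hOutput.end(), 0.0);
}
```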

Acceleration strategies
GPUs employ an explicit memory hierarchy that is exposed to the user through NVidia's CUDA framework. Accessing data from e.g. global or texture memory has an impact on the utilization of the computational resources of the graphics hardware. The global memory is the main memory of NVidia graphics devices. It is cached (although not to reduce access latency, but to reduce contention) and reaches its best performance for coalesced access patterns, i.e. when threads access consecutive memory addresses. In contrast, texture memory is optimized for accessing grid data. It reaches its best performance for access patterns based on spatial locality. Our initial implementation of the ray-tracing kernel, the global memory kernel (GMK), accesses the input data stored in global memory. As acceleration strategy (1) we implemented a second version, the texture memory kernel (TMK), which accesses the input data provided as a 3D texture. This strategy focuses on the data transfer between graphics memory and the computational units. The input and output data for both the GMK and the TMK kernel implementations is transferred to and from paged host memory. This is a potential bottleneck: before the CUDA driver can invoke the host-to-device (H2D) and device-to-host (D2H) transfers using direct memory access (DMA), it has to copy the data from the paged buffer to an internal pinned buffer. Allocating the host memory as pinned or page-locked memory exempts it from demand-paging and therefore enables the GPU to access the host's main memory directly using DMA, without this copy overhead. Strategy (2) focuses on the data transfer between the host's main memory and the device memory. We enhanced our implementation for both the GMK and the TMK kernel by using page-locked memory on the host side. The expected benefit is an approximately 2x higher memory bandwidth for the input/output data transfers and thus a reduction of the TTS.
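To illustrate why the efficiency of global memory accesses can depend on the beam angle in strategy (1), consider the linear-index distance between the voxels visited by two neighboring rays (i.e. two adjacent threads) at the same step. The x-fastest array layout below is an assumption for illustration only; the point is that some beam directions yield unit stride across neighboring threads (coalescing-friendly), while others yield strides of a whole row or slice, which the spatial-locality caching of texture memory tolerates better.

```cpp
#include <cassert>
#include <cstddef>

// Linear index of voxel (ix, iy, iz) in a 3D array of extent nx x ny x nz,
// assuming an x-fastest memory layout (illustrative assumption).
std::size_t linearIndex(int ix, int iy, int iz, int nx, int ny) {
    return (static_cast<std::size_t>(iz) * ny + iy) * nx + ix;
}
```

For a beam along the z-axis, neighboring rays differ by one voxel in x, so adjacent threads touch adjacent addresses; for a beam along the x-axis, neighboring rays differ in y or z, so adjacent threads are separated by a whole row (nx elements) or slice (nx*ny elements) in memory.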
The final acceleration strategy (3) exploits Hyper-Q, the capability of NVidia's Kepler architecture for concurrent kernel execution and data movement. With this strategy, the kernel executions and memory transfers of multiple beams overlap, which can contribute to a reduction in TTS. Table 1 provides a quick overview of the different strategies and versions of the ray-tracing implementation.
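The expected benefit of overlapping can be sketched with a simple pipeline model: without overlap, the output transfer of beam i must complete before the kernel of beam i+1 starts, whereas with concurrent copy and compute the D2H transfer of beam i can be hidden behind subsequent kernels. The model below assumes a single copy engine and back-to-back kernels; all times are hypothetical, not measured values.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Serial execution: each beam's kernel and D2H transfer run one after the
// other, preceded by a single H2D transfer of the input volume (times in ms).
double serialTTS(double h2d, const std::vector<double>& k,
                 const std::vector<double>& d2h) {
    double t = h2d;
    for (std::size_t i = 0; i < k.size(); ++i) t += k[i] + d2h[i];
    return t;
}

// Overlapped execution: kernels run back-to-back on the compute queue while
// a single copy engine streams each finished depth cube back to the host.
double overlappedTTS(double h2d, const std::vector<double>& k,
                     const std::vector<double>& d2h) {
    double compute = h2d;  // time at which beam i's kernel finishes
    double copy = 0.0;     // time at which the copy engine becomes free
    for (std::size_t i = 0; i < k.size(); ++i) {
        compute += k[i];                          // next kernel finishes
        copy = std::max(copy, compute) + d2h[i];  // D2H starts when both ready
    }
    return copy;  // TTS ends when the last output cube reaches the host
}
```

In this model, all but the last output transfer are partially or fully hidden behind kernel execution, which is the effect Hyper-Q enables in practice.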

Development & test environment
The proposed acceleration strategies were implemented in CUDA 5.0 and C/C++ using MS VisualStudio 2010. We tested the implementations on a Win7 workstation equipped with an Intel Xeon E5 CPU with 64 GByte RAM and an NVidia Tesla K20c with 5 GByte graphics memory.
We used an IMRT treatment plan for a prostate patient with 9 coplanar beams and 256 x 256 x 234 voxels of size (2.62 mm)³ as test data set. The ray-matrix size for each of the 9 beams covers the beam extension plus an 8 cm scattering margin on each side. The ray-matrix dimensions range from 140 x 140 to 186 x 186 points.
To test the accuracy of our GPU implementations, we used the serial CPU implementation of the ray-tracing algorithm from [9] as reference.

Results
The results of our GPU implementations were compared to the reference implementation. Both produce equal results within single-precision accuracy. The runtimes measured for the different versions of our ray-tracing implementation are listed in Table 2. Columns 3 to 5 present the runtimes in msec for the kernel execution on the GPU, the data transfer times H2D and D2H, and the TTS. The runtimes are accumulated over 9 kernel invocations, a single data transfer of the input volume, and the 9 data transfers of the output volumes. For the Hyper-Q versions, however, the kernel execution time represents the time from the start of the first kernel until the last kernel finished. The corresponding profiler timeline is shown in Figure 2. It allows intuitive visual access to the workflow on the GPU and shows the impact of the implemented acceleration strategies. The golden boxes represent the time consumption of data transfers between host and graphics hardware. The aqua blue boxes show the runtime of the GMK and the purple boxes represent the TMK. The first golden box depicts the single input data transfer to the graphics device; the remaining ones account for the transfers of the output cubes to the host's main memory, one cube for each beam.

Discussion
We investigated three acceleration strategies for enhancing the ray-tracing performance for the radiological depth computation in real-time treatment planning. The application of these strategies can enhance the efficiency of the IMRT treatment planning process, in particular when changing patient geometries are considered. We defined the TTS metric to assess the benefit of offloading the ray-tracing to the graphics device for our in-house TPS framework [4][5]. However, many other applications require the results back in the host's memory and can benefit from the proposed strategies. Our first strategy (1) focuses on the impact of using texture memory as opposed to global memory for the ray-tracing kernel implementations. The TMK showed a better performance, since the ray-tracing memory access patterns are based on spatial rather than temporal locality. This difference is especially prominent for beam angles whose data access patterns do not map well to caching strategies based on locality of reference. The profiler output shows reduced sizes of the purple boxes in comparison to the aqua blue boxes and thus indicates that the TMK performance is largely independent of the beam angle. For strategy (2), we enhanced the implementations by using page-locked host memory with both kernels. The performance gain is shown in the profiler output by the reduced sizes of the golden boxes. The observed memory bandwidth was stable above 6 GByte/sec across the PCIe interface, which amounts to approximately a factor of 2 compared to the implementations using paged host memory. The final strategy (3) leads to the largest reduction in TTS. It overlaps kernel executions and output data transfers back to the host using Hyper-Q. We did not expect both concurrent implementations, GMK and TMK, to achieve the same TTS. This indicates that the rather inefficient global memory accesses of the GMK were efficiently hidden using Hyper-Q.
Combining the strategies, we were able to carry out the ray-tracing computations for 9 treatment beams in clinical resolution in 28 msec using graphics hardware of NVidia's current Kepler architecture, exploiting its capability for concurrent kernel execution and data movement. Including the data transfers, we obtained a TTS of 99 msec, which corresponds to a speed-up of 3 compared to the initial GMK and TMK implementations using paged host memory. Future work will include the integration of multi-GPU support to study the scaling behavior of our implementation. Furthermore, we will investigate the application of the GPU-based ray-tracing in adaptive treatment planning scenarios for rapid response to changing patient geometries.

Conclusion
We have shown that the computation of radiological depth data for each beam and each voxel of a patient volume can be carried out in real time using graphics hardware. The memory bandwidth between the host and the device is the limiting factor for ray-tracing on the GPU. The presented strategies minimize the time consumed by memory accesses and data movements using novel features of NVidia's Kepler GPU architecture. We achieved the lowest time-to-solution by combining our texture memory kernel with Hyper-Q.