Fast calculation of computer-generated-hologram on AMD HD5000 series GPU and OpenCL

In this paper, we report fast calculation of a computer-generated-hologram (CGH) using the new architecture of the HD5000 series GPU (RV870) made by AMD and its new software development environment, OpenCL. Using an RV870 GPU and OpenCL, we can calculate a CGH with a resolution of 1,920 × 1,024 from a 3D object consisting of 1,024 points in 30 ms. This is approximately two times faster than the same calculation on a GPU made by NVIDIA.


Introduction
A CGH (computer-generated-hologram) can correctly record and reconstruct the light wave from a 3D object. Because of this remarkable feature, electroholography [1] based on the CGH technique is attractive as a 3D display; however, two significant problems make it difficult to develop a practical 3D display system using electroholography. One problem is the need for an SLM (spatial light modulator) that can display a CGH with a large area and high resolution, because the resolution of a CGH is of wavelength order [2,3,4,5]. The other problem is the enormous computational time required for generating a CGH. This paper focuses on the latter problem.
Assuming that a 3D object is composed of $N$ point light sources, the formula for computing a CGH is expressed as:

$$I(x_h, y_h) = \sum_{j=1}^{N} A_j \cos\left( P_j \left( (x_h - x_j)^2 + (y_h - y_j)^2 \right) \right) \qquad (1)$$

where $I(x_h, y_h)$ is the light intensity on the CGH, $(x_h, y_h)$ and $(x_j, y_j, z_j)$ are the coordinates on the CGH and of the 3D object, $A_j$ is the light intensity of the 3D object, $\lambda$ is the wavelength of the reference light, and $P_j = \pi p^2 / (\lambda z_j)$, where $p$ is the sampling interval on the CGH plane. Note that the coordinates $(x_j, y_j)$ and $(x_h, y_h)$ are normalized by $p$. The computational complexity of the above formula is $O(N N_x N_y)$, where $N_x$ and $N_y$ are the horizontal and vertical sampling numbers of the CGH. The computational cost is therefore enormous.
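The cost of the direct summation can be made concrete with a brute-force sketch. It assumes that the phase inside the cosine is $P_j((x_h - x_j)^2 + (y_h - y_j)^2)$, as suggested by the definitions above; the function name `cgh_direct` and the numerical values in the usage line are hypothetical and are not taken from the paper.

```python
import math


def cgh_direct(points, nx, ny, wavelength, pitch):
    """Brute-force CGH: one cosine per (object point, CGH sample) pair.

    points: list of (x_j, y_j, z_j, a_j). x_j and y_j are normalized by the
    sampling pitch; z_j uses the same length unit as wavelength and pitch.
    Returns a list of ny rows, each holding nx intensity values.
    """
    hologram = [[0.0] * nx for _ in range(ny)]
    for yh in range(ny):
        for xh in range(nx):
            total = 0.0
            for xj, yj, zj, aj in points:
                # P_j = pi * p^2 / (lambda * z_j), as defined in the text.
                pj = math.pi * pitch * pitch / (wavelength * zj)
                total += aj * math.cos(pj * ((xh - xj) ** 2 + (yh - yj) ** 2))
            hologram[yh][xh] = total
    return hologram


# Hypothetical example values: one object point, an 8 x 4 hologram.
hologram = cgh_direct([(2.0, 1.0, 0.5, 1.0)], nx=8, ny=4,
                      wavelength=633e-9, pitch=10e-6)
```

The three nested loops make the $O(N N_x N_y)$ cost visible: for 1,024 points and a 1,920 × 1,024 hologram, the inner statement runs roughly two billion times.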
To solve this problem, several software approaches have been proposed, for example recurrence approaches [6,7,8] and look-up table methods [9,10,11]. Another way to dramatically reduce the calculation time is the hardware approach, using an FPGA (field-programmable gate array) or a GPU (graphics processing unit). We have designed and built special-purpose computers for holography using FPGA technology, called HORN (HOlographic ReconstructioN) [12,13,14]. The FPGA-based approach showed excellent computational speed; however, it has the following restrictions: the high cost of developing the FPGA board, the long development time, and the technical know-how required for FPGA technology.
GPU-based approaches have already been applied in the optics field. In particular, GPUs have been used to accelerate CGH calculations [15,16,11,17] and reconstruction calculations in digital holography [18,19]. In 2007, NVIDIA released a new GPU architecture and its software development environment, CUDA (Compute Unified Device Architecture). CUDA allows us to program a GPU more easily than earlier environments such as HLSL and Cg. Since its release, many papers using NVIDIA GPUs and CUDA have been published in optics.
More recently, in December 2009, a new GPU of the HD5000 series (RV870) made by AMD was released. The RV870 GPU has a new architecture and is programmed through the OpenCL (Open Computing Language) software environment. The architecture of the RV870 GPU differs from that of the NVIDIA GPU. The RV870 GPU has huge potential for fast calculation because one GPU chip has over 1,000 floating-point processors, while one NVIDIA GPU chip has about 200. However, fast CGH calculation using the RV870 GPU has not been reported so far.
In this paper, we report fast CGH calculation using the RV870 GPU and OpenCL. Using these, we can calculate a CGH with a resolution of 1,920 × 1,024 from a 3D object consisting of 1,024 points in 30 ms. To the best of our knowledge, this article is the first report of using the RV870 GPU and OpenCL in optics. In addition, we compare the calculation performance of the RV870 GPU with that of a GPU made by NVIDIA.
In Section 2, we describe a fast CGH calculation on the AMD RV870 with OpenCL. In Section 3, we compare the performance of the RV870 GPU with that of the NVIDIA GPU. In Section 4, we conclude this work.

Fast calculation of computer-generated-hologram on AMD RV870 and OpenCL
The architecture of the RV870 GPU is shown in Fig. 1. The top level of the GPU consists of many SIMD (Single Instruction Multiple Data) engines. Each SIMD engine has 16 thread processors (TPs) and a small, high-speed shared memory.
In addition, each thread processor has four stream cores and one T-stream core. A stream core is a simple floating-point operation unit. The T-stream core has a floating-point operation unit as well as a special function unit, which can calculate special functions such as trigonometric and logarithmic functions at high speed. The stream cores in the same SIMD engine execute the same instructions; therefore, the SIMD engine behaves like a SIMD processor.
Calculation on the GPU using OpenCL is executed in the following steps:
(1) We initialize the GPU using OpenCL API (Application Program Interface) functions.
(2) We allocate the required amount of memory in the device memory shown in Fig. 1. The device memory is large but has a long access latency.
(3) We send the input data to the device memory.
(4) We send a kernel function from the host computer to the GPU. The kernel function is compiled to native GPU code by the OpenCL compiler, and the GPU executes it.
(5) We receive the calculated result from the device memory.
(6) We release the device memory and GPU resources.
Figure 2(a) shows the outline of the CGH calculation on the RV870 GPU with OpenCL. When calculating a CGH with a resolution of $N_x \times N_y$, we need to divide the CGH area into groups of size $T_x \times T_y$. Therefore, the number of groups is $(N_x / T_x) \times (N_y / T_y)$. In addition, each group has $T_x \times T_y$ items (in CUDA, a group and an item correspond to a block and a thread, respectively). Each group is allocated to an SIMD engine, and the items of the group calculate Eq. (1) simultaneously, each on a stream core of that SIMD engine.
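As a worked example of the decomposition into groups and items, the following sketch computes the layout for the 1,920 × 1,024 hologram and the 16 × 16 group size used for the timing in this section. The helper name `work_group_layout` is hypothetical, and the requirement that the hologram size be an exact multiple of the group size is an assumption made to keep the sketch simple.

```python
def work_group_layout(nx, ny, tx, ty):
    """Return (groups_x, groups_y, items_per_group) for an nx-by-ny hologram."""
    # Assumption: the hologram size is an exact multiple of the group size.
    if nx % tx or ny % ty:
        raise ValueError("hologram size must be a multiple of the group size")
    return nx // tx, ny // ty, tx * ty


groups_x, groups_y, items_per_group = work_group_layout(1920, 1024, 16, 16)
total_groups = groups_x * groups_y
```

With these values the hologram is covered by 120 × 64 = 7,680 groups of 256 items each, so every CGH sample is handled by exactly one item.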
Fig. 2(b) shows the kernel source code of the CGH calculation on the RV870 GPU with OpenCL. This source code is deliberately left unoptimized so that it is easy to understand. The optimization is described in the next subsection.
Each group and each item has an index, the group id and the local id, respectively. The OpenCL functions get_group_id(0) and get_group_id(1) return the horizontal and vertical indices of the group id, respectively. Likewise, get_local_id(0) and get_local_id(1) return the horizontal and vertical indices of the local id.
The arguments of the kernel function are the CGH data (d_hol), the object data (d_obj), the number of object points ($N$) and the CGH size ($N_x$, $N_y$). Each element of the object data (d_obj) holds the coordinates and the intensity of one object point as four float values (float4). In lines 5, 6 and 7 of Fig. 2(b), the variables x and y hold the coordinates $(x_h, y_h)$ on the CGH plane, and adr holds the address in the device memory at which the calculation result $I(x_h, y_h)$ is stored. In lines 11 to 16, one CGH point $I(x_h, y_h)$ is calculated by iterating over the $N$ object points. Although the source code appears to describe only a single kernel execution, in fact the stream cores corresponding to each local id and global id execute the kernel in parallel.
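The way an item could derive its CGH coordinates and output address from its group id and local id can be sketched as follows. Fig. 2(b) is not reproduced here, so the exact expressions are assumptions: the sketch uses the common mapping (group index times group size plus local index) and a row-major address, and the function name `item_coordinates` is hypothetical.

```python
def item_coordinates(group_id, local_id, tx, ty, nx):
    """Map a (group id, local id) pair to (x, y, adr) on the CGH plane.

    group_id and local_id are (horizontal, vertical) pairs, mirroring
    get_group_id(0/1) and get_local_id(0/1) in an OpenCL kernel.
    """
    x = group_id[0] * tx + local_id[0]
    y = group_id[1] * ty + local_id[1]
    adr = x + y * nx  # assumed row-major layout of the result buffer
    return x, y, adr


# Last item of the last group for a 1,920 x 1,024 CGH with 16 x 16 groups.
last = item_coordinates((119, 63), (15, 15), 16, 16, 1920)
```

Under these assumptions each item writes to a distinct address, which is why the items can run in parallel without coordinating their writes.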
When calculating a 1,920 × 1,024 CGH from a 3D object composed of 1,024 points, the kernel with $T_x \times T_y = 16 \times 16$ took about 215 ms, which is slow.

Optimization
The previous source code is not optimized. In this section, we optimize it to obtain a further speed-up. Figure 3 shows the optimized version of the kernel function in Fig. 2(b).
We previously proposed a fast CGH computation method using two recurrence formulas [8,12,13,14]. The recurrence algorithm computes the phase component of the cosine function in Eq. (1) as follows:

$$\Gamma_{n+1} = \Gamma_n + \delta_n, \qquad \delta_{n+1} = \delta_n + \Delta \qquad (2)$$

Here, we define $\Gamma_0 = P_j((x_h - x_j)^2 + (y_h - y_j)^2)$, $\delta_0 = P_j(2(x_h - x_j) + 1)$ and $\Delta = 2P_j$. In this way, we can compute the phase $\Gamma_n$ at the next coordinate from the two recurrence formulas. For more details, see Ref. [8].
In lines 15 to 18, we copy the object data from the device memory (d_obj) to the shared memory (s_obj). Because the shared memory is small, although high-speed, it can store only 256 object points at a time. Therefore, in line 13, we must iterate $N/256$ times. Note that barrier(CLK_LOCAL_MEM_FENCE) in line 18 performs a barrier synchronization; it is equivalent to the __syncthreads() function in CUDA.
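The recurrence can be checked numerically against the direct phase. The sketch below assumes the updates $\Gamma_{n+1} = \Gamma_n + \delta_n$ and $\delta_{n+1} = \delta_n + \Delta$, which follow from the definitions of $\Gamma_0$, $\delta_0$ and $\Delta$ when $x_h$ advances by one sample; the function names and the test values are hypothetical.

```python
def phases_direct(xh, yh, xj, yj, pj, count):
    """Phase at x_h, x_h + 1, ..., computed from the quadratic form."""
    return [pj * ((xh + n - xj) ** 2 + (yh - yj) ** 2) for n in range(count)]


def phases_recurrence(xh, yh, xj, yj, pj, count):
    """Same phases, using only two additions per step after the start-up."""
    gamma = pj * ((xh - xj) ** 2 + (yh - yj) ** 2)  # Gamma_0
    delta = pj * (2 * (xh - xj) + 1)                # delta_0
    step = 2 * pj                                   # Delta
    phases = []
    for _ in range(count):
        phases.append(gamma)
        gamma += delta   # Gamma_{n+1} = Gamma_n + delta_n
        delta += step    # delta_{n+1} = delta_n + Delta
    return phases


direct = phases_direct(3, 5, 10.0, 2.0, 1e-3, 8)
recurred = phases_recurrence(3, 5, 10.0, 2.0, 1e-3, 8)
max_error = max(abs(a - b) for a, b in zip(direct, recurred))
```

After the initial values are set, each further horizontal sample costs two additions instead of the subtractions, multiplications and squarings of the direct form. In floating point the two results agree only up to rounding error, which accumulates along the row.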
Loop unrolling is a well-known technique for optimizing a kernel function. It is realized by reducing the number of iterations and replicating the body of the loop. The benefit of loop unrolling is that it decreases the number of loop iterations, branch instructions and conditional instructions. In the optimized kernel, we applied loop unrolling to the loop over object points. In lines 20 to 51 of Fig. 3, we process four object points per iteration of the loop. In addition, we vectorize the operations in the loop using the float4 type in order to handle four object points at a time. For example, in line 22, the four subtractions are calculated simultaneously. In the same way, the kernel handles eight CGH points at the same time using the float8 type in lines 40 to 50.
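The structure of a four-way unrolled loop over object points can be illustrated as follows. This is only a structural sketch in Python, not the kernel of Fig. 3: it shows that the unrolled loop computes the same sum with a quarter of the iterations, and it does not demonstrate any speed-up. The function names and the tuple layout (x_j, y_j, P_j, A_j) are hypothetical.

```python
import math


def intensity_rolled(points, xh, yh):
    """Plain loop: one object point per iteration."""
    total = 0.0
    for xj, yj, pj, aj in points:
        total += aj * math.cos(pj * ((xh - xj) ** 2 + (yh - yj) ** 2))
    return total


def intensity_unrolled4(points, xh, yh):
    """Unrolled loop: four object points per iteration."""
    if len(points) % 4:
        raise ValueError("number of points must be a multiple of 4")
    total = 0.0
    for i in range(0, len(points), 4):
        quad = points[i:i + 4]
        # The four phases are formed together, as a float4 operation would.
        phase = [pj * ((xh - xj) ** 2 + (yh - yj) ** 2)
                 for xj, yj, pj, _ in quad]
        total += (quad[0][3] * math.cos(phase[0])
                  + quad[1][3] * math.cos(phase[1])
                  + quad[2][3] * math.cos(phase[2])
                  + quad[3][3] * math.cos(phase[3]))
    return total
```

Because the additions are grouped differently, the two functions may differ in the last few bits of the result; any comparison should therefore use a small tolerance.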
In lines 42 to 45, we used native cosine functions instead of the normal cosine function used in Fig. 2(b). The native cosine function computes the cosine at high speed using the hardware.
Table 1 shows a comparison of the calculation times for a CPU alone, an NVIDIA GPU and an AMD RV870 GPU. The size of the CGH is 1,920 × 1,024. The specifications of the personal computer are as follows: Intel Core 2 Quad Q6600 (we used one core for the calculation), 2 GB of memory, and Microsoft Windows XP SP3. We used a GeForce GTX260 as the NVIDIA GPU board with CUDA version 2.3 as its software development environment, and a RADEON HD5850 as the AMD GPU board with StreamSDK version 2.0 as its software development environment. The RADEON HD5850 GPU has 1,440 stream cores (namely, 18 SIMD engines) with a clock frequency of 725 MHz.
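The stream-core count quoted for the RADEON HD5850 is consistent with the architecture described earlier (16 thread processors per SIMD engine, each with four stream cores and one T-stream core). The short check below assumes that the T-stream core is counted as a stream core in the quoted total; the constant and function names are hypothetical.

```python
SIMD_ENGINES = 18                 # RADEON HD5850, as stated in the text
THREAD_PROCESSORS_PER_SIMD = 16   # per SIMD engine
CORES_PER_THREAD_PROCESSOR = 4 + 1  # four stream cores plus one T-stream core


def total_stream_cores(simd_engines, tps_per_simd, cores_per_tp):
    """Total number of floating-point units on the chip."""
    return simd_engines * tps_per_simd * cores_per_tp


stream_cores = total_stream_cores(SIMD_ENGINES,
                                  THREAD_PROCESSORS_PER_SIMD,
                                  CORES_PER_THREAD_PROCESSOR)
```

Under that assumption the product is 18 × 16 × 5 = 1,440, matching the figure given for the board.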

Results
We can see that the AMD GPU kernel optimized as described in Section 2.1 runs more than ten times faster than the kernel without the optimization. For the NVIDIA GPU calculation times in the table, we optimized the kernel using the same methods as described in Section 2.1, namely the recurrence algorithm, shared memory, loop unrolling, vectorization and native instructions. For the CPU-alone calculation times in the table, we used Eq. (2) for the CGH calculation. All calculation times on the AMD and NVIDIA GPUs are shorter than those on the CPU alone. In addition, the AMD GPU calculates a CGH approximately two times faster than the NVIDIA GPU.

Conclusion
In this paper, we described a fast CGH calculation using an AMD RV870 GPU, which has a new architecture, and its new software development environment, OpenCL. Many fast CGH calculation methods using an NVIDIA GPU and CUDA have already been reported in the optics field; however, a study using the RV870 GPU has not been reported so far. To the best of our knowledge, this article is the first report of using the RV870 GPU and OpenCL in optics. Using the RV870 GPU and OpenCL, we can calculate a CGH with a resolution of 1,920 × 1,024 from a 3D object consisting of 1,024 points in about 30 ms. This is approximately two times faster than the NVIDIA GPU.