A Performance Evaluation of OpenCL and Intel Cilk Plus on a Graphic Rendering Problem



Introduction
Speed and performance have become critical factors in defining computing systems and the Internet. Over the years, computers have moved from being built around a single processor to multiple processors, all in a bid to increase speed and performance. Parallel computing is one of the most exciting technologies to achieve prominence since the invention of electronic computers [1]. Parallel computing can be seen as the use of a parallel computer to reduce the time needed to solve a single computational problem [2]. Many scientific and technological tasks today demand high computing power. One such area is graphic rendering. This research exploits the power of parallel computing to improve the performance of the ray tracing algorithm in the domain of graphic rendering.
The rise of multicore is bringing shared-memory parallelism to the masses. The community is struggling to identify which parallel models are most productive [3].
However, with this proliferation of parallel computing devices comes the challenge of programmability. Programmers must consider data dependencies, race conditions, communications, etc. As the available parallel computing resources go beyond just the CPU, even more programming complexities arise. To access all available resources, the same routine or algorithm may need to be coded multiple times and in multiple ways. The programmer must consider various type-, vendor-, or platform-specific programming models and/or APIs [4].
Writing a single source code that is portable across all platforms is a challenge. A better way is needed. In 2008, Apple proposed a draft specification for OpenCL (Open Computing Language) [5] to the Khronos Group. The Khronos Group explains that OpenCL lets programmers write a single portable program that uses all resources in the heterogeneous platform [6]. The Khronos Group also asserts that OpenCL is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors. OpenCL presents programmers with a standard API with which to program; participating vendors write their device drivers to conform to its specification. This abstraction enables programmers to focus less on device-specific programming details. OpenCL works on different hardware, but the software still needs to be adapted for each architecture [7].
Intel Cilk Plus is promoted as the easiest, quickest way to harness the power of both multicore and vector processing. It is an extension of the C and C++ programming languages, designed for multithreaded parallel computing [8]. On July 31, 2009, Cilk Arts, producers of the Cilk++ programming language, announced that its products and engineering team were now part of Intel Corporation. Intel and Cilk Arts integrated and advanced the technology further, resulting in the September 2010 release of Intel Cilk Plus. To make parallel programming easier still, Phoronix [9] reported that Intel was planning to implement the Intel Cilk Plus C and C++ extensions in the GNU Compiler Collection (GCC). This is already a reality, as GCC 4.9 has Intel Cilk Plus implemented.
Different strategies have evolved over the years on how to develop a parallel application. According to Luis [10], there are three (3) basic strategies:
a. Strategy 1: Automatic Parallelization: The ultimate goal is to relieve programmers of the parallelizing task. It takes rough code and produces efficient parallel object code with little or no additional work by the programmer. The strategy is described in Fig. 1.1.
b. Strategy 2: Parallel Libraries: This approach has been more successful than the previous one. The basic idea is to encapsulate some of the parallel code that is common to several applications into a parallel library that can be implemented in a very efficient way. Such a library can then be reused by several codes. This is described in Fig. 1
OpenCL and Intel Cilk Plus are fast taking center stage in parallel programming, and this research carries out a performance evaluation of the two in the domain of parallel graphic rendering.

Research Objective
The objectives of this research are as follows:
i. To design a parallel ray-tracing engine using OpenCL and another using Intel Cilk Plus.
ii. To compare the performance of the OpenCL implementation against that of Intel Cilk Plus.

Materials and Methods
In this paper, a detailed study of graphic rendering and the impact of high performance computing was carried out. The performance of two (2) key parallel programming tools, OpenCL and Intel Cilk Plus, was tested on the ray-tracing algorithm for graphic rendering. Each implementation was run 10 times without restarting at rendering depths of 0, 1, 2, 3, 4 and 16, and the timings for each implementation were stored in separate files. These results were then tabulated and plotted as graphs in order to analyze them and draw conclusions. It is important to reiterate here that the focus of this research work is on the OpenCL and Intel Cilk Plus implementations.
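The measurement procedure described above could be sketched roughly as follows. This is a minimal illustration, not the harness used in the study: the function names `time_runs` and `save_timings`, the CSV layout, and the use of `std::chrono::steady_clock` are all assumptions made for the example, and `render` stands in for whichever implementation is being measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: calls `render(depth)` `runs` times back to back
// (no restart between runs) and returns each run's wall-clock time in ms.
std::vector<double> time_runs(const std::function<void(int)>& render,
                              int depth, int runs) {
    assert(runs > 0);
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(runs));
    for (int r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        render(depth);
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}

// Hypothetical helper: writes one "run,depth,ms" line per run to `path`.
// Returns false if the file could not be opened or written.
bool save_timings(const std::string& path, int depth,
                  const std::vector<double>& ms) {
    std::ofstream out(path);
    if (!out) return false;
    out << "run,depth,ms\n";
    for (std::size_t i = 0; i < ms.size(); ++i)
        out << (i + 1) << ',' << depth << ',' << ms[i] << '\n';
    return static_cast<bool>(out);
}
```

Under these assumptions, a driver would loop over the depths 0, 1, 2, 3, 4 and 16, call `time_runs(render, depth, 10)` for each implementation, and write each result to its own file with `save_timings`.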

Results
Here we examine the implementations at the different ray-tracing depths on all the three (3) test computers.

a. Ray-tracing Depth of 0:
The results obtained at this depth are shown in Fig. 1.4, which clearly shows that the different implementations behave slightly differently on the different test computers. The implementations on PC Two and PC Three show more uniform behavior than those on PC One at a rendering depth of 0, where there is theoretically no ray fired into the rendering scene. It is the point of minimal computational demand. On the first run across all the PCs, OpenCL showed a rather higher run time; this can be attributed to the initial time OpenCL takes to set up its computing devices, context and command queues. At this depth Intel Cilk Plus seems to perform better than OpenCL. Overall, these differences can be attributed to the memory management of the operating systems and to the hardware specifications.
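Because the first run carries a one-time setup cost, it can be useful to report it separately from the remaining runs. The sketch below shows one way to do so; the `RunSummary` type and `summarize` function are hypothetical names introduced for this example, and the original analysis may have treated the first run differently.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical summary of one implementation's timings at one depth.
struct RunSummary {
    double first_run_ms;    // run 1, which includes any one-time setup cost
    double steady_mean_ms;  // arithmetic mean of runs 2..N
};

// Splits the timings into the first run and the mean of the remaining runs.
// Requires at least two runs so that the steady-state mean is defined.
RunSummary summarize(const std::vector<double>& ms) {
    assert(ms.size() >= 2);
    const double rest = std::accumulate(ms.begin() + 1, ms.end(), 0.0);
    return {ms.front(), rest / static_cast<double>(ms.size() - 1)};
}
```

Comparing `first_run_ms` against `steady_mean_ms` gives a rough estimate of how much of the first run is setup rather than rendering.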
PC One, which has an AMD processor, showed rather haphazard behavior, especially on the recursive implementation. This can be explained by the fact that recursive algorithms utilize more memory than iterative ones. Also, the processor in PC One is much slower than those in PC Two and PC Three. At this depth there are no secondary rays due to reflection in the scene. Here the Intel Cilk Plus implementation showed varied results on the three (3) test computers, but the OpenCL CPU implementation showed more consistent behavior across the test computers. On the first run (RUN = 1) OpenCL takes a lot of time, but it stabilizes from the second run (RUN = 2). Again, this can be attributed to the time it takes OpenCL to set up the computing devices, context, and command queues. Clearly, the OpenCL CPU implementation performed far better than the Intel Cilk Plus implementation at this ray-tracing depth.
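The difference between a recursive and an iterative treatment of ray-tracing depth can be illustrated with a deliberately simplified model. The sketch assumes that every bounce hits a surface with the same local intensity `local` and the same reflectivity `k`, and that `depth` counts the secondary reflection bounces; these assumptions, and both function names, are introduced here for illustration and are not taken from the implementations evaluated in this paper.

```cpp
#include <cassert>

// Recursive form: each level keeps a stack frame alive until the deeper
// reflection returns, so stack use grows with `depth`.
double shade_recursive(double local, double k, int depth) {
    assert(depth >= 0);
    if (depth == 0) return local;  // primary ray only, no secondary rays
    return local + k * shade_recursive(local, k, depth - 1);
}

// Iterative form: the same sum is accumulated in a loop, carrying the
// product of reflectivities along the path in `weight`. Memory use is
// constant regardless of `depth`.
double shade_iterative(double local, double k, int depth) {
    assert(depth >= 0);
    double result = 0.0;
    double weight = 1.0;  // reflectivity accumulated over previous bounces
    for (int bounce = 0; bounce <= depth; ++bounce) {
        result += weight * local;
        weight *= k;
    }
    return result;
}
```

Both functions compute local * (1 + k + k^2 + ... + k^depth); they differ only in where the intermediate state is stored, which is the property the memory argument above relies on.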
c. Ray-tracing Depth of 2:
Fig. 1.6 shows the performance of the implementations on the three (3) test PCs. Here again, OpenCL CPU performed better than Intel Cilk Plus on all test computers. The C++ Iterative and OpenCL CPU implementations show more consistency across the test computers, but the C++ Recursive and Intel Cilk Plus implementations behaved differently from one computer to another. Perhaps this is due to the impact of the operating system and the hardware specifications, especially the demands on memory.

d. Ray-tracing Depth of 3:
Fig. 1.7 shows the rendering result at ray-tracing depth = 3 on the three (3) test computers. Here the OpenCL CPU implementation still performs better than Intel Cilk Plus, and it seems that as computational demands increase, OpenCL CPU remains consistent and stable. This is slightly not so with the Intel Cilk Plus implementation.

PC Three seems to give a more accurate result compared to PC One and PC Two, because a parallel program is expected to be faster than its serial version. Perhaps there are still some race conditions in the Intel Cilk Plus implementation, or perhaps the Intel Cilk Plus implementation will perform better as the number of processors increases.
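The expectation that a parallel program should be faster than its serial version can be written as a speedup ratio. The symbols are introduced here for illustration and do not appear in the original text: T_serial denotes the run time of a serial implementation and T_parallel that of a parallel implementation on the same computer at the same ray-tracing depth.

```latex
S = \frac{T_{\text{serial}}}{T_{\text{parallel}}},
\qquad
S > 1 \iff T_{\text{parallel}} < T_{\text{serial}}
```

Under this reading, a result is "acceptable" in the sense used above when S exceeds 1.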

e. Ray-tracing Depth of 4:
Fig. 1.8 shows the performance at this ray-tracing depth. Here the consistency of OpenCL cannot be equaled across the test computers. Again, PC Three gave a more acceptable result because it has more computing devices, clocked at 2.4 GHz; up to a certain limit, the more computing devices are available, the better parallel implementations perform.
Fig. 1.9 shows the performance at a ray-tracing depth of 16, the maximum depth for our implementation. Again, PC Three gives a more realistic result at a ray-tracing depth of 16. Also, OpenCL CPU still outperforms Intel Cilk Plus at this depth. It was observed that the average run time increased as the rendering depth increased.
Article no. BJMCS.19422

Conclusion
The research shows that at all the depths tested, OpenCL showed more consistency than Intel Cilk Plus after the first run without restarting. Also, across all rendering depths, OpenCL seems to take longer on the first run before evening out. Overall, the more acceptable result is that of PC Three, because parallel implementations are expected to outperform serial ones, especially in an embarrassingly parallel problem such as the one in this research. It would be interesting to see the performance of OpenCL and Intel Cilk Plus implementations of ray tracing on a massively parallel system.
Finally, it was observed that as the ray-tracing depth increases, the performance difference between the parallel and the serial implementations widens, as seen on PC Three; between the two parallel implementations, OpenCL performed better than Intel Cilk Plus in these results.
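The sense in which ray tracing is embarrassingly parallel can be illustrated with a short sketch in which each worker fills a disjoint band of image rows. The example uses standard C++ `std::thread` purely as a stand-in; it is neither the OpenCL nor the Intel Cilk Plus code evaluated here. The names `shade_pixel` and `render_rows_parallel` are hypothetical, and the per-pixel work is a placeholder rather than a real ray trace.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder per-pixel work; a real renderer would trace a ray here.
inline float shade_pixel(int x, int y) {
    return static_cast<float>(x + y);
}

// Renders a width x height image by giving each worker thread a contiguous
// band of rows. The bands are disjoint, so no two threads write the same
// pixel and no locking is needed.
std::vector<float> render_rows_parallel(int width, int height,
                                        unsigned workers) {
    assert(width > 0 && height > 0);
    const int n = static_cast<int>(std::max(1u, workers));
    const std::size_t w = static_cast<std::size_t>(width);
    std::vector<float> image(w * static_cast<std::size_t>(height));
    std::vector<std::thread> pool;
    const int band = (height + n - 1) / n;  // rows per worker, rounded up
    for (int i = 0; i < n; ++i) {
        const int y0 = i * band;
        const int y1 = std::min(height, y0 + band);
        if (y0 >= y1) break;  // more workers than rows
        pool.emplace_back([&image, w, width, y0, y1] {
            for (int y = y0; y < y1; ++y)
                for (int x = 0; x < width; ++x)
                    image[static_cast<std::size_t>(y) * w +
                          static_cast<std::size_t>(x)] = shade_pixel(x, y);
        });
    }
    for (std::thread& t : pool) t.join();
    return image;
}
```

Because pixels do not depend on one another, the only coordination required is the final `join`. On some toolchains the program must be linked with thread support (for example `-pthread` with GCC).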