OpenCL Altera SDK v. 14.0 vs. v. 13.1: A Benchmark Study

The Altera SDK for OpenCL allows programmers to write simple code in OpenCL while abstracting away the design complexity of the field-programmable gate array (FPGA). Kernels are synthesized into equivalent circuits that use the FPGA hardware resources: adaptive logic modules (ALMs), DSP blocks, and memory blocks. In this study, we developed a set of fifteen benchmarks, each with its own characteristics: with or without loop unrolling, with or without atomic operations, and with one or multiple kernels per file.


Introduction
OpenCL (Open Computing Language) is an open framework for parallel programming across heterogeneous platforms such as CPUs, GPUs, and DSPs [1]. The OpenCL programming model consists of two programs. The first is the host program, which is usually written in C/C++ and is responsible for loading the OpenCL programs, memory management, data transfer, and error checking. The second is the device code, which is written in OpenCL and can run on available devices such as GPUs, DSPs, or FPGAs.
In OpenCL, a kernel can be executed by a large number of work-items (threads). Work-items are organized in one, two, or three dimensions and are divided into blocks, which can themselves be multidimensional. Each block is called a workgroup. A workgroup can contain up to 1024 or 2048 work-items, depending on device capability. All work-items inside a workgroup can be synchronized using a barrier. However, there is no synchronization between workgroups, which may be executed in any order.
The Altera SDK for OpenCL allows the programmer to implement parallel algorithms on an FPGA with a high level of hardware abstraction. The Altera Offline Compiler (AOC) generates the Altera executable file, which runs on the FPGA (DE5 in this study). Each kernel is synthesized into an equivalent circuit on the FPGA board, and each circuit uses a set of hardware resources. FPGAs implement parallel algorithms using a pipelined architecture in which input data passes through a sequence of stages [4,5]. The main FPGA resources are adaptive logic modules (ALMs), digital signal processing (DSP) blocks, and memory blocks.
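The work-item and workgroup organization described above can be illustrated with a small sketch. This is plain Python rather than OpenCL, and the sizes (16 work-items in workgroups of 4) are hypothetical values chosen only for illustration. It assumes a one-dimensional index space with zero global offset, in which a global work-item id equals the group id times the workgroup size plus the local id.

```python
import random

def decompose(global_id, local_size):
    """Return (group_id, local_id) for a 1-D global work-item id (zero offset)."""
    return divmod(global_id, local_size)

def workgroups(global_size, local_size):
    """List the global ids that belong to each workgroup."""
    if global_size % local_size != 0:
        # Assumption: OpenCL 1.x semantics, where the global size must be
        # a multiple of the workgroup size.
        raise ValueError("global size must be a multiple of the workgroup size")
    return [list(range(g * local_size, (g + 1) * local_size))
            for g in range(global_size // local_size)]

# Hypothetical example: 16 work-items split into workgroups of 4.
groups = workgroups(global_size=16, local_size=4)

# Workgroups may run in any order, so a correct kernel must not depend on it.
# Here each workgroup sums its own slice; the total is the same for any order.
data = list(range(16))
order = list(range(len(groups)))
random.shuffle(order)
partial = {g: sum(data[i] for i in groups[g]) for g in order}
total = sum(partial.values())
```

Because each workgroup only touches its own slice, shuffling the execution order does not change `total`, which mirrors the rule that workgroups cannot synchronize with each other.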
AOC creates a hardware configuration file. Several compiler parameters can be combined for optimization purposes. Compilation is very lengthy and can take anywhere from minutes to several days. For the benchmarks in this study, compilation time ranged from slightly over one hour to slightly over six hours.
The Altera SDK 14.0 introduces new features, including support for hard floating point, channel extensions, and new data types (float3), among others [5]. Our motivation for this study is to show how these new features affect performance by compiling and running a set of benchmarks.

Experiment and Results Discussion
Several studies address the comparison of different compilers [3,4]. To compare the two Altera SDK versions, we developed a set of fifteen benchmarks. The benchmarks vary in the following characteristics: zero, one, or more atomic operations; with or without loop unrolling; and a single kernel or multiple kernels per file.
The benchmarks fall into two classes. The first is pure memory access, where the whole kernel consists of memory read or write operations. The read and write operations may be atomic or non-atomic, and a kernel may use the same or different atomic operations. Atomic add and atomic exchange are used in this study.
The second class consists of arithmetic operations on floating-point values, covering the four basic operations (addition, subtraction, multiplication, and division). The OpenCL kernels may repeat the same code many times, and loop unrolling is used in some kernels. Other benchmarks run the same kernels without loop unrolling. Finally, the benchmarks test repeating the same kernel in a file up to seven times, and placing more than one kernel with different characteristics in the same file. In summary, the fifteen benchmarks together cover all of the above attributes.
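To make the loop-unrolling variants concrete, the following is a minimal sketch in plain Python, not the benchmark code itself. The function names and the multiply-add body are hypothetical. In an OpenCL kernel, unrolling is normally requested with a compiler directive such as `#pragma unroll` rather than written by hand; the manual version here only shows what the transformation does to the loop body.

```python
def scale_add_rolled(xs, a, b):
    """One multiply-add per loop iteration."""
    out = [0.0] * len(xs)
    for i in range(len(xs)):
        out[i] = a * xs[i] + b
    return out

def scale_add_unrolled4(xs, a, b):
    """Same computation with the loop body replicated four times."""
    n = len(xs)
    out = [0.0] * n
    i = 0
    while i + 4 <= n:
        # Four independent multiply-adds per iteration; on an FPGA these
        # could map to four parallel copies of the arithmetic hardware.
        out[i] = a * xs[i] + b
        out[i + 1] = a * xs[i + 1] + b
        out[i + 2] = a * xs[i + 2] + b
        out[i + 3] = a * xs[i + 3] + b
        i += 4
    while i < n:
        # Remainder loop for lengths that are not a multiple of four.
        out[i] = a * xs[i] + b
        i += 1
    return out

# Hypothetical input: ten floating-point values.
values = [0.5 * k for k in range(10)]
rolled = scale_add_rolled(values, 2.0, 1.0)
unrolled = scale_add_unrolled4(values, 2.0, 1.0)
```

Both versions perform the same operations on each element, so the results are identical. What changes is the trade-off the benchmarks measure: fewer loop iterations in exchange for more replicated logic.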
The following parameters are measured: logic utilization (in ALMs), RAM blocks, total memory bits, clock frequency, total registers, and compile time.
Other parameters that could be added are the sizes of the configuration and backup files created. Our results show that the files created by Altera 14.0 are 400 MB smaller. The FPGA device used in the experiments contains 234,720 ALMs, 256 DSP blocks, 52,428,800 block memory bits, and 2,560 RAM blocks.
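Resource counts are easier to compare when expressed as a fraction of the device capacity listed above. The sketch below shows one way to do that in Python. The capacities come from the text; the usage figures in `example_used` are hypothetical and are not measured results.

```python
# Device capacities as stated in the text.
DEVICE_CAPACITY = {
    "ALMs": 234_720,
    "DSP blocks": 256,
    "Block memory bits": 52_428_800,
    "RAM blocks": 2_560,
}

def utilization(used, capacity=DEVICE_CAPACITY):
    """Return the percentage of each device resource consumed by a design."""
    return {name: 100.0 * amount / capacity[name] for name, amount in used.items()}

# Hypothetical usage figures for one compiled kernel (not measured data).
example_used = {"ALMs": 58_680, "DSP blocks": 64, "RAM blocks": 640}
report = utilization(example_used)

for name, percent in report.items():
    print(f"{name}: {percent:.1f}% of device capacity")
```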

Conclusion
Our study shows that, for the benchmarks above, the Altera SDK 14.0 provides better resource utilization: fewer resources are needed than with the Altera SDK 13.0. Although the clock speed may decrease or increase, the change is insignificant.
We recommend using Altera SDK 14.0 instead of Altera SDK 13.0. In a future paper, the comparison will cover the most recent Intel FPGA compilers.

Table 2: Altera 14.0 benchmark results. Columns: Bench1, Bench2, Bench3, Bench4, Bench5, Bench6, Bench7, Bench8, Bench9, Bench10, Bench11, Bench12, Bench13, Bench14, Bench15.
Comparing the two sets of results, it is clear that the Altera SDK 14.0 optimizes resources better and requires less compilation time. On the other hand, the clock frequency is not necessarily improved and may even decrease. Dividing the values in Table 2 by the corresponding values in Table 1 and averaging each row yields the results shown in Table 3. This gives a comparison between the two versions with respect to the parameters mentioned above.
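The ratio calculation described above (divide each 14.0 value by the corresponding earlier-version value, then average each row) can be sketched as follows. The numbers are hypothetical placeholders for three benchmarks, not the measured results, and the row names and the function name `ratio_rows` are illustrative.

```python
from statistics import mean

def ratio_rows(table_new, table_old):
    """For each parameter (row), average the per-benchmark ratio new / old."""
    result = {}
    for param, new_values in table_new.items():
        old_values = table_old[param]
        if len(new_values) != len(old_values):
            raise ValueError(f"row {param!r} has mismatched lengths")
        result[param] = mean(n / o for n, o in zip(new_values, old_values))
    return result

# Hypothetical values for three benchmarks (not the paper's data).
table_old = {  # earlier SDK version
    "Logic utilization (ALMs)": [100.0, 200.0, 400.0],
    "Compile time (min)": [80.0, 120.0, 160.0],
}
table_new = {  # SDK 14.0
    "Logic utilization (ALMs)": [50.0, 100.0, 200.0],
    "Compile time (min)": [60.0, 90.0, 120.0],
}

averaged = ratio_rows(table_new, table_old)
```

With this convention, a row average below 1 means that version 14.0 used less of that resource (or less time) on average, and a value above 1 means it used more.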