ABSTRACT
Many modern parallel computing systems are heterogeneous at the node level. Such nodes may comprise general-purpose CPUs and accelerators (such as GPUs or the Intel Xeon Phi) that provide high performance with suitable energy-consumption characteristics. However, exploiting the available performance of heterogeneous architectures can be challenging. Various parallel programming frameworks exist (such as OpenMP, OpenCL, OpenACC, and CUDA), and selecting the one that is suitable for a target context is not straightforward. In this paper, we study empirically the characteristics of OpenMP, OpenACC, OpenCL, and CUDA with respect to programming productivity, performance, and energy consumption. To evaluate programming productivity we use our tool CodeStat, which enables us to determine the percentage of code lines required to parallelize the code with a specific framework. We use our tools MeterPU and x-MeterPU to evaluate energy consumption and performance. Experiments are conducted using the industry-standard SPEC ACCEL benchmark suite and the Rodinia benchmark suite for accelerated computing, on heterogeneous systems that combine Intel Xeon E5 processors with a GPU accelerator or an Intel Xeon Phi coprocessor.
- Erika Abraham, Costas Bekas, Ivona Brandic, Samir Genaim, Einar Broch Johnsen, Ivan Kondov, Sabri Pllana, and Achim Streit. 2015. Preparing HPC Applications for Exascale: Challenges and Recommendations. In 18th International Conference on Network-Based Information Systems (NBiS). 401--406. https://doi.org/10.1109/NBiS.2015.61
- Siegfried Benkner, Sabri Pllana, Jesper Larsson Traff, Philippas Tsigas, Uwe Dolinsky, Cedric Augonnet, Beverly Bachmayer, Christoph Kessler, David Moloney, and Vitaly Osipov. 2011. PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems. IEEE Micro 31, 5 (Sept 2011), 28--41.
- Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In IISWC 2009. IEEE, 44--54.
- George Chrysos. 2014. Intel® Xeon Phi™ Coprocessor: The Architecture. Intel Whitepaper (2014).
- Daniel Grzonka, Agnieszka Jakobik, Joanna Kolodziej, and Sabri Pllana. 2017. Using a multi-agent system and artificial intelligence for monitoring and improving the cloud performance and security. Future Generation Computer Systems (2017). https://doi.org/10.1016/j.future.2017.05.046
- Guido Juckeland, William Brantley, Sunita Chandrasekaran, Barbara Chapman, Shuai Che, Mathew Colgrove, Huiyu Feng, Alexander Grund, Robert Henschel, Wen-Mei W. Hwu, et al. 2014. SPEC ACCEL: a standard application suite for measuring hardware accelerator performance. In International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems. Springer, 46--67.
- Christoph Kessler, Usman Dastgeer, Samuel Thibault, Raymond Namyst, Andrew Richards, Uwe Dolinsky, Siegfried Benkner, Jesper Larsson Traff, and Sabri Pllana. 2012. Programmability and performance portability aspects of heterogeneous multi-/manycore systems. In 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1403--1408. https://doi.org/10.1109/DATE.2012.6176582
- Xuechao Li, Po-Chou Shih, Jeffrey Overbey, Cheryl Seals, and Alvin Lim. 2016. Comparing programmer productivity in OpenACC and CUDA: an empirical investigation. International Journal of Computer Science, Engineering and Applications (IJCSEA) 6, 5 (2016), 1--15. https://doi.org/10.5121/ijcsea.2016.6501
- Lu Li and Christoph Kessler. 2016. MeterPU: A Generic Measurement Abstraction API Enabling Energy-tuned Skeleton Backend Selection. Journal of Supercomputing (2016), 1--16. https://doi.org/10.1007/s11227-016-1792-x
- Suejb Memeti and Sabri Pllana. 2015. Accelerating DNA Sequence Analysis Using Intel(R) Xeon Phi(TM). In 2015 IEEE Trustcom/BigDataSE/ISPA, Vol. 3. 222--227.
- Sparsh Mittal and Jeffrey S. Vetter. 2015. A survey of CPU-GPU heterogeneous computing techniques. ACM Computing Surveys (CSUR) 47, 4 (2015), 69.
- NVIDIA. 2016. CUDA C Programming Guide. http://docs.nvidia.com/cuda/cuda-c-programming-guide/. (September 2016). Accessed: 2017-03-06.
- NVIDIA. 2017. What is GPU-Accelerated Computing? http://www.nvidia.com/object/what-is-gpu-computing.html. (April 2017). Accessed: 2017-04-03.
- OpenMP. 2013. OpenMP 4.0 Specifications. http://www.openmp.org/specifications/. (July 2013). Accessed: 2017-03-10.
- Rodinia. 2015. Rodinia: Accelerating Compute-Intensive Applications with Accelerators. (December 2015). http://www.cs.virginia.edu/skadron/wiki/rodinia/index.php/Rodinia:Accelerating_Compute-Intensive_Applications_with_Accelerators. Accessed: 2017-04-10.
- SPEC. 2017. SPEC ACCEL: Read Me First. https://www.spec.org/accel/docs/readme1st.html#Q11. (February 2017). Accessed: 2017-04-10.
- John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering 12, 3 (2010), 66--73.
- Ching-Lung Su, Po-Yu Chen, Chun-Chieh Lan, Long-Sheng Huang, and Kuo-Hsuan Wu. 2012. Overview and comparison of OpenCL and CUDA technology for GPGPU. In 2012 IEEE Asia Pacific Conference on Circuits and Systems. 448--451. https://doi.org/10.1109/APCCAS.2012.6419068
- Andre Viebke and Sabri Pllana. 2015. The Potential of the Intel(R) Xeon Phi for Supervised Deep Learning. In 2015 IEEE 17th International Conference on High Performance Computing and Communications. 758--765. https://doi.org/10.1109/HPCC-CSS-ICESS.2015.45
- Sandra Wienke, Paul Springer, Christian Terboven, and Dieter an Mey. 2012. OpenACC: First Experiences with Real-world Applications. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par'12). Springer-Verlag, Berlin, Heidelberg, 859--870.
- Yonghong Yan, Barbara M. Chapman, and Michael Wong. 2015. A comparison of heterogeneous and manycore programming models. https://goo.gl/81A4iV. (March 2015). Accessed: 2017-03-31.
Index Terms: Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: Programming Productivity, Performance, and Energy Consumption