BARRACUDA: binary-level analysis of runtime RAces in CUDA programs

Authors:
Ariel Eizenberg, University of Pennsylvania, USA
Yuanfeng Peng, University of Pennsylvania, USA
Toma Pigli, University of Pennsylvania, USA
William Mansky, Princeton University, USA
Joseph Devietti, University of Pennsylvania, USA

ACM SIGPLAN Notices, Volume 52, Issue 6, June 2017, pp. 126–140. https://doi.org/10.1145/3140587.3062342

Published: 14 June 2017

Abstract

GPU programming models enable and encourage massively parallel programming with over a million threads, requiring extreme parallelism to achieve good performance. Massive parallelism brings significant correctness challenges by increasing the possibility for bugs as the number of thread interleavings balloons. Conventional dynamic safety analyses struggle to run at this scale.

We present BARRACUDA, a concurrency bug detector for GPU programs written in Nvidia’s CUDA language. BARRACUDA handles a wider range of parallelism constructs than previous work, including branch operations, low-level atomics and memory fences, which allows BARRACUDA to detect new classes of concurrency bugs. BARRACUDA operates at the binary level for increased compatibility with existing code, leveraging a new binary instrumentation framework that is extensible to other dynamic analyses. BARRACUDA incorporates a number of novel optimizations that are crucial for scaling concurrency bug detection to over a million threads.

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Software maintenance tools

Published in

ACM SIGPLAN Notices, Volume 52, Issue 6 (PLDI '17), June 2017, 708 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3140587
Editor: Matthew Fluet

PLDI 2017: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2017, 708 pages
ISBN: 9781450349888
DOI: 10.1145/3062341
General Chair: Albert Cohen (Inria, France)
Program Chair: Martin Vechev (DeepCode, Switzerland / ETH Zurich, Switzerland)

Copyright © 2017 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
CUDA
GPUs
data race detection
Qualifiers
- article