Abstract
GPU programming models enable and encourage massively parallel programming with over a million threads, requiring extreme parallelism to achieve good performance. Massive parallelism brings significant correctness challenges by increasing the likelihood of bugs as the number of thread interleavings balloons. Conventional dynamic safety analyses struggle to run at this scale.
We present BARRACUDA, a concurrency bug detector for GPU programs written in Nvidia’s CUDA language. BARRACUDA handles a wider range of parallelism constructs than previous work, including branch operations, low-level atomics and memory fences, which allows BARRACUDA to detect new classes of concurrency bugs. BARRACUDA operates at the binary level for increased compatibility with existing code, leveraging a new binary instrumentation framework that is extensible to other dynamic analyses. BARRACUDA incorporates a number of novel optimizations that are crucial for scaling concurrency bug detection to over a million threads.
BARRACUDA: binary-level analysis of runtime RAces in CUDA programs
PLDI 2017: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation