
BARRACUDA: binary-level analysis of runtime RAces in CUDA programs

Published: 14 June 2017

Abstract

GPU programming models enable and encourage massively parallel programming with over a million threads, requiring extreme parallelism to achieve good performance. Massive parallelism brings significant correctness challenges by increasing the potential for bugs as the number of thread interleavings balloons. Conventional dynamic safety analyses struggle to run at this scale.

We present BARRACUDA, a concurrency bug detector for GPU programs written in Nvidia’s CUDA language. BARRACUDA handles a wider range of parallelism constructs than previous work, including branch operations, low-level atomics and memory fences, which allows BARRACUDA to detect new classes of concurrency bugs. BARRACUDA operates at the binary level for increased compatibility with existing code, leveraging a new binary instrumentation framework that is extensible to other dynamic analyses. BARRACUDA incorporates a number of novel optimizations that are crucial for scaling concurrency bug detection to over a million threads.
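The abstract does not spell out BARRACUDA's detection algorithm. As background only, the sketch below shows the classic vector-clock happens-before check (in the style of Fidge and Mattern) that many dynamic race detectors build on. Each thread carries a vector clock. Each address keeps the clock of its last write plus a per-thread read vector. An access is reported as a race if a conflicting earlier access is not ordered before it. Everything here is a hypothetical simplification and not the paper's method: the `RaceDetector` class, integer addresses, and modelling `__syncthreads()` as a barrier across all threads. It deliberately ignores warps, branch divergence, atomics, and memory fences, all of which the abstract says BARRACUDA handles, and it has none of BARRACUDA's scaling optimizations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Illustrative happens-before race detector using full vector clocks.
// Hypothetical sketch: NOT BARRACUDA's implementation; no warps, atomics, or fences.
using VectorClock = std::vector<unsigned>;

// True if every component of a is <= the matching component of b,
// i.e. the events summarized by a are ordered before (or equal to) b.
static bool leq(const VectorClock& a, const VectorClock& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] > b[i]) return false;
    return true;
}

// Shadow state kept for each monitored address.
struct Shadow {
    VectorClock lastWrite;  // full clock of the most recent write
    VectorClock reads;      // reads[u] = thread u's own clock at its latest read
};

class RaceDetector {
public:
    explicit RaceDetector(std::size_t numThreads)
        : n_(numThreads), clocks_(numThreads, VectorClock(numThreads, 0)) {
        for (std::size_t t = 0; t < n_; ++t) clocks_[t][t] = 1;  // first epoch
    }

    // Records a read by thread t; returns true if it races with a prior write.
    bool read(std::size_t t, std::size_t addr) {
        assert(t < n_ && "thread id out of range");
        Shadow& s = shadowFor(addr);
        bool race = !leq(s.lastWrite, clocks_[t]);
        s.reads[t] = clocks_[t][t];
        return race;
    }

    // Records a write by thread t; returns true if it races with a prior read or write.
    bool write(std::size_t t, std::size_t addr) {
        assert(t < n_ && "thread id out of range");
        Shadow& s = shadowFor(addr);
        bool race = !leq(s.lastWrite, clocks_[t]) || !leq(s.reads, clocks_[t]);
        s.lastWrite = clocks_[t];
        return race;
    }

    // Models a barrier such as __syncthreads(): every clock becomes the
    // component-wise join of all clocks, then each thread starts a new epoch.
    void barrier() {
        VectorClock joined(n_, 0);
        for (const VectorClock& c : clocks_)
            for (std::size_t i = 0; i < n_; ++i)
                joined[i] = std::max(joined[i], c[i]);
        for (std::size_t t = 0; t < n_; ++t) {
            clocks_[t] = joined;
            clocks_[t][t] += 1;
        }
    }

private:
    // Lazily creates zeroed shadow state for an address on first access.
    Shadow& shadowFor(std::size_t addr) {
        auto it = shadow_.find(addr);
        if (it == shadow_.end())
            it = shadow_.emplace(addr, Shadow{VectorClock(n_, 0), VectorClock(n_, 0)}).first;
        return it->second;
    }

    std::size_t n_;
    std::vector<VectorClock> clocks_;
    std::unordered_map<std::size_t, Shadow> shadow_;
};
```

For example, a write by thread 0 followed by an unsynchronized read of the same address by thread 1 is reported as a race. With `barrier()` called between them, no race is reported. Every check here costs time proportional to the thread count, and each address stores two full vectors. That hints at why a naive design struggles at the million-thread scale the abstract describes.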


Published in

ACM SIGPLAN Notices, Volume 52, Issue 6
PLDI '17
June 2017
708 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3140587

PLDI 2017: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2017
708 pages
ISBN: 9781450349888
DOI: 10.1145/3062341

Copyright © 2017 Owner/Author

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States
