
Locality-Aware CTA Clustering for Modern GPUs

Published: 04 April 2017

Abstract

Caches are designed to exploit locality, yet the role of the on-chip L1 data cache on modern GPUs is often awkward: locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly shared L2 cache, which has long access latency, while in-core locality, which is crucial for performance delivery, is handled explicitly through user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is challenging because its hardware feasibility is unclear, the underlying CTA scheduler is undocumented and inaccessible, and the in-core cache capacity is small. To address these issues, we first conduct a thorough empirical exploration of various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, in the L1 or unified L1/Tex cache. Through a further quantification process, we show that such locality is both significant and common among GPU applications, and we discuss when the reuse is exploitable. Leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques, which reshape the default CTA scheduling so that CTAs with potential reuse are grouped together on the same SM. Our techniques require no hardware modification and can be deployed directly on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures. The results show that our proposed techniques significantly improve cache performance, reducing L2 cache transactions by 55%, 65%, 29%, and 28% on average for Fermi, Kepler, Maxwell, and Pascal, respectively, and delivering average speedups of 1.46x, 1.48x, 1.45x, and 1.41x (up to 3.8x, 3.6x, 3.1x, and 3.3x) for applications with algorithm-related inter-CTA reuse.
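To make the clustering idea concrete, below is a minimal CUDA sketch of one software-only way to group CTAs: block agglomeration, where each physical thread block executes several consecutive logical CTA ids back-to-back, so logical CTAs that share data run on the same SM and can hit in its L1/Tex cache. This is an illustration of the general technique under stated assumptions, not the paper's actual implementation; CLUSTER_SIZE, logical_cta_work, clustered_kernel, and the doubling computation are hypothetical placeholders.

#include <cstdio>
#include <cuda_runtime.h>

// Number of logical CTAs agglomerated into one physical CTA (illustrative, tunable).
#define CLUSTER_SIZE 4

// Original per-CTA body, with blockIdx.x replaced by a logical block id.
__device__ void logical_cta_work(int logicalBlockId, const float* in, float* out, int n)
{
    int i = logicalBlockId * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];  // placeholder computation
}

__global__ void clustered_kernel(const float* in, float* out, int n)
{
    // Physical CTA b runs logical CTAs b*CLUSTER_SIZE .. b*CLUSTER_SIZE + CLUSTER_SIZE - 1.
    // Because they all execute on the same SM, any data reuse among them can be
    // served by that SM's L1/Tex cache rather than the shared L2.
    for (int k = 0; k < CLUSTER_SIZE; ++k) {
        int logical = blockIdx.x * CLUSTER_SIZE + k;
        logical_cta_work(logical, in, out, n);
        __syncthreads();  // keep the cluster's logical CTAs ordered within the block
    }
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int logicalBlocks  = (n + threads - 1) / threads;
    const int physicalBlocks = (logicalBlocks + CLUSTER_SIZE - 1) / CLUSTER_SIZE;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));  // give the input defined contents

    clustered_kernel<<<physicalBlocks, threads>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    printf("launched %d physical CTAs covering %d logical CTAs\n",
           physicalBlocks, logicalBlocks);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Note the trade-off this sketch exposes: a larger CLUSTER_SIZE keeps more reuse resident on one SM but launches fewer physical CTAs for the hardware scheduler to balance across SMs, so the cluster size generally needs tuning per application and architecture.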



Published in

ACM SIGPLAN Notices, Volume 52, Issue 4 (ASPLOS '17), April 2017, 811 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3093336

Also in: ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, April 2017, 856 pages
ISBN: 9781450344654
DOI: 10.1145/3037697

Copyright © 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher: Association for Computing Machinery, New York, NY, United States
