Abstract
Caches are designed to exploit locality, yet the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the shared L2 cache, which has long access latency, while in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these issues, we first conduct a thorough empirical exploration of various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, in the L1 or L1/Tex unified cache. Through further quantification, we demonstrate the significance and commonality of such locality among GPU applications and discuss whether such reuse is exploitable. Leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques, which reshape the default CTA scheduling to group CTAs with potential reuse together on the same SM. Our techniques require no hardware modification and can be directly deployed on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures.
The results show that our proposed techniques significantly improve cache performance, reducing L2 cache transactions by 55%, 65%, 29% and 28% on average for Fermi, Kepler, Maxwell and Pascal, respectively, and delivering average speedups of 1.46x, 1.48x, 1.45x and 1.41x (up to 3.8x, 3.6x, 3.1x and 3.3x) for applications with algorithm-related inter-CTA reuse.
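The core idea of CTA-Clustering can be illustrated with a minimal host-side simulation. This is a hypothetical sketch, not the paper's actual implementation: the default hardware scheduler is approximated as round-robin distribution of CTAs across SMs, so consecutive CTAs (which often touch overlapping data) rarely share an L1 cache, whereas clustering remaps CTA IDs so that each group of `cluster` consecutive CTAs is steered to the same SM. All function names and the `cluster` parameter are illustrative.

```python
def round_robin(num_ctas, num_sms):
    """Approximate default scheduling: CTA i -> SM (i mod num_sms)."""
    return {cta: cta % num_sms for cta in range(num_ctas)}

def clustered(num_ctas, num_sms, cluster):
    """Clustered scheduling: each run of `cluster` consecutive CTAs
    lands on the same SM, so inter-CTA reuse can be caught in its L1."""
    return {cta: (cta // cluster) % num_sms for cta in range(num_ctas)}

if __name__ == "__main__":
    default = round_robin(16, 4)
    grouped = clustered(16, 4, cluster=4)
    # Under the default policy, neighboring CTAs 0 and 1 run on different SMs;
    # under clustering they are co-located on SM 0.
    print(default[0], default[1])   # 0 1
    print(grouped[0], grouped[1])   # 0 0
```

On real hardware the remapping must be done in software (e.g. by having each CTA recompute its logical ID from a cluster-aware index), since the hardware CTA scheduler is not user-accessible; this sketch only captures the resulting CTA-to-SM placement.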
Locality-Aware CTA Clustering for Modern GPUs. In ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems.