Abstract
Caches are designed to exploit locality, yet the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the shared L2 cache, which has long access latency, while in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these issues, we first conduct a thorough empirical exploration of various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, in the L1 or L1/Tex unified cache. Through further quantification, we demonstrate the significance and commonality of such locality among GPU applications and discuss whether such reuse is exploitable. Leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques, which reshape the default CTA scheduling to group CTAs with potential reuse together on the same SM. Our techniques require no hardware modification and can be directly deployed on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures.
The results show that our proposed techniques significantly improve cache performance, reducing L2 cache transactions by 55%, 65%, 29% and 28% on average for Fermi, Kepler, Maxwell and Pascal, respectively, and delivering average speedups of 1.46x, 1.48x, 1.45x and 1.41x (up to 3.8x, 3.6x, 3.1x and 3.3x) for applications with algorithm-related inter-CTA reuse.
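The core idea of CTA-Clustering can be illustrated with a minimal host-side simulation. This is a hypothetical sketch, not the paper's actual implementation: the default hardware scheduler is approximated as round-robin distribution of CTAs across SMs, so consecutive CTAs (which often touch overlapping data) rarely share an L1 cache, whereas clustering remaps CTA IDs so that each group of `cluster` consecutive CTAs is steered to the same SM. All function names and the `cluster` parameter are illustrative.

```python
def round_robin(num_ctas, num_sms):
    """Approximate default scheduling: CTA i -> SM (i mod num_sms)."""
    return {cta: cta % num_sms for cta in range(num_ctas)}

def clustered(num_ctas, num_sms, cluster):
    """Clustered scheduling: each run of `cluster` consecutive CTAs
    lands on the same SM, so inter-CTA reuse can be caught in its L1."""
    return {cta: (cta // cluster) % num_sms for cta in range(num_ctas)}

if __name__ == "__main__":
    default = round_robin(16, 4)
    grouped = clustered(16, 4, cluster=4)
    # Under the default policy, neighboring CTAs 0 and 1 run on different SMs;
    # under clustering they are co-located on SM 0.
    print(default[0], default[1])   # 0 1
    print(grouped[0], grouped[1])   # 0 0
```

On real hardware the remapping must be done in software (e.g. by having each CTA recompute its logical ID from a cluster-aware index), since the hardware CTA scheduler is not user-accessible; this sketch only captures the resulting CTA-to-SM placement.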
Locality-Aware CTA Clustering for Modern GPUs. In ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems.