SIMD divergence optimization through intra-warp compaction (ISCA '13)

ABSTRACT
SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications.
Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. We also outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs, or by 18% for future GPUs with a better-provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
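The core idea behind basic cycle compression can be illustrated with a minimal sketch: when a warp is wider than the physical SIMD unit, one instruction issues over several execution cycles, each covering one contiguous lane group, and any cycle whose lanes are all turned off by divergence can be skipped. The function name, warp width, and SIMD width below are illustrative assumptions, not the paper's exact hardware parameters.

```python
# Hedged sketch of basic cycle compression (BCC). A warp wider than the
# physical ALU executes one instruction over warp_width // simd_width
# cycles, each covering one contiguous lane group. BCC skips any cycle
# whose entire lane group is predicated off due to branch divergence.
# All names and widths here are illustrative assumptions.

def execution_cycles_bcc(active_mask, warp_width=32, simd_width=8):
    """Return (baseline, compressed) cycle counts for one instruction."""
    baseline = warp_width // simd_width        # fixed cycle count without BCC
    compressed = 0
    for c in range(baseline):
        group = active_mask[c * simd_width:(c + 1) * simd_width]
        if any(group):                         # at least one live lane:
            compressed += 1                    # the cycle must still issue
        # else: all lanes in this group are off -> BCC elides the cycle

    return baseline, compressed

# Divergent mask: only the first 8 lanes active (e.g., one branch path)
mask = [True] * 8 + [False] * 24
base, comp = execution_cycles_bcc(mask)      # base = 4, comp = 1
```

SCC goes one step further than this sketch: it swizzles (rearranges) lanes across cycles so that partially-full lane groups can also be combined, at the cost of extra operand-routing support.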
Index Terms
- SIMD divergence optimization through intra-warp compaction