SIMD divergence optimization through intra-warp compaction (ISCA '13)

ABSTRACT
SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications.
Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. We also outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs, or by 18% for future GPUs with a better-provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
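The core idea behind basic cycle compression can be illustrated with a minimal sketch: when a warp is wider than the physical SIMD unit, one instruction issues over several execution cycles, each covering one contiguous lane group, and any cycle whose lanes are all turned off by divergence can be skipped. The function name, warp width, and SIMD width below are illustrative assumptions, not the paper's exact hardware parameters.

```python
# Hedged sketch of basic cycle compression (BCC). A warp wider than the
# physical ALU executes one instruction over warp_width // simd_width
# cycles, each covering one contiguous lane group. BCC skips any cycle
# whose entire lane group is predicated off due to branch divergence.
# All names and widths here are illustrative assumptions.

def execution_cycles_bcc(active_mask, warp_width=32, simd_width=8):
    """Return (baseline, compressed) cycle counts for one instruction."""
    baseline = warp_width // simd_width        # fixed cycle count without BCC
    compressed = 0
    for c in range(baseline):
        group = active_mask[c * simd_width:(c + 1) * simd_width]
        if any(group):                         # at least one live lane:
            compressed += 1                    # the cycle must still issue
        # else: all lanes in this group are off -> BCC elides the cycle

    return baseline, compressed

# Divergent mask: only the first 8 lanes active (e.g., one branch path)
mask = [True] * 8 + [False] * 24
base, comp = execution_cycles_bcc(mask)      # base = 4, comp = 1
```

SCC goes one step further than this sketch: it swizzles (rearranges) lanes across cycles so that partially-full lane groups can also be combined, at the cost of extra operand-routing support.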
Index Terms
- SIMD divergence optimization through intra-warp compaction