DOI: 10.1145/2485922.2485954
Research article

SIMD divergence optimization through intra-warp compaction

Published: 23 June 2013

ABSTRACT

SIMD execution units in GPUs are increasingly used for high-performance and energy-efficient acceleration of general purpose applications. However, SIMD control flow divergence can reduce execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications.

Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. We also outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs or by 18% for future GPUs with a better provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
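The mechanisms themselves are micro-architectural, but the intuition behind cycle compression can be illustrated with a toy software model (an assumption for illustration, not the paper's actual hardware): suppose a SIMD-16 instruction issues over four cycles, each cycle driving a group of four lanes. BCC-style compression skips any cycle whose entire lane group is masked off; SCC-style compression first permutes (swizzles) lanes so that active lanes pack into fewer groups.

```python
# Toy model of cycle compression under a divergence mask.
# Assumption: a SIMD-16 instruction executes over 4 cycles of 4 lanes each.
GROUP = 4  # lanes executed per cycle in this model

def cycles_baseline(mask):
    """Without compression, every lane group costs a cycle."""
    return len(mask) // GROUP

def cycles_bcc(mask):
    """Basic cycle compression: skip groups with no active lanes."""
    groups = [mask[i:i + GROUP] for i in range(0, len(mask), GROUP)]
    return sum(1 for g in groups if any(g))

def cycles_scc(mask):
    """Swizzled-cycle compression: pack active lanes together first,
    so the cycle count depends only on the number of active lanes."""
    active = sum(mask)
    return -(-active // GROUP)  # ceiling division

# Divergent mask: 5 active lanes scattered across all four groups.
mask = [1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  1, 0, 0, 1]
print(cycles_baseline(mask))  # 4
print(cycles_bcc(mask))       # 4 (every group still has an active lane)
print(cycles_scc(mask))       # 2 (5 active lanes fit into two groups)
```

The example also shows why SCC subsumes more cases than BCC: BCC only helps when inactive lanes happen to align with group boundaries, whereas swizzling makes the saving depend only on the active-lane count.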


Published in

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
June 2013, 686 pages
ISBN: 9781450320795
DOI: 10.1145/2485922

Also in: ACM SIGARCH Computer Architecture News, Volume 41, Issue 3 (ISCA '13), June 2013, 666 pages
ISSN: 0163-5964
DOI: 10.1145/2508148

Copyright © 2013 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

ISCA '13 paper acceptance rate: 56 of 288 submissions (19%). Overall acceptance rate: 543 of 3,203 submissions (17%).
