DOI: 10.1145/3079856.3080231
Research article · ISCA Conference Proceedings

MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability

Published: 24 June 2017

ABSTRACT

Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, many domains continue to demand higher-performing GPUs. To address this need, we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continued performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs) and integrating them on package using high-bandwidth, power-efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize sensitivity to inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves a 22.8% speedup and a 5x reduction in inter-GPM bandwidth compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly, we show that our optimized MCM-GPU is 26.8% faster than an equally equipped multi-GPU system with the same total number of SMs and DRAM bandwidth.
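The abstract reports each speedup relative to a different baseline (basic MCM-GPU, largest buildable monolithic GPU, multi-GPU system). As an illustrative back-of-the-envelope, not a figure from the paper itself, the sketch below normalizes all of the stated numbers to one common baseline: the largest implementable monolithic GPU = 1.0.

```python
# Normalize the abstract's relative speedups to a single baseline.
# All values are derived from the stated percentages; the absolute
# scale (baseline = 1.0) is an assumption for illustration only.

baseline = 1.0                      # largest buildable monolithic GPU
optimized_mcm = baseline * 1.455    # optimized MCM-GPU: 45.5% faster
basic_mcm = optimized_mcm / 1.228   # optimized is 22.8% over the basic MCM-GPU
multi_gpu = optimized_mcm / 1.268   # optimized is 26.8% over the multi-GPU system
# The optimized design performs "within 10%" of the hypothetical
# monolithic GPU, so that unbuildable design is at most about:
hypothetical_monolithic = optimized_mcm / 0.9

for name, perf in [
    ("largest buildable monolithic GPU", baseline),
    ("multi-GPU (same SMs / DRAM BW)", multi_gpu),
    ("basic MCM-GPU", basic_mcm),
    ("optimized MCM-GPU", optimized_mcm),
    ("hypothetical monolithic (upper bound)", hypothetical_monolithic),
]:
    print(f"{name:<40s} {perf:.3f}x")
```

One takeaway from this normalization is that even the unoptimized MCM-GPU already outperforms the largest buildable monolithic GPU; the three locality optimizations widen that gap further.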


Published in
        ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture
        June 2017
        736 pages
        ISBN:9781450348928
        DOI:10.1145/3079856

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

Acceptance Rates

ISCA '17: 54 of 322 submissions accepted (17%). Overall ISCA acceptance rate: 543 of 3,203 submissions (17%).
